Wednesday, 18 March 2009

R to S+ Package Ports

Lately I've been doing a lot of experimental package ports from R to S+. I say "experimental" in that there are a few of them underway but not yet completed.

I'm asked regularly how hard it'll be to port a particular package from R to S+. The general answer is that the basic mechanics of it are easy. If the R code is basically using things already in S+ and any C/Fortran code is just working on arrays, then things may port with few changes.

Some items that make a port more difficult are:
  1. Extensive usage of functions that aren't available in S+ and aren't easily ported to S+. The main place this has come up in code I've seen lately is usage of the "grid" graphics.
  2. Usage of more advanced C macros to manipulate R objects at the C level. That is, code using .Call() rather than .C().
  3. The only item that's a real showstopper is use of external pointers with a finalizer. S+ doesn't have a way of calling a C function when and S object is released to do finalization, so you can't do things like having an S object with a reference to a Java object. There's no way to know when to free the Java object. I'm still trying to figure out a workaround for this.

To do R to S+ ports, the first step is to get set up to build S+ packages. This is described in the "Guide to Packages" included with S+. On Linux, you'll be good to go if you have S+ installed and standard tools such as perl, gcc, and gfortran. On Windows, it's a bit more involved.

I've had a pro-Windows bias for many years but I've recently switched to doing the ports on Linux. The main reason is I no longer have a copy of "Visual Fortran", which is required to build Fortran code on Windows. Perhaps I'll get a copy of this installed, or perhaps I'll stick with Linux.

The basic steps involved for a port are:

  1. Put the files in the standard structure for an S+ package. This matches the structure of an R package, so if you are starting with the R package source you can just unzip it.
  2. Modify the DESCRIPTION file to adjust the package dependencies, e.g. add a "DependsSplus" line that's referring to S+ packages rather than R packages.
  3. Run "Splus CMD build -binary [pkgname]" from an OS command line. You'll repeat this over and over. Ideally things will build right away. If not you'll need to modify the source code until it does. For this listing, let's assume the S code is syntactically correct and the C code compiles so we can proceed to the next step. If the C code is failing, move it aside until the S code has been fixed up.
  4. In a separate window start S-PLUS. Use "library(pkgutils)" to load the package utility functions.
  5. Use "unresolvedGlobalReferences([R code dir name])" to get a list of objects that will not be found under S+ scoping rules. This is an invaluable tool. The objects not found are usually either misspelled object/function names, functions available in R and not in S+, or local variables that need to be explicitely passed to inner functions. The next step is to modify the S code to resolve the missing references.
  6. The first step I take on resolving the references is to check which are references to R functions not in S+. Then I put in stub functions that just call "stop()".
  7. The second step is I go through the code fixing misspellings and modifying calls to anonymous functions used in "apply()" to explicitely pass values that are used in the inner functions.
  8. The third step of changes related to object scoping is to change assign() statements so that instead of assigning to ".GlobalEnv" they assign to "where=1" when the intent is to maintain a global variable. Potentially you can store global objects in "frame=0" instead, but it isn't garbage collected very aggressively so this can lead to memory buildup.
  9. At this point in theory the S code builds, scoping problems are fixed, and we've identified missing functions. Now the missing functions need to either be implemented or replaced with calls to other S+ functions.
  10. If the C code was failing to compile, move it back into place and fix the problems in the C code. This can be either easy or horribly hard depending on how complicated the code is.
  11. Now you're ready to test functionality using examples from the help files. At this point you'll identify differences in behavior or arguments between R and S+ functions of the same name.
  12. Repeat until everything works.

So I'm starting to get a routine in place. The only part I find difficult is the C stuff, but that's because I don't do a lot of C programming and I get rusty between uses.

No comments:

Post a Comment