Wednesday 18 March 2009

R to S+ Package Ports

Lately I've been doing a lot of experimental package ports from R to S+. I say "experimental" in the sense that several of them are underway but none is yet complete.

I'm asked regularly how hard it'll be to port a particular package from R to S+. The general answer is that the basic mechanics of it are easy. If the R code is basically using things already in S+ and any C/Fortran code is just working on arrays, then things may port with few changes.

Some items that make a port more difficult are:
  1. Extensive usage of functions that aren't available in S+ and aren't easily ported to S+. The main place this has come up in code I've seen lately is usage of the "grid" graphics.
  2. Usage of more advanced C macros to manipulate R objects at the C level. That is, code using .Call() rather than .C().
  3. The only item that's a real showstopper is use of external pointers with a finalizer. S+ doesn't have a way of calling a C function to do finalization when an S object is released, so you can't do things like having an S object hold a reference to a Java object: there's no way to know when to free the Java object. I'm still trying to figure out a workaround for this.
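To make the finalizer idea concrete for readers coming from other languages: what's missing in S+ is a hook that runs when an object is collected, so the foreign resource can be freed at that moment. Python's weakref module provides exactly this hook; here's a minimal sketch (the JavaHandle class is an invented stand-in, not a real bridge):

```python
import weakref

released = []

class JavaHandle:
    """Invented stand-in for a wrapper around a foreign (e.g. Java) object."""
    def __init__(self, name):
        self.name = name

h = JavaHandle("conn1")
# Register a callback that fires when h is garbage collected.
# This is the hook that R's external pointers provide and S+ lacks.
weakref.finalize(h, released.append, "conn1")
del h  # in CPython, refcounting collects h here and runs the finalizer
```

Without such a hook, the only options are explicit close() calls by the user or leaking the foreign object.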

To do R to S+ ports, the first step is to get set up to build S+ packages. This is described in the "Guide to Packages" included with S+. On Linux, you'll be good to go if you have S+ installed and standard tools such as perl, gcc, and gfortran. On Windows, it's a bit more involved.

I've had a pro-Windows bias for many years but I've recently switched to doing the ports on Linux. The main reason is I no longer have a copy of "Visual Fortran", which is required to build Fortran code on Windows. Perhaps I'll get a copy of this installed, or perhaps I'll stick with Linux.

The basic steps involved for a port are:

  1. Put the files in the standard structure for an S+ package. This matches the structure of an R package, so if you are starting with the R package source you can just unzip it.
  2. Modify the DESCRIPTION file to adjust the package dependencies, e.g. add a "DependsSplus" line that refers to S+ packages rather than R packages.
  3. Run "Splus CMD build -binary [pkgname]" from an OS command line. You'll repeat this over and over. Ideally things will build right away; if not, you'll need to modify the source code until they do. For this listing, let's assume the S code is syntactically correct and the C code compiles so we can proceed to the next step. If the C code is failing, move it aside until the S code has been fixed up.
  4. In a separate window start S-PLUS. Use "library(pkgutils)" to load the package utility functions.
  5. Use "unresolvedGlobalReferences([R code dir name])" to get a list of objects that will not be found under S+ scoping rules. This is an invaluable tool. The objects not found are usually misspelled object/function names, functions available in R but not in S+, or local variables that need to be explicitly passed to inner functions. The next step is to modify the S code to resolve the missing references.
  6. My first step in resolving the references is to identify which are calls to R functions that don't exist in S+. For those I put in stub functions that just call "stop()".
  7. Second, I go through the code fixing misspellings and modifying calls to anonymous functions used in "apply()" to explicitly pass values that are used in the inner functions.
  8. Third, for the remaining scoping issues, I change assign() statements so that instead of assigning to ".GlobalEnv" they assign with "where=1" when the intent is to maintain a global variable. Potentially you can store global objects in "frame=0" instead, but frame 0 isn't garbage collected very aggressively, so this can lead to memory buildup.
  9. At this point in theory the S code builds, scoping problems are fixed, and we've identified missing functions. Now the missing functions need to either be implemented or replaced with calls to other S+ functions.
  10. If the C code was failing to compile, move it back into place and fix the problems in the C code. This can be either easy or horribly hard depending on how complicated the code is.
  11. Now you're ready to test functionality using examples from the help files. At this point you'll identify differences in behavior or arguments between R and S+ functions of the same name.
  12. Repeat until everything works.
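As a toy illustration of steps 5 and 6, here's a Python sketch of the idea behind finding unresolved references and generating stop() stubs. This is a crude regex approximation for illustration only, not what pkgutils actually does, and the names are invented:

```python
import re

def find_unresolved_calls(s_code, known=frozenset()):
    """Crude approximation of unresolvedGlobalReferences(): collect names
    used as function calls that are neither defined in the source nor in
    a 'known' set of functions already available in S+."""
    defined = set(re.findall(r'"?([\w.]+)"?\s*<-\s*function', s_code))
    called = set(re.findall(r'([A-Za-z.][\w.]*)\s*\(', s_code))
    return sorted(called - defined - set(known) - {"function"})

def make_stubs(names):
    """Step 6: emit stop() stubs for the missing functions."""
    return "\n".join(
        f'{n} <- function(...) stop("{n} not yet ported to S+")'
        for n in names)

s_code = 'f <- function(x) grid.newpage(x)'
missing = find_unresolved_calls(s_code)
stubs = make_stubs(missing)
```

The stubs make the package loadable immediately, so you can exercise the parts that do work while the missing functions are implemented one at a time.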

So I'm starting to get a routine in place. The only part I find difficult is the C stuff, but that's because I don't do a lot of C programming and I get rusty between uses.

SPSS, Python, and R

I haven't really kept up with SPSS over the years. It turns out they've embraced both Python and R for SPSS programming. Here's a blog posting on why they chose Python.

I'm cooling a little on Python at the moment as I haven't come across an opportunity to use it professionally (yet). However, I think it's interesting that SPSS came to the same conclusion as I did regarding its suitability as a scripting language for the statistical audience.

Wednesday 11 March 2009

Resolver One 1.4: IronPython 2, Numpy, and R

Resolver One is an IronPython based spreadsheet. I'm still getting my head around whether there's a big enough market for an Excel replacement to support a company, but in any case it's very cool.

They have just released Resolver One 1.4, which is based on IronPython 2 and includes Numpy support.

They are also trumpeting R integration. This is done using R(D)COM accessed from IronPython. Now if we just had a Mono implementation of R(D)COM to get this approach working on Linux and Mac...

Tuesday 10 March 2009

Creating a Linux VM with EasyVMX and VMWare Player

When working out of the Seattle office of Insightful I had easy access to a wide range of Linux/UNIX boxes on the network. Hence I never got around to installing Linux for my own use.

The network access between Basingstoke and Seattle has gotten a little challenging with various IT changes, so it's time to get a local version of Linux on my machine. While I can imagine using a dual-boot configuration someday, at this point a virtual machine meets my needs.

It turns out that it's pretty straightforward to create a Linux virtual machine that runs on VMware Player. The pieces needed are:

  1. The EasyVMX web site to create the virtual machine description and virtual hard drive files.
  2. The VMware Player application.
  3. The Linux installation disk. I used the LiveISO version of Fedora 10.

With EasyVMX you just specify the OS type, RAM size, hard disk size, ISO file name, and various other optional settings. This then creates a zip with files representing the virtual hardware configuration for the requested specifications. If you specified the ISO file name, the machine will have that ISO loaded on startup.
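For reference, the generated ".vmx" file is just a plain-text hardware description. A fragment sketching the kind of settings involved (the values here are illustrative, not exact EasyVMX output):

```
memsize = "512"
displayName = "Fedora10"
guestOS = "otherlinux"

scsi0:0.present = "TRUE"
scsi0:0.fileName = "Fedora10.vmdk"

ide1:0.present = "TRUE"
ide1:0.fileName = "Fedora-10-i686-Live.iso"
ide1:0.deviceType = "cdrom-image"
```

The "fileName" entries point at the virtual disk created alongside the .vmx and at the install ISO, which is why specifying the ISO name up front gets it mounted at first boot.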

Download, save, and unzip the zip file. Then open the "vmx" file with VMware Player. The LiveISO version of Fedora will start. This includes an option to install the software to the hard drive, which I did.

After you've installed the software, it'll prompt you to install a bunch of updates. I let it install all of them so it wouldn't bother me about them in the future.

Then you can install additional packages with the "yum" utility. There's a GUI application for software updates, but I went for the command line version. To see whether packages are available for a topic, use "yum list" with "grep":


     yum list | grep R 

To install a package, use "yum install":


     yum install R 

To install S+, download the images from the TIBCO download site and follow the standard S+ installation procedures.

Friday 6 March 2009

Dynamic Languages: R with Python

Previously I've discussed using Python as a primary general-purpose language calling S as a statistics and graphics engine. The capabilities for doing this have been improving over the years. First there was RSPython by Duncan Temple Lang, then rpy by Greg Warnes, and most recently rpy2 by Laurent Gautier.

Here's an example using rpy2 for principal components analysis:


import rpy2.robjects as robjects

r = robjects.r  # gateway to the embedded R interpreter

m = r.matrix(r.rnorm(100), ncol=5)  # 20x5 matrix of random normals
pca = r.princomp(m)
r.plot(pca, main="Eigen values")    # screeplot of the component variances
r.biplot(pca, main="biplot")




It's pretty readable for an S programmer, so this looks promising.

Unfortunately it doesn't currently seem to work with IronPython via IronClad. Perhaps someday...

Dynamic Languages: Python

Over the past few years I've thought a lot about how S compares with other programming languages. As discussed in a previous post, S is head and shoulders above general purpose programming languages in terms of the built-in statistical and graphics routines.

Having said that, S is a bit threadbare in terms of general-purpose infrastructure routines. I've spent entirely too much of my life writing S wrappers of Java routines to do things like:

  • Zip and unzip files

  • Create and parse XML

  • HTTP client operations


Various packages to do these sorts of things are around, but it's fair to say that the implementations are typically not as full featured as one would have in Java, C#, or Python. Also, some poor soul probably had to do the work of writing C code to interface R to a C library with the capabilities.
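For contrast, two of those bread-and-butter chores with Python's standard library alone, no Java bridge or C glue required (a quick sketch; the file and element names are made up):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Zip and unzip, here done entirely in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("readme.txt", "hello from the archive")
with zipfile.ZipFile(buf) as z:
    text = z.read("readme.txt").decode()

# Create and parse XML.
root = ET.Element("config")
ET.SubElement(root, "option", name="verbose").text = "true"
parsed = ET.fromstring(ET.tostring(root))
verbose = parsed.find("option").text
```

HTTP client operations are similarly built in via urllib. The point isn't that these are hard anywhere; it's that in Python they come in the box rather than via a foreign-function bridge.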

In a recent product development cycle we wanted to have a modern Windows GUI written in C# talking to a server written in Java that called a graphics system written in S. Oh, and the graphics device used C to create a graph file as well as Java to generate things like PDF from that file. It sounds a little crazy, but was driven by the desire to have a Windows rich client with a cross-platform server plus extensibility in S.

This meant jumping back and forth between multiple IDEs and some tricky debugging. It also meant essentially implementing some things twice: once in Java and once in C#. It left me longing for a system that could meet the following requirements without so many languages:
  • Cross-platform in terms of both operating systems and virtual machines. Runs on Windows, UNIX, Linux, perhaps Mac. Runs in the .NET CLR and the Java VM.

  • Has a rich standard library for string manipulation, network operations, etc. That is, it can do the bread-and-butter stuff needed by programs.

  • Uses an interpreted, loosely typed language that provides the same productivity benefits as S. Approachable and quick to learn for S programmers so that perhaps they could use it rather than S for extensibility.

Basically I wanted to have a single language that I could use to do .NET programming for the client, Java programming for the server, and ideally the type of programming one typically does in S.

The language that rose to the top was Python. The standard Python implementation is CPython, which has been ported to oodles of platforms. There's also IronPython, which runs on the .NET CLR, and Jython, which runs on the Java VM. With Mono and the Dynamic Language Runtime (DLR) it even runs on Android (Google's OS for cell phones).

I looked at bunches of other dynamic languages including Ruby, F#, Groovy, etc. None really had the right "feel" in terms of being something approachable to S programmers. Python has a sensible syntax that's easy to learn and follow. The most common complaint against it is the use of indentation as a flow-control mechanism rather than braces, and of course that an interpreted language typically isn't as quick as well-written C code.

The main limitation of IronPython and Jython is that while they can run pure Python modules they can't handle modules using C, with the most requested one being numpy. The guys at Resolver Systems are working to remedy this for IronPython via the IronClad project.

The big downside of Python by itself for S users is that it doesn't have the rich statistics and graphics available in S. So the thought of using Python alone rather than S is a non-starter. But I do think there's potential for using Python as a primary general-purpose language together with R as a statistics engine.

Not that I expect I'll be doing Python rather than C# and Java anytime soon, but perhaps as the DLR matures it'll become a viable option.

Dynamic Languages: R in Ruby

One of the "hot" languages in the past few years is Ruby. My impression actually is that Ruby may have peaked a bit with the buzz for "Ruby on Rails". The good ideas from "Rails" are getting implemented in bunches of other frameworks such as MonoRail, Grails, etc.

But hey, I may be wrong. For Ruby proponents interested in some statistics a cool project is RinRuby. This provides Ruby functionality for invoking R as an embedded statistics engine.

Here's an example creating a graph and printing a correlation value from Ruby:

tally = Hash.new(0)
File.open('gettysburg.txt').each_line do |line|
  line.downcase.split(/\W+/).each { |w| tally[w] += 1 }
end
total = tally.values.inject { |sum,count| sum + count }
tally.delete_if { |key,count| count < 3 || key.length < 4 }

require "rinruby"
R.keys, R.counts = tally.keys, tally.values

R.eval <<EOF
names(counts) <- keys
barplot(rev(sort(counts)),main="Frequency of Non-Trivial Words",las=2)
mtext("Among the #{total} words in the Gettysburg Address",3,0.45)
rho <- round(cor(nchar(keys),counts),4)
EOF

puts "The correlation between word length and frequency is #{R.rho}."



One of the cool things about RinRuby is that the implementation uses pure Ruby code, i.e. there's no C code behind the scenes. The Implementation Details explain:

RinRuby is a program which allows the user to run R commands from Ruby. An instance of R is created as a new object within Ruby, which allows R to remain open and running until the user closes the connection. There is no software that needs to be installed in R. Ruby sends data to R over TCP/IP network sockets, while commands and text are passed through the pipe. The pipe avoids compatibility issues on differing operating systems, platforms, and versions of Ruby and R, while the socket can handle large amounts of data quickly while avoiding rounding issues for doubles.

A benefit of this is that it is expected to work in JRuby and IronRuby.
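The point about rounding is easy to demonstrate: doubles serialized as formatted text can silently lose bits, while a binary channel (which is what RinRuby's socket carries) round-trips them exactly. In Python terms:

```python
import struct

x = 0.1 + 0.2  # 0.30000000000000004, not exactly 0.3

# A text transport at limited precision silently changes the value...
text_roundtrip = float("%.6f" % x)

# ...while packing and unpacking the raw 8-byte double is exact.
binary_roundtrip = struct.unpack("<d", struct.pack("<d", x))[0]
```

This is why the design puts commands on the pipe (where portability matters) but bulk numeric data on the socket (where fidelity and throughput matter).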

This installed easily for me but didn't detect the location of R. I probably don't have the location of R in my registry. As a testament to the benefits of a scripting language I could fix this by editing the appropriate line in the "rinruby.rb" file to hardcode the location of R for my machine.

It ran properly under the Windows console application "irb". It gave an error under the "fxri" application which provides a console along with help information. I suspect that when running in a GUI the "rinruby" code is fighting it out with the GUI regarding redirection of stdout. Or it might be that a Windows console application has the stdout pipe and a Windows GUI application doesn't.

Dynamic Languages: Why S?

In the distant past I used BASIC, Pascal, APL, and Fortran. Then I started using S as a grad student, and hence C in order to speed up some of the computations. In the Java 1.2.1 era I took up Java, and around the introduction of C# 3 started doing some .NET programming.

In the past few years I've thought a lot about what the unique strengths are of S implementations (by which I mean R and S+).

If you cast your mind back to the dawn of S, the general purpose language commonly available for scientific computing was Fortran. To make Fortran a little more productive, SLAC created MORTRAN and Bell Labs created RATFOR. A crazy new language called C was also gaining traction.

The systems available for statistical computing were procedural languages such as SAS and SPSS. (SAS was implemented in PL/I.)

So imagine yourself doing statistical analysis in that environment. You don't have any of these crazy point-and-click things that we have today. You're very likely used to submitting your job to the computing center, going away for a bit, and then dropping by to pick up the line printer output later. Or perhaps you've moved a step up the technology chain and now have an interactive terminal to a timesharing system so you can run your batch job and immediately get results back on the screen.

Pretty frustrating for exploratory data analysis, eh? Suddenly along comes S where you can much more rapidly analyze your data. You can read in some text, view some summaries, and graph the results as iterative steps in an analysis. Even better, you can create your own functions to perform particular analyses without the complexity of the write/compile/run cycle needed with C and Fortran.

S really was special at the time in terms of the additional productivity for exploratory data analysis compared to the alternatives.

I'm sort of glossing over the time differences between the release of Sv1, Sv2, and Sv3. S as we know it today is pretty much an evolutionary improvement on Sv3 as described in "The New S Language" (1988). Feel free to clarify or comment on inaccuracies in the blog comments.

Well, that's fine and dandy but that was 20 years ago and things change. How does S stack up today?

The primary strengths I see in S engine implementations are:
  • A loosely typed dynamic language is much more productive to program in than a compiled language such as C, Fortran, Java, or C#. You can find any number of articles by proponents of Python or Ruby making the case for dynamic languages as productivity tools. In fact, Microsoft has put a lot of energy into IronPython and IronRuby in the last few years, and better support for dynamic languages is one of the primary enhancements planned for C# 4.

  • The range of statistical techniques that are available is unparalleled. SAS is in the same league and probably still dominates for mixed effect models, but the S implementations are ahead in other areas such as MCMC. C# and Java really don't go very far in terms of statistical capabilities. Python does a little better with numpy and scipy but more in terms of techniques for engineering than for general statistics. In all honesty I don't know where Stata, SPSS, Statistica, etc. rank these days but my impression is they have strengths in focused areas without the breadth of S+ and R.

  • Strongly integrated with the statistics is a world-class presentation graphics system. The C#/Java/Excel world is full of charting packages to do simple scatter plots, bar charts, and pie charts, but that's pretty much as far as it goes. S has a much wider range of statistical graph types, such as boxplots, and the ability to customize the graphs extensively. I also think the graphs just downright look good compared to the output of many graphing packages. It's sort of like comparing an equation from LaTeX with one from the Word equation editor.

In all honesty, I don't think the first point is a differentiating strength of S anymore. Sure, S is still more productive than C and Fortran. I'm even willing to argue that it's better for statistical computing than other older interpreted languages such as Lisp, Scheme, and Smalltalk. But I don't think it's distinguished itself as superior to the likes of Python and Ruby in terms of the language itself.

Providing a dynamic language is no longer a differentiating feature that makes S special. What does make it special is that it still provides a wealth of statistical and graphical capabilities that aren't available in other languages. Putting it another way, it isn't the language that makes S valuable, it's the statistics and graphics class libraries.

The motivation for this discussion is that by acknowledging the strengths and weaknesses of S as a language one can start to explore combining S with other dynamic languages that are strong in areas where S is weaker. Future posts will elaborate on this topic.

S+ Tips: startsWith() and endsWith()

Whenever I go from using C# or Java back to using S, one of the annoyances is that the standard string methods available in most languages aren't around. That is, if one wants something more convenient than straight calls to grep().

I usually end up defining some wrappers to let me do "starts with" and "ends with" operations on single string values:


"strStartsWith" <- function(str, prefix){
    # Test whether a string starts with a particular prefix.
    # Case insensitive.
    if (is.null(str) || !is.character(str) || length(str) != 1){
        stop("The 'str' argument must be a single string value.")
    }
    (length(grep(paste("^", prefix, sep=""), str, ignore.case = TRUE)) > 0)
}

"strEndsWith" <- function(str, suffix){
    # Test whether a string ends with a particular suffix.
    # Case insensitive.
    if (is.null(str) || !is.character(str) || length(str) != 1){
        stop("The 'str' argument must be a single string value.")
    }
    (length(grep(paste(suffix, "$", sep=""), str, ignore.case=TRUE)) > 0)
}




To make these polished S functions it would be better to make them vectorized rather than throwing an error on vector input.
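For comparison, and since this series is about pairing S with other dynamic languages: Python has these checks built in as str.startswith() and str.endswith(), and a vectorized, case-insensitive version is a one-liner (a sketch with invented names):

```python
def str_starts_with(strings, prefix):
    """Case-insensitive 'starts with' over a list of strings."""
    p = prefix.lower()
    return [s.lower().startswith(p) for s in strings]

def str_ends_with(strings, suffix):
    """Case-insensitive 'ends with' over a list of strings."""
    x = suffix.lower()
    return [s.lower().endswith(x) for s in strings]
```

Lowercasing both sides sidesteps the regex-escaping issue that the grep() approach has when the prefix contains special characters.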

I considered implementing these by calling into Java, but went with the grep() approach so they could be used without loading Java. The downside of using Java is just that the startup time for a Java application is significantly slower than for a straight C application, and this in turn leads to slower startup of S+ when Java is loaded.

Another thing to note is that I used the name strStartsWith() rather than just startsWith(). When actually defining these in a package I'd really use a prefix common to all the functions in the package. For example, in the "fm" package it would be fmStartsWith(). I do this because S+ just has one big namespace, so I want to avoid using up commonly-used names that might have a different purpose in another package.