Wednesday 31 March 2010

R-LAMP Installation Information

I recently installed R with rapache on CentOS within a virtual machine, along with the rest of the LAMP stack.  I've written up the details on my new R-LAMP blog.

Monday 29 March 2010

Now with Mango Solutions AG

After 15 years working on S-PLUS, I've moved over to Mango Solutions.  I'm one of the founders of the Swiss subsidiary Mango Solutions AG, which is based in Basel.  Mango is a data analysis consulting firm that does a lot of work with R and Java, so the nature of my work hasn't changed dramatically.  


I'll update the "About the Author" box on the right whenever I manage to find the "Layout" tab in the blog management interface.  It seems to have gone missing for this blog.

Friday 2 October 2009

web2py

I have a new favorite web framework, among the many web frameworks that I've never had an opportunity to use in anger. web2py is a full-stack framework written in Python. It seems to be pretty comprehensive and easy to use.

I've started to look at it because I've written some Spotfire extensions that use R for computation and some people have expressed interest in a server version.

General criteria are:
  • Linux and Windows
  • Extensible by modifying text files
  • Accessible and understandable by a typical R user
  • Self-contained and easy to install
I first took a look at using either Rapache or biocep, which are my current favorite R server solutions. Admittedly, I haven't actually used either of them.

Since Rapache doesn't have a Windows version, it needs to be run within a VM. I really like the idea of using standard Apache server configuration procedures along with either standard R scripts or "brew" templates. But I lost patience with working in a Linux terminal within a VMPlayer window.

Biocep seems to be pretty rich and I've heard good things about it. But it's lacking in documentation. It also is focused on managing R engines rather than on a whole web stack. There's nothing wrong with that, it just isn't what I was looking for.

So after revisiting the R server options, I decided to look at other languages in which server development is more front-and-center. Since Linux is of interest, I set aside ASP.NET. As usual, Java offers a plethora of options, but most of them involve a lot of XML configuration and the learning curve is steep.

Turning back to the Python offerings, the web2py framework stands out as being self-contained, full-featured, and very approachable. As it's written and configured in Python, it also seems to be approachable to R programmers.

In my particular use case I need to invoke R batch jobs and do some file management operations from Spotfire, which is a .NET application. I like keeping simple things simple, so I favor using XML + HTTP for this. Perhaps it's too naive, but this keeps the client and server very much technology neutral. Authentication, encryption, etc. are then handled at the HTTP transport level via tried-and-true techniques.

web2py fits really well with this use case. You can define an XML-RPC service by simply adding a decorator to a Python function. The whole controller for my "RunR()" service is below:

from gluon.tools import Service
service = Service(globals())

# Entry point that dispatches incoming service requests
def call():
    return service()

def index():
    return dict()

# Expose RunR() as an XML-RPC method
@service.xmlrpc
def RunR(scriptText, dirKey):
    return "Script executed in directory " + dirKey

Of course, the RunR() function needs to do more than just return a string. My thinking is that it'll invoke "R CMD BATCH" using the Python "subprocess" module. This is essentially what I'm doing from Spotfire locally using the "Process" class in C#.
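Here is a minimal sketch of what that server-side step might look like. It is not the finished service: the function name run_r_batch, the file names script.R and script.Rout, and the "runner" parameter are all my own placeholders, and it assumes R is on the PATH. The "runner" parameter is only there so the command can be inspected without R installed.

```python
import os
import subprocess


def run_r_batch(script_text, work_dir, r_executable="R", runner=subprocess.call):
    """Write script_text into work_dir and run it with "R CMD BATCH".

    Returns (return_code, output_path). All names here are illustrative.
    """
    script_path = os.path.join(work_dir, "script.R")
    output_path = os.path.join(work_dir, "script.Rout")

    # Save the script text sent by the client to a file R can read.
    with open(script_path, "w") as handle:
        handle.write(script_text)

    # Usage is: R CMD BATCH [options] infile [outfile]
    command = [r_executable, "CMD", "BATCH", "--no-save",
               script_path, output_path]

    # Run in the job directory so relative paths in the script resolve there.
    return_code = runner(command, cwd=work_dir)
    return return_code, output_path
```

Passing the command as a list rather than a single string avoids shell quoting problems when the directory name contains spaces, which is common on Windows.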

It turns out that it's easy to implement an XML-RPC client in C#. The XML-RPC.NET assembly from Cook Computing makes it a simple matter of declaring an interface:

using CookComputing.XmlRpc;

namespace CalculationExample
{
    public interface IRManager : IXmlRpcProxy
    {
        [XmlRpcMethod]
        string RunR(string scriptText,
                    string sessionDirKey);
    }
}

You then use a factory method to construct an object of this type and work with it as a standard C# object.
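Since XML-RPC is technology neutral, the same call can be exercised without the .NET side. The sketch below uses only the Python 3 standard library: a throwaway stub server stands in for web2py, and a ServerProxy client calls RunR() on it. The stub, the helper names, and the "session42" key are assumptions for illustration; against a real web2py app the URL would point at the controller's call function instead.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def run_r(script_text, dir_key):
    # Stand-in for the server-side RunR(); it only echoes the directory key.
    return "Script executed in directory " + dir_key


def start_stub_server():
    # Port 0 asks the OS for any free port. This stub is not web2py.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(run_r, "RunR")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_run_r(url, script_text, dir_key):
    # The proxy turns attribute access into an XML-RPC method call.
    proxy = ServerProxy(url)
    return proxy.RunR(script_text, dir_key)


def demo():
    server = start_stub_server()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        return call_run_r(url, "print(1 + 1)", "session42")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The client never needs to know what language the server is written in, which is the main appeal of this approach for a Spotfire-to-R bridge.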

This is basically the point I've reached. On the client side, there's a bit of fairly easy C# work to have the code use the local or remote methods based on an option setting. On the server side, there's a bit of Python work with "subprocess" and file management methods. As I've done lots of C# and little to no Python, the first task is easy and the second is more challenging.

Goodbye www.insightful.com

I've recently noticed that the old Insightful web site has been taken down. Browsing to http://www.insightful.com now redirects to the Spotfire web site.

The good news is the Spotfire web site now lists all of the currently available S+ products. The bad news is all the other information that was up there is hard or impossible to find.

Happily, at least some of the content is available via the "Internet Archive Wayback Machine" at http://web.archive.org/web/*/http://www.insightful.com.

Wednesday 1 April 2009

RColorBrewer for S+

Given the work to make S+ colors compatible with R in S+ 8, it was trivial to port RColorBrewer to S+. I've submitted this to CSAN using the procedure described in the S+ 8 Guide to Packages. It'll be interesting to see what the lag time is to get this up on CSAN.

The motivation for the port is I'm doing more with the-code-formerly-known-as-ArrayAnalyzer lately, and the Bioconductor Case Studies book likes to use RColorBrewer in examples.

---------

HOW TO SUBMIT A PACKAGE TO CSAN

You can share your S-PLUS package with other users within your
department, company, or university. Just send them the package
archive, and have them install it with the INSTALL script or the
install.packages function (setting repos=NULL).

If you want to share your package with the entire S-PLUS community,
you can submit your package for inclusion in the Comprehensive S
Archive Network (CSAN). To submit a package, upload the source
package archive (the result of running Splus CMD build) to:

ftp://ftp.insightful.com/public/incoming/packages

Once you have uploaded your file, send a message to

packages@insightful.com

stating the name of the package archive you submitted.

Before submitting a package for inclusion in CSAN, it should pass the
check utility. Make sure these key fields in the DESCRIPTION file
have appropriate values: Package, Title, Version, Author,
Maintainer, and License. If any of these are missing, your package
will not be posted to CSAN.

Insightful will review your submitted package, run the check utility,
and create a Windows binary archive. If everything passes, the
package is posted to the CSAN site. Any problems with the package
are sent to the package submitter.
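
The key-field requirement in the guide text above is easy to check before uploading. This is my own rough sketch, not part of the Guide to Packages or the check utility: it assumes the DESCRIPTION file uses the usual "Field: value" layout with indented continuation lines, and the function names are made up.

```python
REQUIRED_FIELDS = ("Package", "Title", "Version",
                   "Author", "Maintainer", "License")


def parse_description(text):
    """Parse "Field: value" lines; indented lines continue the previous field."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            # Continuation of the previous field's value.
            fields[current] += " " + line.strip()
        elif ":" in line:
            name, value = line.split(":", 1)
            current = name.strip()
            fields[current] = value.strip()
    return fields


def missing_fields(text):
    """Return the required fields that are absent or empty, in guide order."""
    fields = parse_description(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

For example, calling missing_fields() on a DESCRIPTION that has every key field except License returns a list containing only "License".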

Wednesday 18 March 2009

R to S+ Package Ports

Lately I've been doing a lot of experimental package ports from R to S+. I say "experimental" in that there are a few of them underway but not yet completed.

I'm asked regularly how hard it'll be to port a particular package from R to S+. The general answer is that the basic mechanics of it are easy. If the R code is basically using things already in S+ and any C/Fortran code is just working on arrays, then things may port with few changes.

Some items that make a port more difficult are:
  1. Extensive usage of functions that aren't available in S+ and aren't easily ported to S+. The main place this has come up in code I've seen lately is usage of the "grid" graphics.
  2. Usage of more advanced C macros to manipulate R objects at the C level. That is, code using .Call() rather than .C().
  3. The only item that's a real showstopper is use of external pointers with a finalizer. S+ doesn't have a way of calling a C function when an S object is released to do finalization, so you can't do things like having an S object with a reference to a Java object. There's no way to know when to free the Java object. I'm still trying to figure out a workaround for this.

To do R to S+ ports, the first step is to get set up to build S+ packages. This is described in the "Guide to Packages" included with S+. On Linux, you'll be good to go if you have S+ installed and standard tools such as perl, gcc, and gfortran. On Windows, it's a bit more involved.

I've had a pro-Windows bias for many years but I've recently switched to doing the ports on Linux. The main reason is I no longer have a copy of "Visual Fortran", which is required to build Fortran code on Windows. Perhaps I'll get a copy of this installed, or perhaps I'll stick with Linux.

The basic steps involved for a port are:

  1. Put the files in the standard structure for an S+ package. This matches the structure of an R package, so if you are starting with the R package source you can just unzip it.
  2. Modify the DESCRIPTION file to adjust the package dependencies, e.g. add a "DependsSplus" line that's referring to S+ packages rather than R packages.
  3. Run "Splus CMD build -binary [pkgname]" from an OS command line. You'll repeat this over and over. Ideally things will build right away. If not you'll need to modify the source code until it does. For this listing, let's assume the S code is syntactically correct and the C code compiles so we can proceed to the next step. If the C code is failing, move it aside until the S code has been fixed up.
  4. In a separate window start S-PLUS. Use "library(pkgutils)" to load the package utility functions.
  5. Use "unresolvedGlobalReferences([R code dir name])" to get a list of objects that will not be found under S+ scoping rules. This is an invaluable tool. The objects not found are usually either misspelled object/function names, functions available in R and not in S+, or local variables that need to be explicitly passed to inner functions. The next step is to modify the S code to resolve the missing references.
  6. The first step I take on resolving the references is to check which are references to R functions not in S+. Then I put in stub functions that just call "stop()".
  7. The second step is to go through the code, fixing misspellings and modifying calls to anonymous functions used in "apply()" so that they explicitly pass the values used in the inner functions.
  8. The third step of changes related to object scoping is to change assign() statements so that instead of assigning to ".GlobalEnv" they assign to "where=1" when the intent is to maintain a global variable. Potentially you can store global objects in "frame=0" instead, but it isn't garbage collected very aggressively so this can lead to memory buildup.
  9. At this point in theory the S code builds, scoping problems are fixed, and we've identified missing functions. Now the missing functions need to either be implemented or replaced with calls to other S+ functions.
  10. If the C code was failing to compile, move it back into place and fix the problems in the C code. This can be either easy or horribly hard depending on how complicated the code is.
  11. Now you're ready to test functionality using examples from the help files. At this point you'll identify differences in behavior or arguments between R and S+ functions of the same name.
  12. Repeat until everything works.
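
The stub step (step 6) is mechanical enough to script. Here's a rough sketch, assuming you've already copied the names of the missing R functions into a list by hand. The helper names and the error message wording are my own, and it assumes the names are ordinary syntactic S names.

```python
def make_stub(name):
    # One S function definition that fails loudly when it is called.
    return '%s <- function(...) stop("%s is not available in S+")' % (name, name)


def make_stubs(names):
    # De-duplicate and sort so the generated file is stable between runs.
    unique_names = sorted(set(names))
    return "\n".join(make_stub(name) for name in unique_names) + "\n"


if __name__ == "__main__":
    # Hypothetical names taken from an unresolved-references listing.
    print(make_stubs(["unit", "grid.newpage", "unit"]))
```

The output can be saved as an extra source file in the package's R code directory, so each stub gets replaced by a real implementation as the port progresses.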

So I'm starting to get a routine in place. The only part I find difficult is the C stuff, but that's because I don't do a lot of C programming and I get rusty between uses.

SPSS, Python, and R

I haven't really kept up with SPSS over the years. It turns out they've embraced both Python and R for SPSS programming. Here's a blog posting on why they chose Python.

I'm cooling a little on Python at the moment as I haven't come across an opportunity to use it professionally (yet). However, I think it's interesting that SPSS came to the same conclusion as I did regarding its suitability as a scripting language for the statistical audience.