Friday 2 October 2009

web2py

I have a new favorite web framework, among the many web frameworks that I've never had an opportunity to use in anger. web2py is a full-stack framework written in Python. It seems to be pretty comprehensive and easy to use.

I've started to look at it because I've written some Spotfire extensions that use R for computation and some people have expressed interest in a server version.

General criteria are:
  • Linux and Windows
  • Extensible by modifying text files
  • Accessible and understandable by a typical R user
  • Self-contained and easy to install
I first took a look at using either Rapache or biocep, which are my current favorite R server solutions. Admittedly I haven't actually used either of them.

Since Rapache doesn't have a Windows version, it needs to be run within a VM. I really like the idea of using standard Apache server configuration procedures along with either standard R scripts or "brew" templates. But I lost patience with working in a Linux terminal within a VMware Player window.

Biocep seems to be pretty rich and I've heard good things about it. But it's lacking in documentation. It also is focused on managing R engines rather than on a whole web stack. There's nothing wrong with that, it just isn't what I was looking for.

So after revisiting the R server options, I decided to look at other languages in which server development is more front-and-center. Since Linux is of interest, I set aside ASP.NET. As usual, Java offers a plethora of options but there's bunches of XML configuration involved in most of them and the learning curve is steep.

Turning back to the Python offerings, the web2py framework stands out as being self-contained, full-featured, and very approachable. As it's written and configured in Python, it also seems to be approachable to R programmers.

In my particular use case I need to invoke R batch jobs and do some file management operations from Spotfire, which is a .NET application. I like keeping simple things simple, so I favor using XML + HTTP for this. Perhaps it's too naive, but this keeps the client and server very much technology neutral. Authentication, encryption, etc. are then handled at the HTTP transport level via tried-and-true techniques.

web2py fits really well with this use case. You can define an XML-RPC service by simply adding a decorator (Python's take on an annotation) to a Python function. The whole controller for my "RunR()" service is below:

from gluon.tools import Service
service = Service(globals())

# Entry point that dispatches incoming service requests.
def call():
    return service()

def index(): return dict()

# Expose RunR() to XML-RPC clients.
@service.xmlrpc
def RunR(scriptText, dirKey):
    return "Script executed in directory " + dirKey

Of course, the RunR() method needs to do more than just return a string. My thinking is that it'll invoke "R CMD BATCH" using the Python "subprocess" module. This is essentially what I'm doing from Spotfire locally using the "Process" class in C#.
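As a rough idea of what the server-side piece might look like, here is a minimal sketch of running a script through "R CMD BATCH" with "subprocess". It is not the finished RunR() implementation. The file names "script.R" and "script.Rout", the "--no-save" option, and the "runner" parameter are all assumptions made for illustration, and it presumes the R executable is on the PATH.

```python
import subprocess
from pathlib import Path


def build_r_batch_command(script_path, output_path, r_executable="R"):
    """Return the argument list for an 'R CMD BATCH' run (nothing is executed)."""
    return [r_executable, "CMD", "BATCH", "--no-save",
            str(script_path), str(output_path)]


def run_r_script(script_text, work_dir, runner=subprocess.run):
    """Write script_text into work_dir, run it in batch mode, return the output path.

    'runner' defaults to subprocess.run; passing a stand-in lets the file
    handling be exercised on a machine that doesn't have R installed.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    script_path = work_dir / "script.R"      # hypothetical file names
    output_path = work_dir / "script.Rout"
    script_path.write_text(script_text)
    command = build_r_batch_command(script_path.name, output_path.name)
    # cwd keeps files created by the script inside the per-request directory;
    # check=True raises CalledProcessError if R exits with a non-zero status.
    runner(command, cwd=str(work_dir), check=True)
    return output_path
```

Passing the command as a list rather than a single string avoids shell quoting problems with directory names that contain spaces, which matters on Windows in particular.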

It turns out that it's easy to implement an XML-RPC client in C#. The XML-RPC.NET assembly from Cook Computing makes it a simple matter of declaring an interface:

using CookComputing.XmlRpc;

namespace CalculationExample
{
    public interface IRManager : IXmlRpcProxy
    {
        [XmlRpcMethod]
        string RunR(string scriptText,
                    string sessionDirKey);
    }
}

You then use a factory method to construct an object of this type and work with it as a standard C# object.
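To illustrate the "technology neutral" point, the same call can be made from a completely different client. The sketch below uses only the Python standard library. Since it can't assume a running web2py instance, it starts a throwaway stand-in server that mimics the RunR() stub; the function names, the port handling, and the URL are assumptions for the example, and a real web2py deployment would expose its own service URL.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def run_r(script_text, dir_key):
    """Stand-in for the RunR() service; it only echoes the directory key."""
    return "Script executed in directory " + dir_key


def start_stub_server():
    """Start a throwaway XML-RPC server on a free local port; return (server, url)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(run_r, "RunR")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, "http://%s:%d/" % (host, port)


def call_run_r(url, script_text, dir_key):
    """Call RunR over XML-RPC the way any other client would."""
    proxy = ServerProxy(url)
    return proxy.RunR(script_text, dir_key)


if __name__ == "__main__":
    server, url = start_stub_server()
    print(call_run_r(url, "summary(1:10)", "session42"))
    server.shutdown()
    server.server_close()
```

The client side is just the two lines in call_run_r(); everything else exists only so the example can run without a real server.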

This is basically the point I've reached. On the client side, there's a bit of fairly easy C# work to have the code use either the local or the remote methods based on an option setting. On the server side, there's a bit of Python work with "subprocess" and file management methods. As I've done lots of C# and little to no Python, the first task is easy and the second more challenging.

Goodbye www.insightful.com

I've recently noticed that the old Insightful web site has been taken down. Browsing to http://www.insightful.com now redirects to the Spotfire web site.

The good news is the Spotfire web site now lists all of the currently available S+ products. The bad news is all the other information that was up there is hard or impossible to find.

Happily, at least some of the content is available via the "Internet Archive Wayback Machine" at http://web.archive.org/web/*/http://www.insightful.com.

Wednesday 1 April 2009

RColorBrewer for S+

Given the work to make S+ colors compatible with R in S+ 8, it was trivial to port RColorBrewer to S+. I've submitted this to CSAN using the procedure described in the S+ 8 Guide to Packages. It'll be interesting to see what the lag time is to get this up on CSAN.

The motivation for the port is I'm doing more with the-code-formerly-known-as-ArrayAnalyzer lately, and the Bioconductor Case Studies book likes to use RColorBrewer in examples.

---------

HOW TO SUBMIT A PACKAGE TO CSAN

You can share your S-PLUS package with other users within your
department, company, or university. Just send them the package
archive, and have them install it with the INSTALL script or the
install.packages function (setting repos=NULL).

If you want to share your package with the entire S-PLUS community,
you can submit your package for inclusion in the Comprehensive S
Archive Network (CSAN). To submit a package, upload the source
package archive (the result of running Splus CMD build) to:

ftp://ftp.insightful.com/public/incoming/packages

Once you have uploaded your file, send a message to

packages@insightful.com

stating the name of the package archive you submitted.

Before submitting a package for inclusion in CSAN, it should pass the
check utility. Make sure these key fields in the DESCRIPTION file
have appropriate values: Package, Title, Version, Author,
Maintainer, and License. If any of these are missing, your package
will not be posted to CSAN.

Insightful will review your submitted package, run the check utility,
and create a Windows binary archive. If everything passes, the
package is posted to the CSAN site. Any problems with the package
are sent to the package submitter.
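
Before uploading, it's easy to pre-check the key fields mentioned above. The following is a minimal sketch rather than anything from the Guide: it assumes the DESCRIPTION file uses plain "Field: value" lines with indented continuation lines, and the function names are made up for the example. It is no substitute for the real check utility.

```python
REQUIRED_FIELDS = ("Package", "Title", "Version", "Author", "Maintainer", "License")


def parse_description(text):
    """Parse 'Field: value' lines; indented lines continue the previous field."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            fields[current] += " " + line.strip()
        elif ":" in line:
            name, value = line.split(":", 1)
            current = name.strip()
            fields[current] = value.strip()
    return fields


def missing_fields(text, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty, in listed order."""
    fields = parse_description(text)
    return [name for name in required if not fields.get(name)]


if __name__ == "__main__":
    # Hypothetical DESCRIPTION content, for illustration only.
    example = "Package: mypkg\nTitle: An example package\nVersion: 0.1\nAuthor: A. N. Other\n"
    print(missing_fields(example))
```

An empty list means all six fields are present and non-empty.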

Wednesday 18 March 2009

R to S+ Package Ports

Lately I've been doing a lot of experimental package ports from R to S+. I say "experimental" in that there are a few of them underway but not yet completed.

I'm asked regularly how hard it'll be to port a particular package from R to S+. The general answer is that the basic mechanics of it are easy. If the R code is basically using things already in S+ and any C/Fortran code is just working on arrays, then things may port with few changes.

Some items that make a port more difficult are:
  1. Extensive usage of functions that aren't available in S+ and aren't easily ported to S+. The main place this has come up in code I've seen lately is usage of the "grid" graphics.
  2. Usage of more advanced C macros to manipulate R objects at the C level. That is, code using .Call() rather than .C().
  3. The only item that's a real showstopper is use of external pointers with a finalizer. S+ doesn't have a way of calling a C function when an S object is released to do finalization, so you can't do things like having an S object with a reference to a Java object. There's no way to know when to free the Java object. I'm still trying to figure out a workaround for this.

To do R to S+ ports, the first step is to get set up to build S+ packages. This is described in the "Guide to Packages" included with S+. On Linux, you'll be good to go if you have S+ installed and standard tools such as perl, gcc, and gfortran. On Windows, it's a bit more involved.

I've had a pro-Windows bias for many years but I've recently switched to doing the ports on Linux. The main reason is I no longer have a copy of "Visual Fortran", which is required to build Fortran code on Windows. Perhaps I'll get a copy of this installed, or perhaps I'll stick with Linux.

The basic steps involved for a port are:

  1. Put the files in the standard structure for an S+ package. This matches the structure of an R package, so if you are starting with the R package source you can just unzip it.
  2. Modify the DESCRIPTION file to adjust the package dependencies, e.g. add a "DependsSplus" line that's referring to S+ packages rather than R packages.
  3. Run "Splus CMD build -binary [pkgname]" from an OS command line. You'll repeat this over and over. Ideally things will build right away. If not you'll need to modify the source code until it does. For this listing, let's assume the S code is syntactically correct and the C code compiles so we can proceed to the next step. If the C code is failing, move it aside until the S code has been fixed up.
  4. In a separate window start S-PLUS. Use "library(pkgutils)" to load the package utility functions.
  5. Use "unresolvedGlobalReferences([R code dir name])" to get a list of objects that will not be found under S+ scoping rules. This is an invaluable tool. The objects not found are usually either misspelled object/function names, functions available in R and not in S+, or local variables that need to be explicitly passed to inner functions. The next step is to modify the S code to resolve the missing references.
  6. The first step I take on resolving the references is to check which are references to R functions not in S+. Then I put in stub functions that just call "stop()".
  7. The second step is to go through the code fixing misspellings and modifying calls to anonymous functions used in "apply()" to explicitly pass values that are used in the inner functions.
  8. The third step of changes related to object scoping is to change assign() statements so that instead of assigning to ".GlobalEnv" they assign to "where=1" when the intent is to maintain a global variable. Potentially you can store global objects in "frame=0" instead, but it isn't garbage collected very aggressively so this can lead to memory buildup.
  9. At this point in theory the S code builds, scoping problems are fixed, and we've identified missing functions. Now the missing functions need to either be implemented or replaced with calls to other S+ functions.
  10. If the C code was failing to compile, move it back into place and fix the problems in the C code. This can be either easy or horribly hard depending on how complicated the code is.
  11. Now you're ready to test functionality using examples from the help files. At this point you'll identify differences in behavior or arguments between R and S+ functions of the same name.
  12. Repeat until everything works.

So I'm starting to get a routine in place. The only part I find difficult is the C stuff, but that's because I don't do a lot of C programming and I get rusty between uses.

SPSS, Python, and R

I haven't really kept up with SPSS over the years. It turns out they've embraced both Python and R for SPSS programming. Here's a blog posting on why they chose Python.

I'm cooling a little on Python at the moment as I haven't come across an opportunity to use it professionally (yet). However, I think it's interesting that SPSS came to the same conclusion as I did regarding its suitability as a scripting language for the statistical audience.

Wednesday 11 March 2009

Resolver One 1.4: IronPython 2, Numpy, and R

Resolver One is an IronPython based spreadsheet. I'm still getting my head around whether there's a big enough market for an Excel replacement to support a company, but in any case it's very cool.

They have just released Resolver One 1.4, which is based on IronPython 2 and includes Numpy support.

They are also trumpeting R integration. This is done using R(D)COM accessed from IronPython. Now if we just had a Mono implementation of R(D)COM to get this approach working on Linux and Mac...

Tuesday 10 March 2009

Creating a Linux VM with EasyVMX and VMware Player

When working out of the Seattle office of Insightful I had easy access to a wide range of Linux/UNIX boxes on the network. Hence I never got around to installing Linux for my own use.

The network access between Basingstoke and Seattle has gotten a little challenging with various IT changes, so it's time to get a local version of Linux on my machine. While I can imagine using a dual-boot configuration someday, at this point a virtual machine meets my needs.

It turns out that it's pretty straightforward to create a Linux virtual machine that runs on VMware Player. The pieces needed are:

  1. The EasyVMX web site to create the virtual machine description and virtual hard drive files.
  2. The VMware Player application.
  3. The Linux installation disk. I used the LiveISO version of Fedora 10.

With EasyVMX you just specify the OS type, RAM size, hard disk size, ISO file name, and various other optional settings. This then creates a zip with files representing the virtual hardware configuration for the requested specifications. If you specified the ISO file name, the machine will have that ISO loaded on startup.

Download, save, and unzip the zip file. Then open the "vmx" file with VMware Player. The LiveISO version of Fedora will start. This includes an option to install the software to the hard drive, which I did.

After you've installed the software, it'll prompt you to install a bunch of updates. I let it install all of them so it wouldn't bother me about them in the future.

Then you can install additional packages with the "yum" utility. There's a GUI application for software updates, but I went for the command line version. To see whether packages are available for a topic, use "yum list" with "grep":


     yum list | grep R 

To install a package, use "yum install":


     yum install R 

To install S+, download the images from the TIBCO download site and follow the standard S+ installation procedures.

Friday 6 March 2009

Dynamic Languages: R with Python

Previously I've discussed using Python as a primary general-purpose language calling S as a statistics and graphics engine. The capabilities for doing this have been improving over the years. First there was RSPython by Duncan Temple Lang, then rpy by Greg Warnes, and most recently rpy2 by Laurent Gautier.

Here's an example using rpy2 for principal components analysis:


import rpy2.robjects as robjects

r = robjects.r

m = r.matrix(r.rnorm(100), ncol=5)
pca = r.princomp(m)
r.plot(pca, main="Eigen values")
r.biplot(pca, main="biplot")




It's pretty readable for an S programmer, so this looks promising.

Unfortunately it doesn't currently seem to work with IronPython via IronClad. Perhaps someday...

Dynamic Languages: Python

Over the past few years I've thought a lot about how S compares with other programming languages. As discussed in a previous post, S is head and shoulders above general purpose programming languages in terms of the built-in statistical and graphics routines.

Having said that, S is a bit threadbare in terms of general purpose infrastructure routines. I've spent entirely too much of my life writing S wrappers of Java routines to do things like:

  • Zip and unzip files

  • Create and parse XML

  • HTTP client operations


Various packages to do these sorts of things are around, but it's fair to say that the implementations are typically not as full featured as one would have in Java, C#, or Python. Also, some poor soul probably had to do the work of writing C code to interface R to a C library with the capabilities.

In a recent product development cycle we wanted to have a modern Windows GUI written in C# talking to a server written in Java that called a graphics system written in S. Oh, and the graphics device used C to create a graph file as well as Java to generate things like PDF from that file. It sounds a little crazy, but was driven by the desire to have a Windows rich client with a cross-platform server plus extensibility in S.

This meant jumping back and forth between multiple IDEs and some tricky debugging. It also meant essentially implementing some things twice: once in Java and once in C#. It left me longing for a system that could meet the following requirements without so many languages:
  • Cross-platform in terms of both operating systems and virtual machines. Runs on Windows, UNIX, Linux, perhaps Mac. Runs in the .NET CLR and the Java VM.

  • Has a rich standard library for string manipulation, network operations, etc. That is, it can do the bread-and-butter stuff needed by programs.

  • Uses an interpreted, loosely typed language that provides the same productivity benefits as S. Approachable and quick to learn for S programmers so that perhaps they could use it rather than S for extensibility.

Basically I wanted to have a single language that I could use to do .NET programming for the client, Java programming for the server, and ideally the type of programming one typically does in S.

The language that rose to the top was Python. The standard Python implementation is CPython which has been ported to oodles of platforms. There's also IronPython which runs on the .NET CLR and Jython that runs on the Java VM. With Mono and the Dynamic Language Runtime (DLR) it even runs on Android (Google's OS for cell phones).
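
As a small illustration of the "bread-and-butter" point, here is a sketch of two of the chores listed earlier (zipping files and creating/parsing XML) done with nothing but the Python standard library. It is written for a current Python 3; the function names and the XML element names are invented for the example. HTTP client operations are covered by the standard "urllib.request" module, which isn't exercised here because it would need a network connection.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def zip_texts(files):
    """Pack a {name: text} mapping into an in-memory zip archive; return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def unzip_texts(data):
    """Unpack zip bytes back into a {name: text} mapping."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {name: archive.read(name).decode("utf-8")
                for name in archive.namelist()}


def build_job_xml(script_name, dir_key):
    """Create a tiny XML document describing a job (element names are made up)."""
    job = ET.Element("job", {"dirKey": dir_key})
    ET.SubElement(job, "script").text = script_name
    return ET.tostring(job, encoding="unicode")


def parse_job_xml(xml_text):
    """Parse the XML produced above back into (script_name, dir_key)."""
    job = ET.fromstring(xml_text)
    return job.findtext("script"), job.get("dirKey")
```

None of this needs a wrapper around another language's library, which is exactly the kind of work I'd like to stop doing in S.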

I looked at bunches of other dynamic languages including Ruby, F#, Groovy, etc. None really had the right "feel" in terms of being something approachable to S programmers. Python has a sensible syntax that's easy to learn and follow. The most common complaints against it are the use of indentation rather than braces to delimit blocks, and of course that an interpreted language typically isn't as quick as well-written C code.

The main limitation of IronPython and Jython is that while they can run pure Python modules they can't handle modules using C, with the most requested one being numpy. The guys at Resolver Systems are working to remedy this for IronPython via the IronClad project.

The big downside of Python by itself for S users is that it doesn't have the rich statistics and graphics available in S. So the thought of using Python alone rather than S is a non-starter. But I do think there's potential for using Python as a primary general-purpose language together with R as a statistics engine.

Not that I expect I'll be doing Python rather than C# and Java anytime soon, but perhaps as the DLR matures it'll become a viable option.

Dynamic Languages: R in Ruby

One of the "hot" languages in the past few years is Ruby. My impression actually is that Ruby may have peaked a bit with the buzz for "Ruby on Rails". The good ideas from "Rails" are getting implemented in bunches of other frameworks such as MonoRail, Grails, etc.

But hey, I may be wrong. For Ruby proponents interested in some statistics, a cool project is RinRuby. This provides Ruby functionality for invoking R as an embedded statistics engine.

Here's an example creating a graph and printing a correlation value from Ruby:

tally = Hash.new(0)
File.open('gettysburg.txt').each_line do |line|
  line.downcase.split(/\W+/).each { |w| tally[w] += 1 }
end
total = tally.values.inject { |sum,count| sum + count }
tally.delete_if { |key,count| count < 3 || key.length < 4 }

require "rinruby"
R.keys, R.counts = tally.keys, tally.values

R.eval <<EOF
names(counts) <- keys
barplot(rev(sort(counts)),main="Frequency of Non-Trivial Words",las=2)
mtext("Among the #{total} words in the Gettysburg Address",3,0.45)
rho <- round(cor(nchar(keys),counts),4)
EOF

puts "The correlation between word length and frequency is #{R.rho}."



One of the cool things about RinRuby is that the implementation uses pure Ruby code, i.e. there's no C code behind the scenes. The Implementation Details explain:

RinRuby is a program which allows the user to run R commands from Ruby. An instance of R is created as a new object within Ruby, which allows R to remain open and running until the user closes the connection. There is no software that needs to be installed in R. Ruby sends data to R over TCP/IP network sockets, while commands and text are passed through the pipe. The pipe avoids compatibility issues on differing operating systems, platforms, and versions of Ruby and R, while the socket can handle large amounts of data quickly while avoiding rounding issues for doubles.

A benefit of this is that it is expected to work in JRuby and IronRuby.

This installed easily for me but didn't detect the location of R. I probably don't have the location of R in my registry. As a testament to the benefits of a scripting language I could fix this by editing the appropriate line in the "rinruby.rb" file to hardcode the location of R for my machine.

It ran properly under the Windows console application "irb". It gave an error under the "fxri" application which provides a console along with help information. I suspect that when running in a GUI the "rinruby" code is fighting it out with the GUI regarding redirection of stdout. Or it might be that a Windows console application has the stdout pipe and a Windows GUI application doesn't.

Dynamic Languages: Why S?

In the distant past I used BASIC, Pascal, APL, and Fortran. Then I started using S as a grad student, and hence C in order to speed up some of the computations. In the Java 1.2.1 era I took up Java, and around the introduction of C# 3 started doing some .NET programming.

In the past few years I've thought a lot about what the unique strengths are of S implementations (by which I mean R and S+).

If you cast your mind back to the dawn of S, the general purpose language commonly available for scientific computing was Fortran. To make Fortran a little more productive, SLAC created MORTRAN and Bell Labs created RATFOR. A crazy new language called C was also gaining traction.

The systems available for statistical computing were procedural languages such as SAS and SPSS. (SAS was implemented in PL/I.)

So imagine yourself doing statistical analysis in that environment. You don't have any of these crazy point-and-click things that we have today. You're very likely used to submitting your job to the computing center, going away for a bit, and then dropping by to pick up the line printer output later. Or perhaps you've moved a step up the technology chain and now have an interactive terminal to a timesharing system so you can run your batch job and immediately get results back on the screen.

Pretty frustrating for exploratory data analysis, eh? Suddenly along comes S where you can much more rapidly analyze your data. You can read in some text, view some summaries, and graph the results as iterative steps in an analysis. Even better, you can create your own functions to perform particular analyses without the complexity of the write/compile/run cycle needed with C and Fortran.

S really was special at the time in terms of the additional productivity for exploratory data analysis compared to the alternatives.

I'm sort of glossing over the time differences between the release of Sv1, Sv2, and Sv3. S as we know it today is pretty much an evolutionary improvement on Sv3 as described in "The New S Language" (1988). Feel free to clarify or comment on inaccuracies in the blog comments.

Well, that's fine and dandy but that was 20 years ago and things change. How does S stack up today?

The primary strengths I see in S engine implementations are:
  • A loosely typed dynamic language is much more productive to program in than a compiled language such as C, Fortran, Java, or C#. You can find any number of articles by proponents of Python or Ruby making the case for dynamic languages as productivity tools. In fact, Microsoft has put a lot of energy into IronPython and IronRuby in the last few years, and better support for dynamic languages is one of the primary enhancements planned for C# 4.

  • The range of statistical techniques that are available is unparalleled. SAS is in the same league and probably still dominates for mixed effect models, but the S implementations are ahead in other areas such as MCMC. C# and Java really don't go very far in terms of statistical capabilities. Python does a little better with numpy and scipy but more in terms of techniques for engineering than for general statistics. In all honesty I don't know where Stata, SPSS, Statistica, etc. rank these days but my impression is they have strengths in focused areas without the breadth of S+ and R.

  • Strongly integrated with the statistics is a world-class presentation graphics system. The C#/Java/Excel world is full of charting packages to do simple scatter plots, bar charts, and pie charts but that's pretty much as far as it goes. S has a much wider range of statistical graph types such as boxplots, and the ability to customize the graphs extensively. I also think the graphs just downright look good compared to the output of many graphing packages. It's sort of like comparing an equation from LaTeX with one from the Word equation editor.

In all honesty, I don't think the first point is a differentiating strength of S anymore. Sure, S is still more productive than C and Fortran. I'm even willing to argue that it's better for statistical computing than other older interpreted languages such as Lisp, Scheme, and Smalltalk. But I don't think it's distinguished itself as superior to the likes of Python and Ruby in terms of the language itself.

Providing a dynamic language is no longer a differentiating feature that makes S special. What does make it special is that it still provides a wealth of statistical and graphical capabilities that aren't available in other languages. Putting it another way, it isn't the language that makes S valuable, it's the statistics and graphics class libraries.

The motivation for this discussion is that by acknowledging the strengths and weaknesses of S as a language one can start to explore combining S with other dynamic languages that are strong in areas where S is weaker. Future posts will elaborate on this topic.

S+ Tips: startsWith() and endsWith()

Whenever I go from using C# or Java back to using S, one of the annoyances is that the standard string methods available in most languages aren't around. That is, if one wants something more convenient than straight calls to grep().

I usually end up defining some wrappers to let me do "starts with" and "ends with" operations on single string values:


"strStartsWith" <- function(str, prefix){
  # Test whether a string starts with a particular prefix.
  # case insensitive
  if (is.null(str) || !is.character(str) || length(str) != 1){
    stop("The 'str' argument must be a single string value.")
  }
  (length(grep(paste("^", prefix, sep=""), str, ignore.case = TRUE)) > 0)
}

"strEndsWith" <- function(str, suffix){
  # Test whether a string ends with a particular suffix.
  # case insensitive
  if (is.null(str) || !is.character(str) || length(str) != 1){
    stop("The 'str' argument must be a single string value.")
  }
  (length(grep(paste(suffix, "$", sep=""), str, ignore.case = TRUE)) > 0)
}




To make these polished S functions it would be better to make them vectorized rather than throwing the error.

I considered implementing these by calling into Java, but went with the grep() approach so they could be used without loading Java. The downside of using Java is just that the startup time for a Java application is significantly slower than for a straight C application, and this in turn leads to slower startup of S+ when Java is loaded.

Another thing to note is that I used the name strStartsWith() rather than just startsWith(). When actually defining these in a package I'd really use a prefix common to all the functions in the package. For example, in the "fm" package it would be fmStartsWith(). I do this because S+ just has one big namespace, so I want to avoid using up commonly-used names that might have a different purpose in another package.

Thursday 26 February 2009

What's up with S+ 8.1?

Eagle-eyed S users may have noticed that S+ 8.1 is now available. The desktop version came out in December and TIBCO just released S+ Server and S+ Miner.

As the Insightful web site is being phased out, the new release isn't discussed there. Instead it's over on the TIBCO Spotfire web page:

http://spotfire.tibco.com/products/s-plus.cfm

Some exciting innovations in this release:

  • The language is no longer "S-PLUS", it's now "S+". Sort of like the whole "Puff Daddy" to "P Diddy" transition.
  • The products are now "TIBCO Spotfire S+", "TIBCO Spotfire S+ Server", and "TIBCO Spotfire S+ Miner". You'll need to get used to looking for things in the "TIBCO" locations rather than the "Insightful" locations. The name reflects the fact that the products are part of the Spotfire product family within the TIBCO offerings.
  • The software is now obtained by downloading from the TIBCO download site rather than on a CD. Needless to say, printed manuals are also a thing of the past.

Oh yeah, there are some new features too. The highlights are available at:

http://spotfire.tibco.com/products/whatsnew_splus.cfm

The best place on the web at the moment for more detail is actually Northwestern, as they put links to the S+ 8.1 for Linux documentation on their web site. Northwestern people - if that's intended for internal use only, let me know and I'll edit this out.

Material is starting to roll out on the "Spotfire Technology Network" web site:

Integrating with Spotfire S+
Spotfire S+ Server Primer (C#)

Keep an eye on the STN for future examples.

Why "statsci"?

This blog and the related web site are associated with the user "statsci@gmail.com". StatSci was the name of the company that did the first commercial version of the S language. In a long and tragic tale, StatSci was purchased by MathSoft which became Insightful which was purchased by TIBCO.

When setting up the blog I wanted to have a user name and web site just for my individual technical postings. That is, one not connected with my personal or work email accounts. It turns out that a surprising number of variations on my name were already taken, but "statsci" was available on Gmail, Windows Live, Blogger, and Google Sites. So it's mine now.

It wasn't available on Yahoo, and the following sites have nothing to do with me:

http://www.statsci.com/
http://www.statsci.org/
http://www.statsci.net/
http://www.statsci.co.uk/

So there's prior art on the recycling of this name for other purposes.

The site that goes with this blog is at:

http://sites.google.com/site/statsci/

Welcome!

Welcome to my new blog on statistical computing. The emphasis will be on S language implementations such as S+ and R. I expect it'll also cover interesting aspects of Java, Python, and .NET.