Friday, 6 March 2009

Dynamic Languages: Why S?

In the distant past I used BASIC, Pascal, APL, and Fortran. Then I started using S as a grad student, and from there picked up C in order to speed up some of the computations. In the Java 1.2.1 era I took up Java, and around the introduction of C# 3 I started doing some .NET programming.

In the past few years I've thought a lot about what the unique strengths of S implementations are (by which I mean R and S+).

If you cast your mind back to the dawn of S, the general purpose language commonly available for scientific computing was Fortran. To make Fortran a little more productive, SLAC created MORTRAN and Bell Labs created RATFOR. A crazy new language called C was also gaining traction.

The systems available for statistical computing were procedural languages such as SAS and SPSS. (SAS was implemented in PL/I.)

So imagine yourself doing statistical analysis in that environment. You don't have any of these crazy point-and-click things that we have today. You're very likely used to submitting your job to the computing center, going away for a bit, and then dropping by to pick up the line printer output later. Or perhaps you've moved a step up the technology chain and now have an interactive terminal to a timesharing system so you can run your batch job and immediately get results back on the screen.

Pretty frustrating for exploratory data analysis, eh? Suddenly along comes S, where you can analyze your data much more rapidly. You can read in some text, view some summaries, and graph the results as iterative steps in an analysis. Even better, you can create your own functions to perform particular analyses without the complexity of the write/compile/run cycle needed with C and Fortran.

S really was special at the time in terms of the additional productivity for exploratory data analysis compared to the alternatives.

I'm sort of glossing over the time differences between the release of Sv1, Sv2, and Sv3. S as we know it today is pretty much an evolutionary improvement on Sv3 as described in "The New S Language" (1988). Feel free to clarify or comment on inaccuracies in the blog comments.

Well, that's fine and dandy but that was 20 years ago and things change. How does S stack up today?

The primary strengths I see in S engine implementations are:
  • A loosely typed dynamic language is much more productive to program in than a compiled language such as C, Fortran, Java, or C#. You can find any number of articles by proponents of Python or Ruby making the case for dynamic languages as productivity tools. In fact, Microsoft has put a lot of energy into IronPython and IronRuby in the last few years, and better support for dynamic languages is one of the primary enhancements planned for C# 4.

  • The range of statistical techniques that are available is unparalleled. SAS is in the same league and probably still dominates for mixed-effects models, but the S implementations are ahead in other areas such as MCMC. C# and Java really don't go very far in terms of statistical capabilities. Python does a little better with numpy and scipy, but more in terms of techniques for engineering than for general statistics. In all honesty I don't know where Stata, SPSS, Statistica, etc. rank these days, but my impression is they have strengths in focused areas without the breadth of S+ and R.

  • Tightly integrated with the statistics is a world-class presentation graphics system. The C#/Java/Excel world is full of charting packages to do simple scatter plots, bar charts, and pie charts, but that's pretty much as far as it goes. S has a much wider range of statistical graph types such as boxplots, and the ability to customize the graphs extensively. I also think the graphs just downright look good compared to the output of many graphing packages. It's sort of like comparing an equation from LaTeX with one from the Word equation editor.

In all honesty, I don't think the first point is a differentiating strength of S anymore. Sure, S is still more productive than C and Fortran. I'm even willing to argue that it's better for statistical computing than other older interpreted languages such as Lisp, Scheme, and Smalltalk. But I don't think it's distinguished itself as superior to the likes of Python and Ruby in terms of the language itself.

Providing a dynamic language is no longer a differentiating feature that makes S special. What does make it special is that it still provides a wealth of statistical and graphical capabilities that aren't available in other languages. Put another way, it isn't the language that makes S valuable; it's the statistics and graphics class libraries.

The motivation for this discussion is that, by acknowledging the strengths and weaknesses of S as a language, one can start to explore combining S with other dynamic languages that are strong in areas where S is weaker. Future posts will elaborate on this topic.
