30 April 2012

The Tail Wags the Dog, 95% CI

There's been a spate of R pieces recently, dealing with R as a programming language, and in particular, its assumed deficiencies. Here, and here, and here are examples.

It's a bunch of tails wagging the dog, and doesn't address the real question: how to make R the de facto stat pack where SAS, SPSS, and Stata tread currently. As mentioned in the Triage piece, some reviews of R are concerned with how much is R, how much C, and how much Fortran. Various reviewers have been puzzled by the poles: more R than expected, and less R than expected. There are reported to be 3,800 packages in CRAN, and rather fewer (554) in Bioconductor. Call it 4,400 in round numbers. Assume that a package has, on average, 3 authors, which I think is generous, given how many grad students are involved (hell, Hadley Wickham does ggplot2 all by his lonesome). That's 13,200 folks.
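For anyone who wants to check the back-of-envelope arithmetic, here is a minimal sketch in Python. The package counts and the three-authors-per-package figure are the ones quoted above; the variable names and the rounding helper are just illustrative choices, not anything taken from CRAN or Bioconductor tooling.

```python
# Back-of-envelope estimate of R package authors, using the post's figures.
CRAN_PACKAGES = 3_800          # reported count, as quoted in the post
BIOCONDUCTOR_PACKAGES = 554    # reported count, as quoted in the post
AUTHORS_PER_PACKAGE = 3        # the post's (generous) assumption


def round_to_hundred(n: int) -> int:
    """Round to the nearest hundred, i.e. 'round numbers' (hypothetical helper)."""
    return int(round(n, -2))


exact_packages = CRAN_PACKAGES + BIOCONDUCTOR_PACKAGES
round_packages = round_to_hundred(exact_packages)
estimated_authors = round_packages * AUTHORS_PER_PACKAGE

print(exact_packages, round_packages, estimated_authors)
```

The estimate treats every package as having its own distinct authors, so anyone who works on several packages is counted more than once; if anything, it overstates the head count.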

So, we have 2,000,000 useRs. We have 13,200 "developers" (not counting the core team maintaining the language). Which group should the "language" serve? Clearly, the 2,000,000. In particular, insurgency into the SAS/SPSS beachhead will not be supported by emphasizing R as a coders' paradise (it isn't; too many warts) rather than as an analysts' golden sword. It seems to me, having used most stat packs and "real" programming languages over the years, that this divided-duties situation is what makes for some of the oddities of R: oddities from a command writer's point of view as well as from a coder's. The first link has all the gory details. I have used one 4GL, Progress, which was the best in high button shoes for databases in the early 1990s, and which was bootstrapped. But the audience was other coders (the report generator had its own syntax, a bit of RPG), not analysts, so having one syntax for two groups of coders wasn't a big deal. With R, the two constituencies are much more different.

One can build a language successfully without being a language designer (or group of designers) by profession: Perl and Ruby are the two best-known examples. Contrast with Python (defined by a mathematician) and Java (language builder). Which of these one finds most comfortable as a working syntax says more about oneself than about the language. For what it's worth, Python. I've read Chambers's book, and a good many others, and I'm still not clear about why or how R's syntactical oddities are supposed to serve uniquely the purpose of stats. Clearly, the vector paradigm comes from Fortran and BMDP, and does fit. The rest, not so much. And it is true that numerical programming has been on a drift from Fortran to C for some time; one can argue that this represents a lowering of the semantic level, and is thus not helpful.

As I commented on a post, Rcpp is the likely platform for development going forward; how soon, I can't say. Such a transition will mean that stat folks won't be the driving authors anymore, unless they choose to be real coders too (or primarily). This opinion is all based on the received wisdom that what's holding R back from displacing SAS/SPSS is speed. I don't think it's the open source thing, really. After all, even IBM uses Linux. The file-based structure of SAS/SPSS does have advantages.
