Thursday, February 11, 2016

Reproducibility in Computer Science

There has been a lot of discussion lately about reproducibility in the sciences, especially the social sciences. The result that garnered the most attention was the Nosek study, where the authors tried to reproduce the results of 98 studies published in psychology journals. They found that they were able to reproduce only about 40% of the published results.

Now it's computer science's turn to go under the spotlight. I think this is good, for a number of reasons:

  1. In computer science there is a lot of emphasis placed on annual conferences, as opposed to refereed journal articles. Yes, these conferences are usually refereed, but the reports are generally done rather quickly and there is little time for revision. This emphasis has the unfortunate consequence that computer science papers are often written quite hastily, a week or less before the deadline, in order to make it into the "important" conferences of your area.

  2. These conferences are typically quite selective and accept only 10% to 30% of all submissions. So there is pressure to hype your results and sometimes to claim a little more than you actually got done. (You can rationalize it by saying you'll get it done by the time the conference presentation rolls around.)

    (In contrast, the big conferences in mathematics are often "take-anything" affairs. At the American Mathematical Society meetings, pretty much anyone can present a paper; they sometimes have a special session for the papers that are whispered to be junk or crackpot stuff. Little prestige is associated with conferences in mathematics; the main thing is to publish in journals, which have a longer time frame suitable for good preparation and reflection.)

  3. A lot of research in computer science, especially the "systems" area, seems pretty junky to me. It always amazes me that in some cases you can get a Ph.D. just for writing some code, or, even worse, just modifying a previous graduate student's code.

  4. Computer science is one of the areas where reproducibility should (in theory) be the easiest. Usually, no complicated lab setups or multimillion-dollar equipment are needed. You don't need to recruit test subjects or pass through ethics reviews. All you have to do is compile something and run it!

  5. A lot of computer science research is done using public funds, and as a prerequisite for obtaining those funds, researchers agree to share their code and data with others. That kind of sharing should be routine in all the sciences.

Now my old friend and colleague Christian Collberg (who has one of the coolest web pages I've ever seen) has taken up the cudgels for reproducibility in computer science. In a paper to appear in the March 2016 issue of Communications of the ACM, Collberg and co-authors Todd Proebsting and Alex M. Warren relate their experiences in (1) trying to obtain the code described in papers and then (2) trying to compile and run it. They did not attempt to reproduce the results in papers, just the very basics of compiling and running. They did this for 402 (!) papers from recent issues of major conferences and journals.

The results are pretty sad. Many authors had e-mail addresses that failed (probably because they moved on to other institutions or left academia). Many simply did not reply to the request for code (in some cases Collberg filed freedom of information requests to try to get it). Among the authors who did reply, the code they supplied often failed for a number of different reasons, like important files missing. Ultimately, only about half of all papers had code that passed the very basic tests of compiling and running.

This is going to be a blockbuster result when it comes out next month. For a preview, you can look at a technical report describing their results. And don't forget to look at the appendices, where Collberg describes his ultimately unsuccessful attempt to get code for a system that interested him.

Now it's true that there are many reasons (which Collberg et al. detail) why this state of affairs exists. Many software papers are written by teams, including graduate students who come and go. Sometimes the code is not adequately archived, and disk crashes can result in losses. Sometimes the current system has been greatly modified from what's in the paper, and nobody saved the old one. Sometimes systems ran under older operating systems but not the new ones. Sometimes code is "fragile" and not suitable for distribution without a great deal of extra work which the authors don't want to do.

So in their recommendations Collberg et al. don't demand that every such paper provide working code when it is submitted. Instead, they suggest a much more modest goal: that at the time of submission to conferences and journals, authors mention what the state of their code is. More precisely, they advocate that "every article be required to specify the level of reproducibility a reader or reviewer should expect". This information can include a permanent e-mail contact (probably of the senior researcher), a website from which the code can be downloaded (if that is envisioned), the degree to which the code is proprietary, availability of benchmarks, and so forth.
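To make that concrete, here is a rough sketch of what such a statement might look like as a small record. It is only an illustration: the class name, field names, and example values below are my own assumptions, not a format proposed by Collberg et al.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReproducibilityStatement:
    """Hypothetical record of what a reader can expect from a paper's code.

    The fields mirror the items listed above; the names are illustrative only.
    """
    contact_email: str                   # permanent contact, probably the senior researcher
    code_url: Optional[str] = None       # download site, if one is envisioned
    proprietary: str = "unspecified"     # degree to which the code is proprietary
    benchmarks_available: bool = False   # whether benchmarks can be obtained

    def summary(self) -> str:
        """Render the statement as one line suitable for a paper's front matter."""
        code = self.code_url if self.code_url else "no download planned"
        bench = "available" if self.benchmarks_available else "not available"
        return (f"contact: {self.contact_email}; code: {code}; "
                f"proprietary: {self.proprietary}; benchmarks: {bench}")


# Example values are made up for illustration.
statement = ReproducibilityStatement(
    contact_email="senior.researcher@example.edu",
    code_url="https://example.edu/project/code",
    proprietary="none",
    benchmarks_available=True,
)
print(statement.summary())
```

The point of a record like this is that it is cheap to fill in at submission time, even when the honest answer is that no code will be released.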

Collberg tells me that as a result of his paper, he is now "the most hated man in computer science". That is not the way it should be. His suggestions are well-thought-out and reasonable. They should be adopted right away.

P. S. Ironically, some folks at Brown are now attempting to reproduce Collberg's study. Many people take issue with specific evaluations in the paper. I hope this doesn't detract from Collberg's recommendations.
