Our corpus is your corpus.

[ Here is our P3P corpus, be it your corpus as well. ]

Giving away your corpus (in empirical language analysis or elsewhere) is perhaps nothing extremely established, but it's nothing original or strange either. It does make sense a lot! Stop limiting the impact of your research efforts! Stop wasting the time of your community members! Sharing corpora is one of these many good ideas of Research 2.0: see SSE'10 (and friend events), eScience @ Microsoft, R2oSE, ...

Computer Science vs. Science

When you do academic CS research in programming- or software development-related contexts, then the culture of validation is these days such that you are often expected to provide online access to your program, application, library, tool, what have you as an implementation or illustration. There are various open-source repositories that are used to this end--as a backend (a storage facility), but any sort of author-hosted download locations are also used widely. In basic terms, if you write a paper, you include a URL. (There is one exception: if your work leverages Haskell, you can usually include the complete source code right into your paper so that one gets convenient access through copy and paste. Sorry for the silly joke.) Metadating-wise, common practice is nowhere perfect, but it's perfect compared to what follows.

When you do empirical analysis in CS, which results in some statements or data about software/IT artifacts, then the culture of validation is essentially the one of science. In particular, reproducibility is the crucial requirement. You describe the methodology of your analysis in a detailed manner. So you define your hypotheses, your input, your techniques for measurement, your results (which you also interpret), your threats to validity, what have you. Downloads aren't integral with science. What would you want to download anyway?

Message of this post?!

I suggest that various artifacts of an empirical analysis in CS, in general; in empirical language analysis, in particular, qualify for a valuable download. In this post, I want to call out the corpora (as in corpora of source projects, buildable projects, built projects, runnable projects, ran projects, demos, etc.).

Beyond reproducibility in CS

What's indeed not yet commonplace (if done ever) is that the corpora underlying empirical analyses are given convenient access to. Consider for example Baxter et. al's paper on structural properties of Java software, or Cranor et al.'s paper on P3P deployment. These are two seminal papers in their fields. I would loose a night of sleep over each of the two corpora.
Wouldn't it be helpful for researchers if such corpora were made available for one-click download incl. useful metadata, potentially even tooling? Let's suppose such convenient access became a best practice. First, reproducibility would be improved. Second, derived research would be simplified. Third, incentives for collaboration would be added.

I contend that convenient access adds little pain for the original author, but adds huge value for the scientific community. Why should we need to execute the description of some corpus from some paper, if it requires substantial work for us, but the corpus would be easily shared by the primary author. Why should we work hard to "reproduce the corpus" if some little help by the original authors would make reproducibility (of the corpus and most of the research work, perhaps) a charm.

Naysayers -- get lost

I can think of many reasons why 'convenient access' is not getting off the ground. Here are few obvious options:
  • "It's extra work, even if it is little extra work." This problem can be solved if incentives are created. For instance, publications on empirical analysis with 'convenient access' to the corpus could be rated higher than those w/o. Also, just like tool papers in many venues, there could be corpus papers.
  • "There is sufficient, inconvenient access available already." At least, for one of the two examples above, I fully understand how I could go about gathering the corpus myself, but I have not executed this plan, even though I could really use this corpus in some ongoing research activity. It's just too much work for me. I am effectively hampered in benefitting from the authors' research beyond their immediate results.
  • "Provision of convenient access is too difficult." Think of a corpus of Java programs. Suddenly, an access provider gets into the business of configuration management. After all, convenience would imply that the corpus builds and runs out of the box. I think the short-term answer is that access to the corpus w/o extra "out-of-the box" magic is still more convenient than no access. The long-term-answer is that we may need a notion of remote access to corpora, where I can give you access to my corpus in my environment, through appropriate, web-based or service-oriented interfaces.
  • "Convenient access gives a head start to the competition." I refuse to believe that this is really too relevant in academic practice. For instance, I am sure that the research groups behind the above-mentioned papers have no "corpus monopoly" in mind. I have not done much work on empirical analysis, but I have experience with papers that "give away details", and I must say that those papers which give away the most typically coincide with those which have the highest impact in all possible ways.
  • "There is copyright & Co. in the way." Yes, it is. This is a serious problem, and we better focus on solving the problem shortly, if we want to get anywhere with science and (IT) society in this age. This post will just explode if I tried to comment on that issue over here. There are many good ideas around on this issue, and we all understand that some amount of sharing works even now in this very imperfect world as we have it. If you are pro-Research 2.0, don't get bogged down by this red herring.
Well, I can think of quite a number of other reasons, but I reckon that all the usual suspects have been named, and everything else can be delegated perhaps to some discussion on this blog or elsewhere.

Ralf Lämmel

PS: CS is of an age that empirical research is becoming viral and vital. I am grateful for talking to Jean-Marie occasionally with his lucid vision of Research 2.0 and linguistics for software languages---two topics that are strongly connected. Empirical analysis of software languages has got to be an integral part of software language linguistics. Specialized software-engineering conferences like SLE, ICPC and MSR or even big ones like ICSE or OOPSLA include empirical research for a while now.


  1. I could not agree more with the need for better data sharing practices within and among research communities in Empirical Software Engineering. If you are interested in this topic, you may want to check out the RESER workshop at ICSE 2010, coming up in May (http://sequoia.cs.byu.edu/reser2010). Research reporting (including data sharing and tooling) is an important sub-topic of replication.

  2. Thanks, Jonathan. I was indeed not aware of the workshop. Cool.