Sunday 12 March 2017

Reference on availability of source code used in computer science research articles?


I have several questions related to open-sourcing the source code used for a research article.


Is there any research/study that addresses any of the following:



  • What percentage of research articles are accompanied by their source code (i.e., the source code is made available somewhere online)?


  • What percentage of research articles provide the source code at or before the publication date?

  • What percentage of research articles whose authors promised to release the source code actually do so?

  • What percentage of research articles provided the source code at some point, only for it to disappear later?


I am mostly interested in the field of computer science > machine learning, and English-speaking venues.



Answer



Here's a relevant study on computer science systems research that addresses your first question, "What percentage of research articles are provided with their source code?". The study is described in a tech report:



"Measuring Reproducibility in Computer Systems Research." Christian Collberg, Todd Proebsting, Gina Moraila, Akash Shankaran, Zuoming Shi, Alex M Warren. March 21, 2014.




The authors of this study observed the following protocol to determine code availability:



We downloaded 613 papers from the latest incarnations of eight ACM conferences (ASPLOS’12, CCS’12, OOPSLA’12, OSDI’12, PLDI’12, SIGMOD’12, SOSP’11, VLDB’12) and five journals (TACO’9, TISSEC’15, TOCS’30, TODS’37, TOPLAS’34), all with a practical orientation. For each paper we determined whether the published results appeared to be backed by source code or whether they were purely theoretical. Next, we examined each non-theoretical paper to see whether it contained a link to downloadable code. If not, we examined the authors’ websites, did a web search, examined popular code repositories such as github and sourceforge, to see if the relevant code could be found. In a final attempt, we emailed the authors of each paper for which code could not be found, asking them to direct us to the location of the source. In cases when code was eventually recovered, we also attempted to build and execute it. At this point we stopped — we did not go as far as to attempt to verify the correctness of the published results.



Here is a summary of their findings:



  • Total papers examined: 613

  • Papers that appeared to be backed by source code (not purely theoretical): 515


Of these 515 papers, 105 were excluded from consideration so that the resulting set of papers had no overlapping author lists. That leaves 410 papers, with results as follows:




  • Papers with link to source in the paper: 85

  • Papers not in above category, where source was found via web search: 65

  • Papers where author shared source following email request: 81

  • Papers where author declined to share source following email request: 149

  • Papers where author did not respond to email requests for source: 30
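To relate these counts back to the first question, here is a minimal sketch that turns them into percentages of the 410 papers. The category labels and the helper `share` are my own shorthand, and the percentages are derived from the counts listed above rather than quoted from the tech report.

```python
# Counts copied from the summary above; the labels are shorthand, not the report's wording.
counts = {
    "link in paper": 85,
    "found via web search": 65,
    "shared after email request": 81,
    "declined after email request": 149,
    "no response to email": 30,
}

total = sum(counts.values())  # 410, the set of papers with no overlapping author lists


def share(n, denominator=total):
    """Return n as a percentage of denominator, rounded to one decimal place."""
    return round(100 * n / denominator, 1)


shares = {label: share(n) for label, n in counts.items()}

# Code that could be found without contacting the authors.
public = counts["link in paper"] + counts["found via web search"]
# Code obtained by any of the three routes, including email requests.
obtained = public + counts["shared after email request"]

for label, pct in shares.items():
    print(f"{label}: {pct}%")
print(f"publicly findable: {share(public)}%")
print(f"obtained by any route: {share(obtained)}%")
```

By this arithmetic, about 37% of the 410 papers had code that could be found without contacting the authors, and about 56% once code shared by email is included.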


More details of the methodology and results, as well as all the data and other materials used in this study, may be found on the study's web site.


There is an "Anecdotes" section appended to this tech report, which I think you may find very interesting, as it relates to some of the other points in your question. It documents the study authors' struggles to get paper authors to hand over their source code :)

