In January, I read an article entitled “3 simple things GitHub can do for science.” The title is pretty self-explanatory; it proposes three concrete steps GitHub can take to increase its utility for scientific research. “It’s nice to dream,” I thought. On Wednesday, I read that GitHub has actually (mostly) implemented one of the three recommendations. On the occasion of this nice surprise, I’d like to reflect on what’s happened, why it’s nice, and what remains to be done.
As the technical sophistication of science increases, reproducing results is no longer just a matter of having proper experimental equipment, technical prowess, and the care to avoid mistakes; it also requires access to the specific computer files that other researchers used. At least in linguistics, the culture of the field has not entirely kept pace. The upshot is that the half-life of scientific data is 17 years.
GitHub can help with this problem. It provides a platform on which data can be made available. It has several other desirable features as well. Every file on GitHub carries a full version history, which makes it possible to track refinements over time. And GitHub’s social features allow you to follow files as they are edited by multiple people.
So it’s good news that it is now possible to obtain DOIs for GitHub repositories, making it easier to cite them in a familiar manner. Indirectly, this will (hopefully) encourage the wider dissemination of data on GitHub. The system is not without its kinks: I tried to use it on my one publicly available project (in collaboration with Meredith Tamminga), but failed because the system appears to take account only of new releases, not previously available ones. I’ve reported this issue, and hopefully it will be resolved in short order.
It’s axiomatic on the internet that technical solutions for social problems almost never work, and ultimately GitHub’s move is a technical one. It’s not going to solve the problem that few people want to be the first penguin off the iceberg in terms of exposing their research to critical examination (despite the benefits that open access provides).
<a title="Brocken Inaglory [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons" href=""><img width="512" alt="Chinstrap Penguins at iceberg" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d0/ChinstrapPenguinsaticeberg.jpg/512px-ChinstrapPenguinsaticeberg.jpg"/></a>
And my own experience is that even with the best of intentions it’s hard work to prepare data to be shared. But GitHub has done an incalculable amount of good for the cause of software freedom, and I’m encouraged to see them engaging with the scientific community. I hope that we continue to see improvements on this front, technical as well as social. You can help by getting DOIs for your scientific code and data that are already available.