Friday 23 December 2016

collaboration - How to integrate partial version control, data exchange and research assistants?


Currently, my coauthors and I use GitHub to collaborate on coding and writing, but also for data exchange. We have a lot of data, often not in text format (e.g. PDF). Most of it is collected by research assistants, with whom we don't share our Git repo.


More specifically, we use Python, shell, R, Stata and LaTeX, and most of it is fully integrated. That is, Python and shell scripts generate the data that is used by R and Stata, whose output is compiled directly into LaTeX.


We don't want to deviate from this high level of automation, but our approach has two major shortcomings:



  1. Research assistants cannot, and are not allowed to, upload their data directly into our repo. Instead, they send us their data and files via another Git repo. This puts additional workload on me, but we want them to use Git for its fascinating issue tracker. However, it creates too much additional work for us, and Git is often too complicated for young research assistants (even with the GUIs).


  2. Because of the data, which may change, our repo is very large (>1.5 GB), and our internet access is sometimes restricted. Getting the repo onto a new computer is very time-consuming and often fails. Whenever the raw data changes, Git, which was not made for data exchange, keeps track of every change, but that history is useless for our purpose.


Can you suggest other software or approaches that preserve the integration we have achieved so far while letting us exchange data easily?



Answer



I've run into similar problems in collaboration with biologists, and found that a two-technology approach is best.



  • Large-scale experimental data is shared via one or more BitTorrent Sync folders. The experimentalists can just drop their files into the appropriate place in the folder, and the files get synchronized with the server and with everybody else's copy (like Dropbox, but private, free, and unlimited in size).

  • The analytical tool-chain, research products, write-ups, etc. are maintained in your preferred version control system (e.g., Git). This is then integrated with the experimental data just by giving the scripts a pointer to the appropriate data directories.


This has the advantage of maintaining the data separation that you need, keeping the scary version control software away from the experimentalists, and also avoiding juggling massive globs of data in a version control system that was never intended to support this.
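
As an illustration of the "pointer" idea, here is a minimal Python sketch of how the scripts in the version-controlled repo might locate the synchronized data folder on each collaborator's machine. Everything named here is an assumption made for the example, not part of any tool mentioned above: the environment variable PROJECT_DATA_DIR, the pointer file data_dir.txt (which would be listed in .gitignore so each machine keeps its own copy), and the function resolve_data_dir are all hypothetical.

```python
import os
from pathlib import Path

# Hypothetical names, chosen only for this sketch.
DATA_ENV_VAR = "PROJECT_DATA_DIR"   # environment variable holding the data path
POINTER_FILE = "data_dir.txt"       # untracked one-line file in the repo root


def resolve_data_dir(repo_root, env=None):
    """Return the local path of the synced data folder.

    Lookup order: the environment variable first, then the untracked
    pointer file in the repo root. The data itself never enters the repo.
    """
    env = os.environ if env is None else env
    candidate = env.get(DATA_ENV_VAR)

    if not candidate:
        pointer = Path(repo_root) / POINTER_FILE
        if pointer.is_file():
            candidate = pointer.read_text(encoding="utf-8").strip()

    if not candidate:
        raise RuntimeError(
            f"Set {DATA_ENV_VAR} or create {POINTER_FILE} in {repo_root}"
        )

    data_dir = Path(candidate).expanduser()
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Data directory not found: {data_dir}")
    return data_dir
```

A script in the repo would then call something like resolve_data_dir(".") once at startup and build every input path relative to the result, so the same code runs unchanged wherever the synced folder happens to live.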


