Currently, my coauthors and I use GitHub to collaborate on coding and writing, but also for data exchange. We have a lot of data, often not in text format (e.g. PDF). Most of it is collected by research assistants, with whom we don't share our Git repo.
More specifically, we use Python, shell, R, Stata, and LaTeX, and most of it is fully integrated. That is, Python and shell scripts generate the data that is used by R and Stata, whose output is compiled directly in LaTeX.
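The chain described above could be sketched as a minimal driver script like the following; all file names (`build_dataset.py`, `make_tables.R`, `paper.tex`) are hypothetical, not taken from the question. `DRY_RUN` defaults to 1 so the sketch only prints the steps; set `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
# Hedged sketch of a fully integrated pipeline: each stage consumes the
# previous stage's output. Every file name here is made up for illustration.
set -e

# Wrapper that either prints or executes a step, depending on DRY_RUN.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

run python build_dataset.py   # Python/shell scripts generate the data sets
run Rscript make_tables.R     # R (or Stata) writes tables/figures as .tex
run pdflatex paper.tex        # LaTeX compiles the output into the paper
```

A single entry point like this is what makes the workflow "fully integrated": rerunning one command rebuilds everything from raw data to PDF.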
We don't want to deviate from this high level of automation, but our approach has two major shortcomings:
- Research assistants cannot and may not upload their data directly into our repo. Instead, they send us their data and files via another Git repo. This puts additional workload on me, even though we want them to use Git for its fascinating issue tracker. However, it creates too much extra work for us, and Git is often too complicated for young research assistants (even the GUIs).
- Because of the data, which changes from time to time, our repo is very large (>1.5 GB), and internet access is sometimes restricted. Getting the repo onto a new computer is very time-consuming and often fails. Although the raw data changes only occasionally, Git, which was not made for data exchange, keeps track of all these changes, which is useless for our purpose.
Can you suggest other software or approaches that preserve the integration we have achieved so far while letting us exchange data easily?
Answer
I've run into similar problems in collaboration with biologists, and found that a two-technology approach is best.
- Large-scale experimental data is shared via one or more BitTorrent Sync folders. The experimentalists can just drop their files into the appropriate place in the folder, and they get synchronized with the server and everybody else's copy (like Dropbox, but private, free, and without a size limit).
- The analytical tool-chain, research products, write-ups, etc. are maintained in your preferred version control system (e.g., git). This is then integrated with the experimental data simply by pointing to the appropriate data directories.
This has the advantage of maintaining the data separation that you need, keeping the scary version control software away from the experimentalists, and also avoiding juggling massive globs of data in a version control system that was never intended to support this.
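The "pointer" idea might look like the following sketch: the repo never contains the data itself, only a reference to the synced folder. `DATA_DIR` and all paths below are hypothetical names, not from the answer above.

```shell
#!/bin/sh
# Sketch of keeping data outside the repo: scripts resolve one environment
# variable instead of a path inside the repository. All names are made up.
: "${DATA_DIR:=$HOME/BTSync/project-data}"   # default location of the Sync folder

# Every script in the repo then builds its paths relative to $DATA_DIR:
RAW_FILE="$DATA_DIR/raw/survey.csv"
echo "Scripts read raw data from: $RAW_FILE"
```

Adding the Sync folder (or a symlink to it inside the repo) to `.gitignore` keeps Git from ever tracking the data, so the repo stays small while the tool-chain keeps its one-command automation.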