Version control of data #8

Open
nheeren opened this issue Jun 14, 2018 · 4 comments
Comments

@nheeren
Member

nheeren commented Jun 14, 2018

In a larger project, we would like to create a database on GitHub. However, GitHub is meant to keep track of changes in text files, and we are using binary files (xlsx) for now. That means uploading new versions of the data files will add considerable overhead over time, and no meaningful version control is possible. I could see the final data being converted to CSV at some point, but so far this database is a moving target and we would like to keep using Excel files for now.

Can we add guidelines or recommendations in the wiki on how to do version control of IE datasets and databases? Any suggestions are very much welcome.
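
To illustrate the point about text versus binary formats, here is a minimal sketch of the kind of row-level change tracking that becomes possible once the data is stored as CSV. It uses only Python's standard `difflib`; the file names, column names, and values are made up for illustration and are not taken from the project.

```python
import difflib


def csv_diff(old_text: str, new_text: str) -> list:
    """Return the rows that differ between two CSV snapshots.

    Removed rows are prefixed with "-", added rows with "+".
    """
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="materials_v1.csv",  # hypothetical file names
        tofile="materials_v2.csv",
        lineterm="",
    )
    # Keep only added/removed rows; drop the "---"/"+++" file headers,
    # the "@@" hunk markers, and unchanged context rows.
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Two hypothetical snapshots of the same small table.
old = "material,year,stock_kt\nsteel,2015,120\ncopper,2015,8\n"
new = "material,year,stock_kt\nsteel,2015,125\ncopper,2015,8\n"

changes = csv_diff(old, new)
for line in changes:
    print(line)
```

This is essentially what a text-based version control system shows for a CSV file. For an xlsx file the same comparison is not meaningful, because the file is a compressed archive rather than line-oriented text.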

@tmillross

Hi @nheeren, whilst very useful, this would be a challenging request to satisfy!

> how to do version control of IE datasets and databases?

There are many methods for version controlling (VC) data. The format of the data being tracked and the type of changes you want to capture will affect which option is optimal for each use case.

You can hack around a bit to use GitHub for VC, but as you mention, it's not likely to be a good choice. If you're keen to use a spreadsheet, would a web-based one such as Google Sheets be an option? Its automatic version history may suit your needs.

If the data are in a 'proper' database running within a database management system (DBMS), then there are some very solid options available. For instance, I used to use Change Data Capture in SQL Server, which probably does what you're looking for. Or, slightly more old-school but fully open source: PostgreSQL triggers.
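
As a rough sketch of the trigger-based approach: the idea is that the database itself writes a row to an audit table whenever a tracked value changes. The example below uses SQLite through Python's standard `sqlite3` module instead of PostgreSQL, purely so that it runs without a server; PostgreSQL trigger syntax differs (it uses trigger functions). The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock (
    material TEXT PRIMARY KEY,
    value_kt REAL NOT NULL
);

-- Hypothetical audit table: one row per change to stock.value_kt.
CREATE TABLE stock_audit (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    material   TEXT NOT NULL,
    old_value  REAL,
    new_value  REAL,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Fires after each update of value_kt and records the old and new values.
CREATE TRIGGER stock_update_audit
AFTER UPDATE OF value_kt ON stock
BEGIN
    INSERT INTO stock_audit (material, old_value, new_value)
    VALUES (OLD.material, OLD.value_kt, NEW.value_kt);
END;
""")

# Insert a value, then change it; only the UPDATE is audited here.
conn.execute("INSERT INTO stock VALUES ('steel', 120.0)")
conn.execute("UPDATE stock SET value_kt = 125.0 WHERE material = 'steel'")
conn.commit()

history = conn.execute(
    "SELECT material, old_value, new_value FROM stock_audit ORDER BY id"
).fetchall()
print(history)
```

A fuller setup would probably also audit inserts and deletes and record who made the change, but the principle is the same: the history lives alongside the data and cannot be skipped by a forgetful contributor.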

Basically the tech-stack used by each group in the IE community will determine which body of technical knowledge they'll need to master to achieve reliable tracking of their data. If you're flexible and looking for recommendations on an appropriate tech stack to choose for an upcoming open-science project then that's another question!

@nheeren
Member Author

nheeren commented Jun 18, 2018

Thanks @tmillross!

Sorry, I should have been clearer about my objectives. The reason I opened this issue was to start a discussion and identify best practices for IE Open Science that could later become part of a guideline. The goal would be to keep the project database as reproducible as possible and very much in the IE Open Science spirit. So VC is only part of the requirements.

  • While Google Sheets may be VC-able, I don't believe they will ever be a good platform for Open Science data.
  • Running a database seems like the most practical solution to me, but it implies that a database structure is in place and maintained. I could imagine that we will find this is the way to go in the future and recommend that the society establish such database infrastructure.

@ricklupton

Have you seen https://datbase.org/?

Looks cool but it's pretty new -- I haven't used it for anything serious myself, but some people have, and there are examples on their blog.

@tmillross

Having a history of how files have changed is essential for effective collaboration and reproducibility. Git has been promoted as a solution for history, but it becomes slow with large files and a high learning curve. Git is designed for editing source code, while Dat is designed for sharing files. With a few simple commands, you can version files of any size. People can instantly get the latest files or download previous versions.
In sum, we've taken the best parts of Git, BitTorrent, and Dropbox to design Dat.

3 participants