Data Storage and Versioning


Data stores


Backup vs. Archive


| | Backup | Archive |
|---|---|---|
| Storage type | Short-, mid-term | Long-term |
| Purpose | Disaster recovery | Long-term storage, compliance |
| Reason | Duplication | Migration |
| Usage | Work in progress | Cold, unused data |
| Changes | Short-term updates | No updates |
| Trend | Cyclic, replacement | Growing |
| Latency | Short / costly | High / cheaper |

3-2-1 backup rule
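
As commonly stated, the 3-2-1 rule asks for at least three copies of the data, on at least two different storage media, with at least one copy kept off-site. The following is a minimal sketch of that check; the `DataCopy` class, the `satisfies_3_2_1` function, and the example locations are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of a dataset (all fields are illustrative assumptions)."""
    location: str  # free-text label, e.g. "lab workstation"
    medium: str    # storage medium, e.g. "ssd", "hdd", "tape"
    offsite: bool  # True if stored at a different physical site


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule."""
    enough_copies = len(copies) >= 3                      # 3 copies
    enough_media = len({c.medium for c in copies}) >= 2   # 2 media types
    has_offsite = any(c.offsite for c in copies)          # 1 off-site
    return enough_copies and enough_media and has_offsite


plan = [
    DataCopy("lab workstation", "ssd", offsite=False),
    DataCopy("institute file server", "hdd", offsite=False),
    DataCopy("remote archive", "tape", offsite=True),
]
print(satisfies_3_2_1(plan))
```

Dropping the off-site copy from `plan` makes the check fail, which is the situation the rule is meant to catch.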


Version control and track changes

It’s good practice to document:

  • What was changed?
  • Who is responsible?
  • When did it happen?
  • Why the changes?
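
The four questions above can be captured as one structured record per change. This is only a sketch: the `ChangeRecord` class, its field names, and the example values are assumptions made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ChangeRecord:
    """One documented change: what, who, when, why."""
    what: str
    who: str
    when: datetime
    why: str

    def as_line(self):
        # One line per change, suitable for a plain-text change log.
        return f"{self.when:%Y-%m-%d %H:%M} | {self.who} | {self.what} | {self.why}"


# Hypothetical example entry.
record = ChangeRecord(
    what="Normalised photometer readings to blank",
    who="A. Researcher",
    when=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    why="Raw values included background absorbance",
)
print(record.as_line())
```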

Types of Version Control

  • by file name (_v1, _v2)
  • cloud services
    • Dropbox, iCloud, Google Drive
  • distributed version control system
    • e.g. Git
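
To make versioning by file name concrete, here is a small sketch that works out the next `_vN` name for a file. The function `next_version_name` and the example file names are hypothetical, and the sketch assumes the `_v1`, `_v2` suffix convention listed above.

```python
import re
from pathlib import Path

# Matches stems such as "manuscript_v2" and captures the base name and number.
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<num>\d+)$")


def next_version_name(filename, existing):
    """Return the next free "_vN" name for filename, given existing names."""
    path = Path(filename)
    match = VERSION_PATTERN.match(path.stem)
    base = match.group("stem") if match else path.stem

    highest = 0
    for name in existing:
        other = Path(name)
        m = VERSION_PATTERN.match(other.stem)
        # Only count files with the same base name and the same extension.
        if m and m.group("stem") == base and other.suffix == path.suffix:
            highest = max(highest, int(m.group("num")))
    return f"{base}_v{highest + 1}{path.suffix}"


existing = ["manuscript_v1.docx", "manuscript_v2.docx", "figures_v7.pptx"]
print(next_version_name("manuscript.docx", existing))
```

A naming scheme like this records the order of versions, but not who changed what or why; that information has to be kept elsewhere.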

Which files need to be "versioned"? 📝

  • paper manuscript (.docx)
  • single-cell RNASeq reads (.fastq.gz)
  • spreadsheet with photometer measurements (.xlsx)
  • calendar invitation (.ical)
  • photo of SDS-PAGE (.jpeg)
  • Excel workbook with calculations (.xlsx)
  • presentation for a conference (.pdf)
  • data analysis script (.py)

Concept of Git and git-based platforms


Cloud Services

✓ Documents
✓ Small data
✓ Presentations

X Code
X Data analytical projects
X Big (“raw”) data


Git and git platforms

∼ Documents
✓ Small data
∼ Presentations

✓✓ Code
✓✓ Data analytical projects
∼ Big (“raw”) data


Why git? ≈> Why code?

  • Save time
  • Avoid doing repetitive tasks “by hand”
  • Reuse scripts, analyses, pipelines
  • Reproduce results

A simple example: RNASeq project

Take snapshots of your code work…

(... as long as it works)

Scenario 1: More data

Let git track changes and keep things clean


Scenario 2: Pipeline breaks


Revert to snapshot
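
The idea of taking a snapshot while the pipeline works and going back to it after a breakage can be sketched with a toy in-memory store. This is an illustration of the concept only, not how Git is implemented; `SnapshotStore`, `commit`, `checkout`, and the file contents are hypothetical.

```python
import hashlib
import json


class SnapshotStore:
    """Toy snapshot history kept in memory (illustration only)."""

    def __init__(self):
        self._snapshots = {}  # snapshot id -> copy of the files
        self._order = []      # snapshot ids, oldest first

    def commit(self, files, message):
        """Store a copy of files and return a short id derived from the content."""
        parent = self._order[-1] if self._order else None
        payload = json.dumps(
            {"files": files, "message": message, "parent": parent},
            sort_keys=True,
        )
        snapshot_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:8]
        self._snapshots[snapshot_id] = dict(files)
        self._order.append(snapshot_id)
        return snapshot_id

    def checkout(self, snapshot_id):
        """Return the files exactly as they were at that snapshot."""
        return dict(self._snapshots[snapshot_id])


store = SnapshotStore()
working = store.commit({"pipeline.py": "print('align reads')"}, "Working pipeline")
# A later change breaks the script (missing closing parenthesis).
store.commit({"pipeline.py": "print('align reads'"}, "Tweak alignment step")

restored = store.checkout(working)
print(restored["pipeline.py"])
```

With Git itself, the corresponding step would typically be a command such as `git revert <commit>` or `git restore --source=<commit> <file>`, depending on whether the history or only the working file should change.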


Scenario 3: New project, same type of data and analysis


Re-use code


Re-use code – People have done this


Re-use code – Link and contribute


Git: summary

  • Version control system
  • A Git “repository” is a central data package (a directory)
  • Lets you track changes to any file in the repository:
    • What was changed
    • When was it changed
    • By whom was it changed
    • Why was it changed?
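
To illustrate what "what was changed" looks like for a text file, the sketch below compares two versions of a script with Python's standard `difflib` module. The script contents and file labels are made up for the example; Git produces a similar line-based view with `git diff`.

```python
import difflib

# Two hypothetical versions of the same analysis script, as lists of lines.
old_script = [
    "import pandas as pd",
    "counts = pd.read_csv('counts.csv')",
    "print(counts.head())",
]
new_script = [
    "import pandas as pd",
    "counts = pd.read_csv('counts.csv', index_col=0)",
    "print(counts.head())",
]

# unified_diff marks removed lines with "-" and added lines with "+".
diff = list(
    difflib.unified_diff(
        old_script,
        new_script,
        fromfile="analysis.py (before)",
        tofile="analysis.py (after)",
        lineterm="",
    )
)
print("\n".join(diff))
```

The "when", "by whom", and "why" are not part of the diff itself; in Git they are stored with each commit as the date, author, and commit message.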

GitHub and GitLab

  • A well-documented cloud environment
  • Active syncing: nothing is synced automatically
  • Version control is not automated
  • You control which changes to track and which to sync
  • A time machine for going back to older versions

GitHub and GitLab team projects

Simplifies concurrent work & merging changes

  • Online service to host our projects
  • Share code with other developers
  • Others can download our projects, work on them, and contribute to them
  • They can upload their changes and merge them with the main project

Cloud vs. Git



Contributors

Slides presented here include contributions by