Merge pull request #486 from nfdi4plants/reproducibility
Reproducibility
Brilator authored Oct 29, 2024
2 parents 346a445 + d7eccec commit 48c6eb1
Showing 2 changed files with 257 additions and 0 deletions.
117 changes: 117 additions & 0 deletions src/docs/fundamentals/CommonWorkflowLanguage.md
@@ -0,0 +1,117 @@
---
layout: docs
title: Reproduce and reuse
date: 2022-08-08
author:
- name: Dominik Brilhaus
  github: https://github.com/brilator
  orcid: https://orcid.org/0000-0001-9021-3197
add toc: true
add sidebar: _sidebars/mainSidebar.md
status: draft
---

> Note: This is just a first collection of thoughts.
> Could be partitioned into fundamentals/implementation/tutorial
## Fundamentals: (code / software) reproducibility


Reproducibility in science overall

wet lab | dry lab
--- | ---
company RNA extraction kit with all buffers and most of materials and tools| established / (commercial) software; somewhat contained, isolated, self-sustained
"manual" protocol where you buy and mix buffers together yourself | script or combinations of scripts (pipeline) with varying inputs (reference data sets) and tool dependencies (code interpreters, packages, functions)
version, batch or LOT number | software / package version
laboratory environment | operating system


In the wet lab, many more factors affect reproducibility, making it close to impossible to reproduce the exact same outcomes (results, datasets):
- biological variance
- hands-on factor (more hands, bigger variance)
- environment (humidity, temperature), but also standard devices (growth chamber, centrifuge)



- Reproducibility of computational analyses
  - a) you can "reproduce" that exact same output (run result) using the exact same inputs
  - b) you can apply the analysis to other data to produce analogous outputs that can be fed into other workflows (e.g. generate similar figures)

- How we usually (learn to) work with scripts
  - interactive, iterative
  - adapt script to specific needs
  - write (hard-code) inputs, outputs into script

- Problem
  - hand script to colleague
  - script not working due to missing (software) dependencies, changed (absolute) paths to environments / inputs / other dependencies (e.g. database resources)

- Example sources for scripts
  - workshop / summer school
  - colleagues
  - manual / tutorial to a tool (downloaded and adapted from GitHub)
  - copy/pasted from stack overflow

- Software dependencies
  - on multiple levels / in different shapes
    - operating system (Linux, Windows, Mac)
    - programming environment / interpreter (shell, python, r, julia, f#)
    - packages / libraries within the programming environment
    - version of one of above
  - (use of) virtual environments

- Towards solutions
  - containers
    - docker, singularity
  - workflow languages
    - CWL, snakemake, nextflow
    - environment-agnostic
    - formulate ins, outs, parameters
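
As a minimal sketch of what "formulate ins, outs, parameters" can mean for a single script, the following Python example declares its input, output and one parameter on the command line instead of hard-coding them. The task (filtering lines by length), the option names `--input`, `--output` and `--min-length`, and the function names are hypothetical placeholders chosen for illustration only.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare inputs, outputs and parameters instead of hard-coding them."""
    parser = argparse.ArgumentParser(description="Filter lines by length (illustrative only).")
    parser.add_argument("--input", type=Path, required=True, help="input text file")
    parser.add_argument("--output", type=Path, required=True, help="output text file")
    parser.add_argument("--min-length", type=int, default=1,
                        help="keep lines with at least this many characters")
    return parser


def filter_lines(input_path: Path, output_path: Path, min_length: int) -> int:
    """Copy lines with at least min_length characters; return how many were kept."""
    lines = input_path.read_text(encoding="utf-8").splitlines()
    kept = [line for line in lines if len(line) >= min_length]
    output_path.write_text("".join(line + "\n" for line in kept), encoding="utf-8")
    return len(kept)


def main(argv=None) -> int:
    """Parse the command line (or an explicit argument list) and run the filter."""
    args = build_parser().parse_args(argv)
    return filter_lines(args.input, args.output, args.min_length)
```

Saved as a script that calls `main()`, this could be run as `python filter_lines.py --input in.txt --output out.txt --min-length 3`. Because no absolute path lives inside the script, a colleague can run it on their own files without editing the code.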




## Implementation: Make your ARC reproducible / executable with CWL

1. Add workflows / scripts to `workflows`.
2. Make workflows CWL-executable by adding a `.cwl` file (parallel to the workflow / in the same `workflows` subdirectory) that
   - describes the expected inputs, outputs, and parameters
3. Execute the workflow
   1. "directly", passing the parameters via the CLI

      ```bash
      # option names follow the input ids declared in my_workflow.cwl (assumed here: p1 and p2)
      cwltool my_workflow.cwl --p1 parameter1 --p2 parameter2
      ```

   2. referencing a YAML file that collects the required parameters

      ```bash
      cwltool my_workflow.cwl my_workflow_parameters.yml
      ```

- use of paths / working directories
- runs folder
- Workflow metadata: my_workflow_parameters.yml
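
To make the two files mentioned above more concrete, here is a hedged Python sketch that writes a minimal tool description and a matching parameter file. Everything in it is an assumption for illustration: the tool simply runs `echo`, the input ids `p1` and `p2`, the output id `result`, and the function name `write_cwl_files` are hypothetical. The files are written as JSON, which is also valid YAML, so no YAML library is needed. Treat the result as a starting point and check it with `cwltool --validate` before relying on it.

```python
import json
from pathlib import Path
from typing import Tuple


def write_cwl_files(workflows_dir: Path) -> Tuple[Path, Path]:
    """Write a minimal tool description and a matching parameter (job) file."""
    # Hypothetical tool: runs `echo <p1> <p2>` and captures stdout in result.txt.
    tool = {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": {
            "p1": {"type": "string", "inputBinding": {"position": 1}},
            "p2": {"type": "string", "inputBinding": {"position": 2}},
        },
        "outputs": {"result": {"type": "stdout"}},
        "stdout": "result.txt",
    }
    # The keys of the job file must match the input ids declared above.
    job = {"p1": "parameter1", "p2": "parameter2"}

    workflows_dir.mkdir(parents=True, exist_ok=True)
    tool_path = workflows_dir / "my_workflow.cwl"
    job_path = workflows_dir / "my_workflow_parameters.yml"
    tool_path.write_text(json.dumps(tool, indent=2) + "\n", encoding="utf-8")
    job_path.write_text(json.dumps(job, indent=2) + "\n", encoding="utf-8")
    return tool_path, job_path
```

After calling `write_cwl_files(Path("workflows"))`, the two files can be passed to `cwltool` as a description plus a parameter file. Keeping the parameter file next to the `.cwl` file means the values used for a run are recorded alongside the workflow rather than only in the shell history.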

## Tutorial: CWL Generator quickstart

### Install

[CWL Generator](https://github.com/nfdi4plants/CWLGenerator)

### Dependencies

- Node.js (required for CWL Generator)
- cwltool / cwl-runner
- Docker (?)

### Recommendations

- VS Code extension [CWL (Rabix/Benten)](https://marketplace.visualstudio.com/items?itemName=sbg-rabix.benten-cwl)


### Note / Typical errors

- (re)moved a required input or output
- cwltool can neither resolve "~" nor $HOME ?!
- let recurrent variables (script name, outfolder, etc.) come first
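
If `~` or `$HOME` in a parameter file do reach cwltool unexpanded, as the note above suspects, one possible workaround is to expand them before the file is written. The sketch below assumes the parameters are held as a Python dict in which files and directories use the CWL form `{"class": "File", "path": ...}`; the function names `expand_path` and `expand_job_paths` are hypothetical.

```python
import os


def expand_path(value: str) -> str:
    """Expand "~" and environment variables such as $HOME into an absolute path."""
    return os.path.abspath(os.path.expanduser(os.path.expandvars(value)))


def expand_job_paths(job: dict) -> dict:
    """Return a copy of a job dict with every top-level File/Directory path expanded."""
    expanded = {}
    for key, value in job.items():
        is_path_object = (
            isinstance(value, dict)
            and value.get("class") in ("File", "Directory")
            and "path" in value
        )
        if is_path_object:
            # Copy the entry so the caller's dict is left untouched.
            expanded[key] = {**value, "path": expand_path(value["path"])}
        else:
            expanded[key] = value
    return expanded
```

Only top-level entries are handled here; nested arrays or records of files would need a recursive variant.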
140 changes: 140 additions & 0 deletions src/docs/fundamentals/ReproduceReuse.md
@@ -0,0 +1,140 @@
---
layout: docs
title: Reproduce and reuse
date: 2022-09-23
author:
- name: Dominik Brilhaus
  github: https://github.com/brilator
  orcid: https://orcid.org/0000-0001-9021-3197
add toc: true
add sidebar: _sidebars/mainSidebar.md
status: draft
---

> This article is work-in-progress.

Key aspects of the [FAIR principles][kb-FairDataPrinciples] and drivers for the development of good [RDM][kb-ResearchDataManagement] are *reproducibility* and *re-usability* (FAI**R**) of scientific outputs as well as of the workflows leading to these outputs. Although we focus here on data and the "computational side", we would like to emphasize some analogies between **<span style="color:#FFC000">Data</span>** science and **<span style="color:#B4CE82">PLANT</span>** science, especially as some requirements in both environments can, at least in part, be met with similar approaches.

Consider our PhD student Viola (see [metadata][kb-Metadata]). In the <span style="color:#B4CE82">wet lab</span>, she extracts RNA from her plant samples using a ready-to-use <span style="color:#B4CE82">commercial extraction kit</span> with all buffers and some required materials and tools included. Similarly, in the <span style="color:#FFC000">dry lab</span>, she would use <span style="color:#FFC000">established, commercial office software</span> that is mostly contained/isolated for small spreadsheet calculations. There is no commercial kit available that is suitable for extracting metabolites from the special plant species Viola is interested in. So she uses a <span style="color:#B4CE82">"manual" protocol</span> established in her lab, for which she orders and prepares buffers and solutions herself and gathers the required devices, tubes and materials. Once she receives her RNA-Seq data, she sets up her own <span style="color:#FFC000">combinations of scripts (pipeline)</span> with varying inputs (reference data sets) and tool dependencies (code interpreters, packages, functions). In the end, Viola's complete workspace, be it the <span style="color:#B4CE82">laboratory environment</span> or her computer's <span style="color:#FFC000">operating system</span>, comes with its specific setup, tools, resources and limitations, and her research routine would likely differ if she were to pursue it in a different lab or on another computer.

For both types of workflows, there are (clearly) defined inputs and outputs, e.g. the <span style="color:#B4CE82">state of the</span> or the <span style="color:#FFC000">data format</span>. Viola makes sure to document as much metadata as possible to make her workflows reproducible, including e.g. <span style="color:#B4CE82">version, batch or LOT numbers</span> of a kit or chemical and the <span style="color:#FFC000">versions of software and packages</span>. Troubleshooting with a colleague, company or data steward, or seeking help in online forums, is also always easier if you share information about your setup.
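
On the computational side, part of this documentation can be collected automatically. The following Python sketch records the operating system, the interpreter version and the versions of selected packages using only the standard library. The function name `environment_snapshot`, the dictionary keys and the example package `pip` are illustrative assumptions, not a DataPLANT convention.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Collect OS, interpreter and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "operating_system": platform.platform(),
        "python_version": platform.python_version(),
        "executable": sys.executable,
        "packages": versions,
    }


if __name__ == "__main__":
    # "pip" is only a stand-in; list the packages your analysis imports.
    print(json.dumps(environment_snapshot(["pip"]), indent=2))
```

Storing the printed snapshot next to the results of a run gives a colleague or data steward the same kind of information as the LOT number on a kit.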


<!-- ## Re-inventing the wheel -->
## On the shoulders of giants


"In real life" <!-- (in the living world, in biology) --> you can take a sample once and only once. You can take replicate samples &ndash; technical (same plant, different leaf) or biological (different plant) &ndash; but in the end each of these is a new and different sample. In the wet lab, many more factors affect reproducibility, making it close to impossible to reproduce the exact same outcome (results, datasets). These include biological variance, hands-on factors (more hands, bigger variance), the environment (humidity, temperature), but also deviations in standard devices (growth chamber, centrifuge).
- Still, for other researchers to be able to re-use (i.e. build on) your findings, it will be helpful to document metadata...


1. re-use an outcome (data or sample)
2. reproduce an outcome (peer-review)
3. re-use a workflow (lab protocol or analysis)



- Reproducibility of computational analyses
  - a) you can "reproduce" that exact same output (run result) using the exact same inputs
  - b) you can apply the analysis to other data to produce analogous outputs that can be fed into other workflows (e.g. generate similar figures)

- How we usually (learn to) work with scripts
  - interactive, iterative
  - adapt script to specific needs
  - write (hard-code) inputs, outputs into script

- Problem
  - hand script to colleague
  - script not working due to missing (software) dependencies, changed (absolute) paths to environments / inputs / other dependencies (e.g. database resources)

- Example sources for scripts
  - workshop / summer school
  - colleagues
  - manual / tutorial to a tool (downloaded and adapted from GitHub)
  - copy/pasted from stack overflow

- Software dependencies
  - on multiple levels / in different shapes
    - operating system (Linux, Windows, Mac)
    - programming environment / interpreter (shell, python, r, julia, f#)
    - packages / libraries within the programming environment
    - version of one of above
  - (use of) virtual environments

- Towards solutions
  - containers
    - docker, singularity
  - workflow languages
    - CWL, snakemake, nextflow
    - environment-agnostic
    - formulate ins, outs, parameters
  - workflow management systems
    - galaxy
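
Reproducibility in sense a) above, the exact same output from the exact same inputs, can be checked mechanically by comparing checksums of two run folders. The sketch below is one way to do that with the Python standard library; the function names and the idea of one folder per run are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_runs(run_a: Path, run_b: Path) -> dict:
    """Map each relative file path to True if it exists in both runs with identical bytes."""
    files_a = {p.relative_to(run_a) for p in run_a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(run_b) for p in run_b.rglob("*") if p.is_file()}
    result = {}
    for rel in sorted(files_a | files_b):
        in_both = rel in files_a and rel in files_b
        result[rel.as_posix()] = in_both and sha256_of(run_a / rel) == sha256_of(run_b / rel)
    return result
```

A byte-for-byte comparison is deliberately strict: outputs that embed timestamps or random seeds will be reported as different even when the analysis itself is unchanged, so such files may need a more tolerant comparison.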





<!-- Links to DataPLANT knowledge base (kb-) -->

<!-- kb-Fundamentals -->

[kb-DataManagementPlan]: ../fundamentals/DataManagementPlan.html "Data Management Plan"
[kb-DataPublications]: ../fundamentals/DataPublications.html "Data Publication"
[kb-DataSharing]: ../fundamentals/DataSharing.html "Data Sharing"
[kb-FairDataPrinciples]: ../fundamentals/FairDataPrinciples.html "FAIR Data principles"
[kb-Metadata]: ../fundamentals/Metadata.html "Metadata"
[kb-PersistentIdentifiers]: ../fundamentals/PersistentIdentifiers.html "Persistent Identifiers"
[kb-PublicDataRepositories]: ../fundamentals/PublicDataRepositories.html "Repositories"
[kb-ResearchDataManagement]: ../fundamentals/ResearchDataManagement.html "Research Data Management"
[kb-VersionControlGit]: ../fundamentals/VersionControlGit.html "Version Control and Git"

<!-- kb-Implementation -->
[kb-AnnotatedResearchContext]: ../implementation/AnnotatedResearchContext.html "Annotated Research Context"
[kb-DataHub]: ../implementation/DataHub.html "DataPLANT DataHUB"
[kb-ArcCommander]: ../implementation/ArcCommander.html "DataPLANT ARC Commander"
[kb-Swate]: ../implementation/Swate.html "DataPLANT Swate"

<!-- kb-Tutorials -->
[kb-QuickStart_arc]: ../tutorials/QuickStart_arc.html "Quickstart ARC"
[kb-QuickStart_swate]: ../tutorials/QuickStart_swate.html "Quickstart Swate"
[kb-QuickStart_arcCommander]: ../tutorials/QuickStart_arcCommander.html "QuickStart ARC Commander"

<!-- Links to DataPLANT Homepage (hp-) -->

[hp-Registration]: <https://register.nfdi4plants.org/registration> "DataPLANT Registration"
[hp-DataHUB]: <https://git.nfdi4plants.org> "DataPLANT DataHUB"
[hp-HelpDesk]: <https://helpdesk.nfdi4plants.org> "DataPLANT Help Desk"

<!-- Links to DataPLANT GitHub (gh-) -->

[gh-DataPlant]: <https://github.com/nfdi4plants/> "GitHub DataPLANT"
[gh-ArcSpecs]: <https://github.com/nfdi4plants/ARC-specification/> "ARC specifications"
[gh-ArcCommander]: <https://github.com/nfdi4plants/arcCommander/> "ArcCommander"
[gh-ArcCommander-Wiki]: <https://github.com/nfdi4plants/arcCommander/wiki> "ArcCommander Wiki"
[gh-Swate]: <https://github.com/nfdi4plants/Swate/wiki> "Swate Wiki"

<!-- Links to external (ext-) sources -->

[ext-github-join]: <https://github.com/join/> "Join GitHub"
[ext-github-desktop]: <https://desktop.github.com/> "GitHub Desktop"
[ext-git]: <https://git-scm.com/download/> "Git"
[ext-git-lfs]: <https://git-lfs.github.com/> "Git-LFS"
[ext-excel-online]: <https://www.microsoft.com/en-us/microsoft-365/excel> "Excel online"

[ext-VSCode]: https://code.visualstudio.com/ "Visual Studio Code"

[ext-galaxy]: <https://plants.usegalaxy.eu/> "Galaxy Plants"
[ext-omero]: <https://www.openmicroscopy.org/omero/> "Omero"
[ext-zenodo]: <https://zenodo.org/> "Zenodo"
[ext-invenio]: <https://inveniosoftware.org/products/rdm/> "Invenio"
[ext-DataJournals]: https://www.researchdata.uni-jena.de/en/information/data-publication "RDM Jena Data Journals"

[ext-EBI-PRIDE]: https://www.ebi.ac.uk/pride/ "EBI PRIDE"
[ext-re3data]: https://www.re3data.org/ "re3data.org"
[ext-CreativeCommons]: https://creativecommons.org/ "Creative Commons"
[ext-DublinCore]: <https://www.dublincore.org/specifications/dublin-core/dcmi-terms/> "DublinCore"
[ext-DataCite]: <https://schema.datacite.org> "DataCite"
[fairsharing.org]: https://fairsharing.org/search?fairsharingRegistry=Standard "Standards at fairsharing.org"
[doi]: https://www.doi.org/ "Digital Object Identifier"
[orcid]: https://www.orcid.org/ "ORCID"
