Merge pull request #486 from nfdi4plants/reproducibility

Reproducibility
nfdi4plants · Oct 29, 2024 · 48c6eb1
2 parents 346a445 + d7eccec
commit 48c6eb1
Showing 2 changed files with 257 additions and 0 deletions.
diff --git a/src/docs/fundamentals/CommonWorkflowLanguage.md b/src/docs/fundamentals/CommonWorkflowLanguage.md
@@ -0,0 +1,117 @@
+---
+layout: docs
+title: Reproduce and reuse
+date: 2022-08-08
+author:
+- name: Dominik Brilhaus
+  github: https://github.com/brilator
+  orcid: https://orcid.org/0000-0001-9021-3197
+add toc: true
+add sidebar: _sidebars/mainSidebar.md
+status: draft
+---
+
+> Note: This is just a first collection of thoughts.
+> Could be partitioned into fundamentals/implementation/tutorial
+
+## Fundamentals: (code / software) reproducibility
+
+
+Reproducibility in science overall
+
+wet lab | dry lab
+--- | ---
+commercial RNA extraction kit with all buffers and most of the materials and tools | established (commercial) software; largely contained, isolated, self-sustained
+"manual" protocol where you buy and mix buffers together yourself | script or combinations of scripts (pipeline) with varying inputs (reference data sets) and tool dependencies (code interpreters, packages, functions)
+version, batch or LOT number | software / package version
+laboratory environment | operating system 
+
+
+In the wet lab, many more factors affect reproducibility, making it close to impossible to reproduce the exact same outcomes (results, datasets):
+- biological variance
+- hands-on factor (more hands, bigger variance)
+- environment (humidity, temperature), but also standard devices (growth chamber, centrifuge)
+
+
+
+- Reproducibility of computational analyses
+  - a) you can "reproduce" that exact same output (run result) using the exact same inputs
+  - b) you can apply the analysis onto other data to produce analogous outputs, that can be fed into other workflows (e.g. generate similar figures)
+
+- How we usually (learn to) work with scripts
+  - interactive, iterative
+  - adapt script to specific needs
+  - write (hard-code) inputs, outputs into script
+
+- Problem
+  - hand script to colleague
+  - script not working due to missing (software) dependencies, changed (absolute) paths to environments / inputs / other dependencies (e.g. database resources)
+
+- Example sources for scripts
+  - workshop / summer school
+  - colleagues
+  - manual / tutorial to a tool (downloaded and adapted from GitHub)
+  - copy/pasted from Stack Overflow
+
+- Software dependencies
+  - on multiple levels / in different shapes
+  - operating system (Linux, Windows, macOS)
+  - programming environment / interpreter (shell, Python, R, Julia, F#)
+  - packages / libraries within the programming environment
+  - version of one of above
+  - (use of) virtual environments
+
+- Towards solutions
+  - containers
+    - Docker, Singularity
+  - workflow languages
+    - CWL, Snakemake, Nextflow
+    - environment-agnostic
+    - formulate ins, outs, parameters
+
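The points above about hard-coded inputs and outputs can be illustrated with a minimal shell sketch. The function name `count_lines`, the file names, and the use of `wc -l` as the analysis step are assumptions made for illustration only. The idea is simply that input and output locations are passed in as arguments rather than written into the script.

```bash
# Hypothetical analysis step: count the lines of an input file.
# Input file and output folder are arguments, not hard-coded paths.
count_lines() {
    input_file="$1"   # path to the input data
    out_dir="$2"      # folder that receives the result

    if [ ! -f "$input_file" ]; then
        echo "Input file not found: $input_file" >&2
        return 1
    fi

    mkdir -p "$out_dir"
    # Redirecting the file into wc keeps the file name out of the output;
    # tr strips the padding that some wc implementations add.
    wc -l < "$input_file" | tr -d ' ' > "$out_dir/line_count.txt"
}

# Demo on throw-away data in a temporary folder.
demo_dir="$(mktemp -d)"
printf 'a\nb\nc\n' > "$demo_dir/input.txt"
count_lines "$demo_dir/input.txt" "$demo_dir/out"
cat "$demo_dir/out/line_count.txt"
```

A colleague can call the same function with their own paths without editing the script, which is also the property a workflow language relies on when it formulates inputs, outputs, and parameters.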
+
+
+
+## Implementation: Make your ARC reproducible / executable with CWL
+
+1. add workflows / scripts to `workflows`
+2. Make workflows CWL-executable by adding a `.cwl` file next to the workflow (in the same `workflows` subdirectory) that
+   - describes the expected inputs, outputs, and parameters
+3. Execute the workflow
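+   1. "directly", passing the parameters via CLI options named after the inputs declared in the `.cwl` file
+
+    ```bash
+    # parameter1 and parameter2 stand for input ids declared in my_workflow.cwl
+    cwltool my_workflow.cwl --parameter1 value1 --parameter2 value2
+    ```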
+
+   2. referencing a YAML file that collects the required parameters
+    ```bash
+    cwltool my_workflow.cwl my_workflow_parameters.yml
+    ```
+
+- use of paths / working directories
+- runs folder
+- Workflow metadata: my_workflow_parameters.yml
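As a rough illustration of steps 2 and 3, the sketch below writes a minimal CWL description and a matching parameter file. The tool wrapped here (`wc -l`), the input name `input_file`, and the path `data/input.txt` are assumptions for illustration; a real workflow would describe its own script, inputs, and outputs.

```bash
# Work in a scratch folder so nothing in an existing ARC is overwritten.
workflow_dir="$(mktemp -d)"

# Minimal CWL description: one input file, one output captured from stdout.
cat > "$workflow_dir/my_workflow.cwl" <<'EOF'
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: line_count.txt
EOF

# Parameter file: assigns a concrete value to every input declared above.
cat > "$workflow_dir/my_workflow_parameters.yml" <<'EOF'
input_file:
  class: File
  path: data/input.txt
EOF

# With cwltool installed and data/input.txt in place, the description could
# be checked and run as follows (commented out, since both are assumptions):
# cwltool --validate "$workflow_dir/my_workflow.cwl"
# cwltool "$workflow_dir/my_workflow.cwl" "$workflow_dir/my_workflow_parameters.yml"
ls "$workflow_dir"
```

As far as we are aware, relative paths in the parameter file are resolved relative to the parameter file itself, which is worth keeping in mind when deciding where the parameter file and the runs folder live.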
+
+## Tutorial: CWL Generator quickstart
+
+### Install
+
+[gh-CWLgenerator](https://github.com/nfdi4plants/CWLGenerator)
+
+### Dependencies
+
+- Node.js (required for CWL Generator)
+- cwltool / cwl-runner
+- Docker (?)
+
+### Recommendations
+
+- VS code extension [CWL (Rabix/Benten)](https://marketplace.visualstudio.com/items?itemName=sbg-rabix.benten-cwl)
+
+
+### Note / Typical errors
+
+- (re)moved a required input or output
+- `cwltool` seems to resolve neither `~` nor `$HOME` in paths (to be confirmed)
+- let recurrent variables (script name, outfolder, etc.) come first
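One possible explanation for the `~` / `$HOME` issue noted above is that both are expanded by the shell, so they stay literal text when they appear inside a YAML parameter file. A minimal workaround sketch, assuming an input named `input_file` and a hypothetical file location under the home directory, is to let the shell build the absolute path while writing the parameter file.

```bash
# Hypothetical location of an input file, relative to the home directory.
input_in_home="my_arc/assays/input.txt"

# The shell expands $HOME here, before anything is written to disk.
abs_input="$HOME/$input_in_home"

# Unquoted heredoc delimiter: $abs_input is replaced by its value, so the
# parameter file ends up with a plain absolute path.
params_file="$(mktemp)"
cat > "$params_file" <<EOF
input_file:
  class: File
  path: $abs_input
EOF

cat "$params_file"
```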
diff --git a/src/docs/fundamentals/ReproduceReuse.md b/src/docs/fundamentals/ReproduceReuse.md
@@ -0,0 +1,140 @@
+---
+layout: docs
+title: Reproduce and reuse
+date: 2022-09-23
+author:
+- name: Dominik Brilhaus
+  github: https://github.com/brilator
+  orcid: https://orcid.org/0000-0001-9021-3197
+add toc: true
+add sidebar: _sidebars/mainSidebar.md
+status: draft
+---
+
+> This article is work-in-progress.
+
+Key aspects of the [FAIR principles][kb-FairDataPrinciples] and drivers for the development of good [RDM][kb-ResearchDataManagement] are *reproducibility* and *re-usability* (FAI**R**) of scientific outputs as well as of the workflows leading to these outputs. Although we focus here on data and the "computational side", we would like to emphasize some analogies between **<span style="color:#FFC000">Data</span>** science and **<span style="color:#B4CE82">PLANT</span>** science, especially as some requirements in both environments can, at least in part, be met with similar approaches.
+
+Consider our PhD student Viola (see [metadata][kb-Metadata]). In the <span style="color:#B4CE82">wet lab</span>, she extracts RNA from her plant samples using a ready-to-use <span style="color:#B4CE82">commercial extraction kit</span> with all buffers and some of the required materials and tools included. Similarly, in the <span style="color:#FFC000">dry lab</span> she would use <span style="color:#FFC000">established, commercial office software</span> that is mostly contained and isolated for small spreadsheet calculations. There is no commercial kit for extracting metabolites that is suitable for the special plant species Viola is interested in. So she uses a <span style="color:#B4CE82">"manual" protocol</span> established in her lab, for which she orders and prepares buffers and solutions herself and gathers the required devices, tubes and materials. Once she receives her RNA-Seq data, she sets up her own <span style="color:#FFC000">combinations of scripts (pipeline)</span> with varying inputs (reference data sets) and tool dependencies (code interpreters, packages, functions). In the end, Viola's complete workspace, be it the <span style="color:#B4CE82">laboratory environment</span> or her computer's <span style="color:#FFC000">operating system</span>, comes with its specific setup, tools, resources and limitations, and her research routine would likely differ if she were to pursue it in a different lab or on another computer.
+
+For both types of workflows, there are (clearly) defined inputs and outputs, e.g. the <span style="color:#B4CE82">state of the</span> or the <span style="color:#FFC000">data format</span>. Viola makes sure to document as much metadata as possible to make her workflows reproducible, including e.g. <span style="color:#B4CE82">version, batch or LOT numbers</span> of a kit or chemical and the <span style="color:#FFC000">versions of software and packages</span>. Troubleshooting with a colleague, a company or a data steward, or seeking help in online forums, is also always easier if you share information about your setup.
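Recording software versions, as Viola does, can be partly automated. The following is a minimal sketch, assuming that each tool of interest prints its version when called with `--version` (not every tool does); the function name `record_versions` and the tool list are hypothetical.

```bash
# Write the operating system and one line per tool (name, tab, version)
# to a plain-text file that can be stored next to the analysis results.
record_versions() {
    out_file="$1"
    shift

    # Kernel name and release as a rough description of the operating system.
    printf 'os\t%s\n' "$(uname -sr)" > "$out_file"

    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            # Some tools print their version to stderr, hence 2>&1.
            version="$("$tool" --version 2>&1 | head -n 1)"
        else
            version="not installed"
        fi
        printf '%s\t%s\n' "$tool" "$version" >> "$out_file"
    done
}

# Demo: the tools listed here are examples, not a recommendation.
versions_file="$(mktemp)"
record_versions "$versions_file" python3 Rscript
cat "$versions_file"
```

Tools that are missing on the current machine are listed as "not installed", which already tells a colleague which dependencies to look for first.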
+
+
+<!-- ## Re-inventing the wheel -->
+## On the shoulders of giants
+
+
+"In real life" <!-- (in the living world, in biology) --> you can take a sample once and only once. You can take replicate samples &ndash; technical (same plant, different leaf) or biological (different plant) &ndash; but in the end each of them is a new and different sample. In the wet lab, many more factors affect reproducibility, making it close to impossible to reproduce the exact same outcome (results, datasets). These include biological variance, hands-on factors (more hands, bigger variance), the environment (humidity, temperature), but also deviations in standard devices (growth chamber, centrifuge).
+  - Still, for other researchers to be able to re-use (i.e. build on) your findings, it will be helpful to document, metadata...
+
+
+1. re-use an outcome (data or sample)
+2. reproduce an outcome (peer-review)
+3. re-use a workflow (lab protocol or analysis)
+
+
+
+- Reproducibility of computational analyses
+  - a) you can "reproduce" that exact same output (run result) using the exact same inputs
+  - b) you can apply the analysis onto other data to produce analogous outputs, that can be fed into other workflows (e.g. generate similar figures)
+
+- How we usually (learn to) work with scripts
+  - interactive, iterative
+  - adapt script to specific needs
+  - write (hard-code) inputs, outputs into script
+
+- Problem
+  - hand script to colleague
+  - script not working due to missing (software) dependencies, changed (absolute) paths to environments / inputs / other dependencies (e.g. database resources)
+
+- Example sources for scripts
+  - workshop / summer school
+  - colleagues
+  - manual / tutorial to a tool (downloaded and adapted from GitHub)
+  - copy/pasted from Stack Overflow
+
+- Software dependencies
+  - on multiple levels / in different shapes
+  - operating system (Linux, Windows, macOS)
+  - programming environment / interpreter (shell, Python, R, Julia, F#)
+  - packages / libraries within the programming environment
+  - version of one of above
+  - (use of) virtual environments
+
+- Towards solutions
+  - containers
+    - Docker, Singularity
+  - workflow languages
+    - CWL, Snakemake, Nextflow
+    - environment-agnostic
+    - formulate ins, outs, parameters
+  - workflow management systems
+    - Galaxy
+
+
+
+
+
+<!-- Links to DataPLANT knowledge base (kb-) -->
+
+<!-- kb-Fundamentals -->
+
+[kb-DataManagementPlan]: ../fundamentals/DataManagementPlan.html "Data Management Plan"
+[kb-DataPublications]: ../fundamentals/DataPublications.html "Data Publication"
+[kb-DataSharing]: ../fundamentals/DataSharing.html "Data Sharing"
+[kb-FairDataPrinciples]: ../fundamentals/FairDataPrinciples.html "FAIR Data principles"
+[kb-Metadata]: ../fundamentals/Metadata.html "Metadata"
+[kb-PersistentIdentifiers]: ../fundamentals/PersistentIdentifiers.html "Persistent Identifiers"
+[kb-PublicDataRepositories]: ../fundamentals/PublicDataRepositories.html "Repositories"
+[kb-ResearchDataManagement]: ../fundamentals/ResearchDataManagement.html "Research Data Management"
+[kb-VersionControlGit]: ../fundamentals/VersionControlGit.html "Version Control and Git"
+
+<!-- kb-Implementation -->
+[kb-AnnotatedResearchContext]: ../implementation/AnnotatedResearchContext.html "Annotated Research Context"
+[kb-DataHub]: ../implementation/DataHub.html "DataPLANT DataHUB"
+[kb-ArcCommander]: ../implementation/ArcCommander.html "DataPLANT ARC Commander"
+[kb-Swate]: ../implementation/Swate.html "DataPLANT Swate"
+
+<!-- kb-Tutorials -->
+[kb-QuickStart_arc]: ../tutorials/QuickStart_arc.html "Quickstart ARC"
+[kb-QuickStart_swate]: ../tutorials/QuickStart_swate.html "Quickstart Swate"
+[kb-QuickStart_arcCommander]: ../tutorials/QuickStart_arcCommander.html "QuickStart ARC Commander"
+
+<!-- Links to DataPLANT Homepage (hp-) -->
+
+[hp-Registration]: <https://register.nfdi4plants.org/registration> "DataPLANT Registration"
+[hp-DataHUB]: <https://git.nfdi4plants.org> "DataPLANT DataHUB"
+[hp-HelpDesk]: <https://helpdesk.nfdi4plants.org> "DataPLANT Help Desk"
+
+<!-- Links to DataPLANT GitHub (gh-) -->
+
+[gh-DataPlant]: <https://github.com/nfdi4plants/> "GitHub DataPLANT"
+[gh-ArcSpecs]: <https://github.com/nfdi4plants/ARC-specification/> "ARC specifications"
+[gh-ArcCommander]: <https://github.com/nfdi4plants/arcCommander/> "ArcCommander"
+[gh-ArcCommander-Wiki]: <https://github.com/nfdi4plants/arcCommander/wiki> "ArcCommander Wiki"
+[gh-Swate]: <https://github.com/nfdi4plants/Swate/wiki> "Swate Wiki"
+
+<!-- Links to external (ext-) sources -->
+
+[ext-github-join]: <https://github.com/join/> "Join GitHub"
+[ext-github-desktop]: <https://desktop.github.com/> "GitHub Desktop"
+[ext-git]: <https://git-scm.com/download/> "Git"
+[ext-git-lfs]: <https://git-lfs.github.com/> "Git-LFS"
+[ext-excel-online]: <https://www.microsoft.com/en-us/microsoft-365/excel> "Excel online"
+
+[ext-VSCode]: https://code.visualstudio.com/ "Visual Studio Code"
+
+[ext-galaxy]: <https://plants.usegalaxy.eu/> "Galaxy Plants"
+[ext-omero]: <https://www.openmicroscopy.org/omero/> "Omero"
+[ext-zenodo]: <https://zenodo.org/> "Zenodo"
+[ext-invenio]: <https://inveniosoftware.org/products/rdm/> "Invenio"
+[ext-DataJournals]: https://www.researchdata.uni-jena.de/en/information/data-publication "RDM Jena Data Journals"
+
+[ext-EBI-PRIDE]: https://www.ebi.ac.uk/pride/ "EBI PRIDE"
+[ext-re3data]: https://www.re3data.org/ "re3data.org"
+[ext-CreativeCommons]: https://creativecommons.org/ "Creative Commons"
+[ext-DublinCore]: <https://www.dublincore.org/specifications/dublin-core/dcmi-terms/> "DublinCore"
+[ext-DataCite]: <https://schema.datacite.org>  "DataCite"
+[fairsharing.org]: https://fairsharing.org/search?fairsharingRegistry=Standard "Standards at fairsharing.org"
+[doi]: https://www.doi.org/ "Digital Object Identifier"
+[orcid]: https://www.orcid.org/ "ORCID"