diff --git a/src/content/docs/fundamentals/reproduce-reuse.md b/src/content/docs/fundamentals/reproduce-reuse.md new file mode 100644 index 000000000..eef925141 --- /dev/null +++ b/src/content/docs/fundamentals/reproduce-reuse.md @@ -0,0 +1,80 @@ +--- +title: Reproduce and reuse +lastUpdated: 2022-09-23 +authors: + - dominik-brilhaus +draft: true +pagefind: false +--- + +This guide outlines key principles and practical steps for achieving reproducibility in both wet-lab and computational (dry-lab) environments. It aims to help researchers in biological sciences understand and implement reproducibility in their work, ensuring that experimental outcomes, data, and analyses can be reliably repeated by others. + +## Reproducibility in Science: Wet-Lab vs Dry-Lab + +| Wet Lab | Dry Lab | +| ------- | ------- | +| Company RNA extraction kit with all buffers and most materials and tools | Established/commercial software; somewhat contained, isolated, self-sustained | +| "Manual" protocol where you mix buffers together yourself | Scripts or combinations of scripts (pipelines) with varying inputs (reference datasets) and tool dependencies (code interpreters, packages, functions) | +| Version, batch, or lot number of materials | Software/package version | +| Laboratory environment (humidity, temperature, equipment) | Operating system (Linux, Windows, Mac) | + +## Challenges in Wet-Lab Reproducibility + +In the wet-lab, many factors influence reproducibility, making it difficult to recreate the exact same results. These factors include: + +- **Biological variance**: Even with the same protocols and conditions, biological systems often exhibit inherent variability. +- **Hands-on factors**: More individuals handling the experiment can introduce variability. +- **Environmental factors**: Humidity, temperature, and even the specific equipment used (e.g., centrifuges, growth chambers) can affect results. 
+ +## Reproducibility in Computational Analyses + +Reproducibility in computational analyses generally focuses on two key aspects: + +1. **Exact output reproduction**: Ensuring that the same input data will consistently yield the same result when the analysis is rerun. +2. **Flexible workflow application**: Ensuring that workflows and analysis pipelines can be applied to different datasets, producing analogous results that can be fed into other analyses or workflows (e.g., generating similar figures). + +## How We Typically Work with Scripts in Computational Workflows + +In computational biology, scripts are often: + +- **Interactive and iterative**: Researchers frequently modify and rerun scripts in response to their data or research questions. +- **Adapted for specific needs**: Researchers often adapt generic scripts to their specific datasets, tweaking them as they go. +- **Hard-coded**: Inputs, outputs, and parameters are sometimes hard-coded directly into the script, which can lead to issues when sharing or transferring the script to others. + +## Common Problems with Reproducibility in Computational Workflows + +One of the main challenges in reproducibility is sharing scripts with others: + +- **Missing dependencies**: When passing a script to a colleague, it might not work because of missing software dependencies, different versions of libraries, or changed file paths. +- **Environmental differences**: Different operating systems, system configurations, or setups may lead to issues in running the script as intended. + +## Common Sources for Scripts + +Researchers typically source scripts from: + +- **Workshops or summer schools**: Scripts often come from educational events and are adapted for specific use. +- **Colleagues**: Researchers share their scripts with peers, who then modify them for their own needs. +- **Manuals or tutorials**: Many scripts are adapted from tutorials available online (e.g., from GitHub repositories). 
+- **Community forums**: Script snippets often come from community-driven sites like Stack Overflow. + +## Software Dependencies and Environment Management + +Reproducibility can break down due to the numerous dependencies and system requirements involved in computational workflows. These include: + +- **Operating systems**: Different platforms (Linux, Windows, Mac) can affect how software runs. +- **Programming environments**: Variations in the programming language (e.g., Python, R, Julia) or the environment (e.g., Shell, Jupyter notebooks) can cause inconsistencies. +- **Package versions**: Even the same software package can behave differently between versions, leading to unexpected results. +- **Virtual environments**: Without isolation tools such as virtual environments or containers, different users might have conflicting software setups. + +## Solutions for Reproducibility + +Several tools and approaches can help address these issues and improve reproducibility: + +- **Containers**: Using Docker or Singularity allows you to package software, dependencies, and environments into a portable container that can be executed consistently across different systems. +- **Workflow languages**: Tools like **CWL** (Common Workflow Language), **Snakemake**, and **Nextflow** help create standardized workflows that are environment-agnostic, specifying input/output parameters and dependencies in a way that’s easy to share and reproduce. + +## Towards a Reproducible Research Environment + +Reproducibility is a critical principle in both biological and computational research. By carefully structuring your workflows, using version control, managing dependencies with containers, describing workflows in a workflow language such as CWL, and applying FAIR principles to your data, you make it far more likely that your research can be reliably reproduced and shared. 
+ +By adopting these practices, you’ll not only improve the robustness and transparency of your own work, but also make it easier for others to build upon your research in the future. diff --git a/src/content/docs/guides/arc-practical-entry.mdx b/src/content/docs/guides/arc-practical-entry.mdx new file mode 100644 index 000000000..2e6496ed0 --- /dev/null +++ b/src/content/docs/guides/arc-practical-entry.mdx @@ -0,0 +1,91 @@ +--- +title: Creating an ARC for Your Project +lastUpdated: 2024-11-13 +authors: + - dominik-brilhaus +sidebar: + order: 0 + badge: + text: new + variant: tip +--- + +You followed Viola's steps during the [start here](/nfdi4plants.knowledgebase/start-here/) guide and are now overwhelmed? Sure, a guide streamlined onto a demo dataset is a whole different story than achieving this with your own complex data. +Here we provide recommendations and considerations for structuring an ARC based on **your current project and datasets**. Remember: creating an ARC is an ongoing process, and it’s meant to evolve over time. + + +## The "final" ARC does not exist – Immutable, Yet Evolving! + +Think of your ARC as an evolving entity that adapts and improves as your project progresses. + +- **Don't aim for perfection right away:** At first, your ARC doesn't need to be flawless. You’re not expected to win an award for the best ARC from the outset. The goal is for it to be useful to **you**. As long as your ARC serves its purpose—whether by organizing data, tracking workflows, or aiding in reproducibility—that’s a win. +- **Priorities vary across researchers:** Different people may have different ideas about what should be made FAIR first and what can be polished later. Allow yourself to start with the basics and improve it step by step. + +So, **don't stress** about making your ARC perfect from the get-go—focus on making it functional. + +## Start Simple: Just Dump the Files Into Your ARC + +An ARC’s core principle is that "everything is a file." 
It’s common to work with a collection of files and folders in your daily research. Why not just start by organizing them into an ARC? + +- **Initial File Dump:** At first, don’t worry too much about the precise structure. Simply place your files into an “**additional payload**” folder within the ARC. This will help you get started without overthinking the details. +- **Version Control with Git:** By putting your files in the ARC and committing them, you gain the benefit of [version control through Git](/nfdi4plants.knowledgebase/fundamentals/version-control-git). This helps you track changes and maintain a history of your files. +- **Safe Backup via [DataHUB](/nfdi4plants.knowledgebase/datahub):** Once you upload your ARC to the [DataHUB](/nfdi4plants.knowledgebase/datahub), you’ll also have a secure backup of your files. + +:::tip +If you’re dealing with large files (e.g., raw sequencing data), you can initially store them anywhere. Just make sure they’re tracked with [Git LFS (Large File Storage)](/nfdi4plants.knowledgebase/git/git-lfs). This way, you can later move the LFS pointers into your ARC without dealing with the actual large files. +::: + +## Add Metadata to Make Your ARC More Shareable and Citable + +Next, enrich your ARC with some **basic metadata**: + +- **Project and Creator Info:** Include metadata about your project and the researchers involved. This step makes your ARC more shareable and **citable** from the start. +- **Link to the Investigation:** Add this metadata to your `investigation` section. This is an easy way to ensure your work is discoverable and properly credited. + +## Sketch Your Laboratory Workflows + +A key goal of an ARC is to trace each finding or result back to its originating biological experiment. To achieve this, your ARC will need to link dataset files to individual samples through a series of **processes** (laboratory or computational steps) with defined **inputs** and **outputs**. 
+ +- **Map Out Your Lab Workflows:** Before diving into the structure of your ARC, take some time to **sketch** what you did in the lab. What experiments did you perform? What samples did you analyze? Which protocols did you follow? This sketch will help you understand how to organize your data and workflows later. + +--- + +## Organize Your Files into `studies` and `assays` + +Once you have a better understanding of your lab processes, you can begin organizing your ARC: + +- **Define `studies` and `assays`:** Structure your data by moving files into relevant folders, such as `studies` and `assays`. This makes it clear where the raw data (`dataset`) is stored and which protocols were used to generate that data. +- **Reference Protocols:** As you organize, simply reference the **existing protocols** (stored as free-text documents) in your ARC. This ensures consistency without overwhelming you with unnecessary details at this stage. + +## Simple First: Link `Input` and `Output` Nodes + +Before delving into complex parameterization or detailed annotation tables, start simple: + +- **Connect Inputs and Outputs:** Begin by connecting your `studies` and `assays` through **input** and **output** nodes. This allows you to trace the flow of data through your workflows without getting bogged down by excessive detail. +- **Re-draw Lab Workflows:** At this stage, you can essentially redraw your lab workflows as tables, mapping each process step to its inputs and outputs. + +## Parameterize Your Protocols for Machine Readability + +Once you have the basic structure in place, you can start making your data more **machine-readable** and **searchable**: + +- **Parameterize Protocols:** To improve reproducibility, break down your protocols and workflows into structured annotation tables. This will allow you to capture the parameters used at each step of your research. 
+- **Make It Searchable:** Structured annotation tables make your study more **discoverable** and help keep your methods clear and reproducible. + +## Keep It Simple for Your Data Analysis Workflows + +The same approach applies to your data analysis workflows: + +- **Treat Data Analysis as Protocols:** Regardless of whether your data analysis involves point-and-click software or custom code, treat it like a **protocol**. For now, just store the results in your `dataset` folder. +- **Iterate as You Go:** You don’t need to go into deep detail at first. Just focus on capturing the core analysis steps, and refine them later as your project progresses. + +## Making Data Analysis More Reproducible: Use CWL, Containers, and Dependency Management + +If you want to make your data analysis more **reproducible** and your workflows more **easily reusable**, consider wrapping your analysis tools in **CWL** (Common Workflow Language) and using **containers**: + +- **CWL for Reproducibility:** Use CWL to describe your computational workflows in a standardized way. This helps others run your analysis with the same inputs and parameters, regardless of their system. +- **Containerization:** Leverage Docker or Singularity containers to encapsulate all software dependencies. This makes it easier to share your workflows and helps them run consistently across different environments. +- **Manage Dependencies:** Use tools like Conda or Docker to manage your software dependencies, avoiding issues with mismatched versions or missing libraries. + +## Conclusion: The ARC is a Living FAIR Digital Object + +The process of creating an ARC is **gradual** and **evolving**. Start simple, and focus on getting the basics in place. Over time, you can refine and enhance your ARC to improve its usefulness and functionality, making it a valuable tool for organizing, sharing, and reproducing your research. 
diff --git a/src/content/docs/fundamentals/best-practices-for-data-annotation.md b/src/content/docs/guides/best-practices-for-data-annotation.md similarity index 100% rename from src/content/docs/fundamentals/best-practices-for-data-annotation.md rename to src/content/docs/guides/best-practices-for-data-annotation.md diff --git a/src/content/docs/vault/ARC-practical-entry-stepwise.md b/src/content/docs/vault/ARC-practical-entry-stepwise.md deleted file mode 100644 index 14e17a047..000000000 --- a/src/content/docs/vault/ARC-practical-entry-stepwise.md +++ /dev/null @@ -1,209 +0,0 @@ ---- -title: Practical Guide into the ARC ecosystem -lastUpdated: 2023-11-29 -authors: - - dominik-brilhaus -draft: true -hidden: true -pagefind: false ---- - -## About this guide - -In this guide we collect recommendations and considerations on creating an ARC based on your current project and datasets - - -## Convert your project into an ARC - -- you have files and folders -- they are stored somewhere -- pack them / decorate them in an ARC. - -## Sketch your laboratory workflows - -One goal of the ARC is to be able to tell, which finding or result originated from which biological experiment. This would ultimately require to link the dataset files back to the individual sample. To do so, we essentially follow a path of *processes* with *inputs* and *outputs*. Some of the inputs and outputs want to be reused or reproduced, some of the processes want to be applied to other inputs. - -Before creating an ARC for an existing dataset, it might help to visualize what was done in the lab. The following is very simplified example that most plant biologists can hopefully relate to. 
- -```mermaid - -%%{ - init: { - 'theme': 'base', - 'themeVariables': { - 'background': '#fff', - 'lineColor': '#2d3e50', - 'primaryTextColor': '#2d3e50' - } - } -}%% - -graph TD - -%% Nodes - S1(Seeds) - - S2(Leaves) - - M1(RNA) - M2(protein) - M3(cDNA) - M4(RNASeq Libraries) - M5(SDS-gel) - M6(western blot) - - P1>plant growth] - P2>RNA extraction] - P3>protein extraction] - P4>cDNA synthesis] - P5>qRT-PCR] - P6>Library preparation] - P7>Next Generation Sequencing] - P8>SDS Page] - P9>taking a photo] - P10>Immunoblotting] - P11>mapping] - - D1("qRT results") - D2(fastq files) - D3(Image of \n SDS gel) - D4(reference \n genome) - D5(count table) - - -%% Links - -subgraph Studies - subgraph study:drought - S1 ---P1--drought\nstress--> S2 - end - - subgraph study:heat - P13>plant growth] - - S1 ---P13--heat\nstress--> S4(Leaves) - end - - subgraph study:genome-ref - P12>Download] - x(Paper supplement) ---P12--> D4 - end - -end - - -subgraph Assays - - subgraph assay:Another Assay - P14>Process XY] - D6(Output XY) - - S4 ---P14--> D6 - end - - subgraph assay:qRT-PCR - S2 ---P2--> M1 - M1 ---P4--> M3 - M3 ---P5--> D1 - end - - subgraph assay:SDS-gel - S2 ---P3--> M2 - M2 ---P8--> M5 - M5 ---P9--> D3 - end - - subgraph assay:RNA-Seq - M1 ---P6--> M4 - M4 ---P7--> D2 - end - - subgraph assay:western Blot - M5 ---P10--> M6 - end - -end - -subgraph Worklows/Runs - - subgraph workflow:mapping - D2 --- P11 - D4 --- P11 - end - - subgraph run - P11 --> D5 - end - -end - - - - -%% Add legend -subgraph Legend - Sx(Sample) - Mx(Material) - Dx(Data) - Px>Process] -end - -%% Defining node styles - classDef S fill:#b4ce82, stroke:#333; - classDef M fill:#ffc000; - classDef D fill:#c21f3a,color:white; - classDef P stroke-width:0px; - -%% Assigning styles to nodes - class Sx,S1,S2,S4 S; - class Mx,M1,M2,M3,M4,M5,M6 M; - class Dx,D1,D2,D3,D4,D5,D6 D; - class Px,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14 P; - -%% Box style -style Worklows/Runs fill:#fff, stroke-width:2px, 
stroke:#333; -style Studies fill:#fff, stroke-width:2px, stroke:#333; -style Assays fill:#fff, stroke-width:2px, stroke:#333; - -``` - -:bulb: On a side note, the above is a very wet-lab heavy example. However, conceptually the same applies to computational workflows. Coders oftentimes design their scripts, workflows and pipelines in successive modules with defined inputs and outputs. - - -## Now action - -Once - - - - -## Work with identifiers - -The ARC and the ISA metadata model offer determined places to - -- `Input` and `Output` fields such as Source Name, Sample Name, Data File Names -- `Protocol REF` - - -### Every file (name) is an identifier diff --git a/src/content/docs/vault/ARC-practical-entry.md b/src/content/docs/vault/ARC-practical-entry.md deleted file mode 100644 index be98e1102..000000000 --- a/src/content/docs/vault/ARC-practical-entry.md +++ /dev/null @@ -1,623 +0,0 @@ ---- -title: Practical Guide into the ARC ecosystem -lastUpdated: 2023-11-29 -authors: - - dominik-brilhaus -draft: true -hidden: true -pagefind: false ---- - -## About this guide - -In this guide we collect recommendations and considerations on creating an ARC based on your current project and datasets - - -## Convert your project into an ARC - -- you have files and folders -- they are stored somewhere -- pack them / decorate them in an ARC. - - -## Sketch your laboratory workflows - -One goal of the ARC is to be able to tell, which finding or result originated from which biological experiment. This would ultimately require to link the dataset files back to the individual sample. To do so, we essentially follow a path of *processes* with *inputs* and *outputs*. Some of the inputs and outputs want to be reused or reproduced, some of the processes want to be applied to other inputs. - -Before creating an ARC for an existing dataset, it might help to visualize what was done in the lab. The following is very simplified example that most plant biologists can hopefully relate to. 
- - -### Green-house to gene expression - -Consider you want to investigate the effect of drought stress on the transcript levels of you gene of interest (GOI) via qRT-PCR. You grow plants from seeds, drought-stress the plants and collect leaves at the end of the growth study. From the leave samples – homogenized to powder and stored in a freezer – you take an aliquot to extract RNA, from which you synthesize cDNA. The cDNA (together with other biologicals and chemicals) is the input for a qRT-PCR yielding relative transcript levels as the output. - -```mermaid - -%%{ - init: { - 'theme': 'base', - 'themeVariables': { - 'background': '#fff', - 'lineColor': '#2d3e50', - 'primaryTextColor': '#2d3e50' - } - } -}%% - -flowchart LR - -%% Nodes - S1(Seeds) - S2(Leaves) - - M1(RNA) - M3(cDNA) - - P1>plant growth] - P2>RNA extraction] - P4>cDNA synthesis] - P5>qRT-PCR] - - D1("qRT results") - -%% Links - - S1 ---P1--drought\nstress--> S2 - S2 ---P2--> M1 - M1 ---P4--> M3 - M3 ---P5--> D1 - -%% Defining node styles - classDef S fill:#b4ce82, stroke:#333; - classDef M fill:#ffc000; - classDef D fill:#c21f3a,color:white; - classDef P stroke-width:0px; - -%% Assigning styles to nodes - class Sx,S1,S2 S; - class Mx,M1,M2,M3,M4,M5,M6 M; - class Dx,D1,D2,D3,D4,D5 D; - class Px,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13 P; - -``` - -### Confirm findings on protein level - -You found your GOI affected by drought stress on transcript level. To confirm that the expression of the encoded protein is likewise affected, you take another aliquot from the same leave samples, extract proteins, separate them by SDS-PAGE and immunoblot the SDS gel with antibodies specific for your GOI. 
- -```mermaid - -%%{ - init: { - 'theme': 'base', - 'themeVariables': { - 'background': '#fff', - 'lineColor': '#2d3e50', - 'primaryTextColor': '#2d3e50' - } - } -}%% - - -graph LR - -%% Nodes - S1(Seeds) - S2(Leaves) - - M1(RNA) - M2(protein) - M3(cDNA) - M5(SDS-gel) - M6(western blot) - - P1>plant growth] - P2>RNA extraction] - P3>protein extraction] - P4>cDNA synthesis] - P5>qRT-PCR] - P8>SDS Page] - P9>taking a photo] - P10>Immunoblotting] - - D1("qRT results") - D3(Image of \n SDS gel) - -%% Links - - S1 ---P1--drought\nstress--> S2 - - S2 ---P2--> M1 - S2 ---P3--> M2 - M1 ---P4--> M3 - M3 ---P5--> D1 - M2 ---P8--> M5 - M5 ---P9--> D3 - M5 ---P10--> M6 - -%% Defining node styles - classDef S fill:#b4ce82, stroke:#333; - classDef M fill:#ffc000; - classDef D fill:#c21f3a,color:white; - classDef P stroke-width:0px; - -%% Assigning styles to nodes - class Sx,S1,S2 S; - class Mx,M1,M2,M3,M4,M5,M6 M; - class Dx,D1,D2,D3,D4,D5 D; - class Px,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13 P; - -``` - -### Global overview of gene expression - -You could show that the expression of your GOI was affected by drought on both transcript and protein level. In order to identify transcripts that correlate with your GOI under drought stress, you prepare RNA extracted earlier and submit it to a company for mRNA-Seq. 
- - -```mermaid - -%%{ - init: { - 'theme': 'base', - 'themeVariables': { - 'background': '#fff', - 'lineColor': '#2d3e50', - 'primaryTextColor': '#2d3e50' - } - } -}%% - - -graph LR - -%% Nodes - S1(Seeds) - - S2(Leaves) - - M1(RNA) - M2(protein) - M3(cDNA) - M4(RNASeq Libraries) - M5(SDS-gel) - M6(western blot) - - P1>plant growth] - P2>RNA extraction] - P3>protein extraction] - P4>cDNA synthesis] - P5>qRT-PCR] - P6>Library preparation] - P7>Next Generation Sequencing] - P8>SDS Page] - P9>taking a photo] - P10>Immunoblotting] - - D1("qRT results") - D2(fastq files) - D3(Image of \n SDS gel) - -%% Links - -S1 ---P1--drought\nstress--> S2 - - S2 ---P2--> M1 - S2 ---P3--> M2 - M1 ---P4--> M3 - M3 ---P5--> D1 - M1 ---P6--> M4 - M4 ---P7--> D2 - M2 ---P8--> M5 - M5 ---P9--> D3 - M5 ---P10--> M6 - -%% Defining node styles - classDef S fill:#b4ce82, stroke:#333; - classDef M fill:#ffc000; - classDef D fill:#c21f3a,color:white; - classDef P stroke-width:0px; - -%% Assigning styles to nodes - class Sx,S1,S2,S4 S; - class Mx,M1,M2,M3,M4,M5,M6 M; - class Dx,D1,D2,D3,D4,D5 D; - class Px,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13 P; - -``` - -### Adding external data - -From the company you receive the RNA-Seq reads in form of fastq files. In order to quantify the reads and generate a count table, you map them against a suitable reference genome downloaded from an online database or publication's supplemental data. 
- - -```mermaid - -%%{ - init: { - 'theme': 'base', - 'themeVariables': { - 'background': '#fff', - 'lineColor': '#2d3e50', - 'primaryTextColor': '#2d3e50' - } - } -}%% - - -graph LR - -%% Nodes - S1(Seeds) - - S2(Leaves) - - M1(RNA) - M2(protein) - M3(cDNA) - M4(RNASeq Libraries) - M5(SDS-gel) - M6(western blot) - - P1>plant growth] - P2>RNA extraction] - P3>protein extraction] - P4>cDNA synthesis] - P5>qRT-PCR] - P6>Library preparation] - P7>Next Generation Sequencing] - P8>SDS Page] - P9>taking a photo] - P10>Immunoblotting] - P11>mapping] - - D1("qRT results") - D2(fastq files) - D3(Image of \n SDS gel) - D4(reference \n genome) - D5(count table) - -%% Links - S1 ---P1--drought\nstress--> S2 - P12>Download] - x(Paper supplement) ---P12--> D4 - - S2 ---P2--> M1 - S2 ---P3--> M2 - M1 ---P4--> M3 - M3 ---P5--> D1 - M1 ---P6--> M4 - M4 ---P7--> D2 - D2 --- P11 - D4 --- P11 - P11 --> D5 - M2 ---P8--> M5 - M5 ---P9--> D3 - M5 ---P10--> M6 - - -%% Defining node styles - classDef S fill:#b4ce82, stroke:#333; - classDef M fill:#ffc000; - classDef D fill:#c21f3a,color:white; - classDef P stroke-width:0px; - -%% Assigning styles to nodes - class Sx,S1,S2,S4 S; - class Mx,M1,M2,M3,M4,M5,M6 M; - class Dx,D1,D2,D3,D4,D5 D; - class Px,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13 P; - -``` - -### What this could look like in an ARC - -```mermaid - -%%{ - init: { - 'theme': 'base', - 'themeVariables': { - 'background': '#fff', - 'lineColor': '#2d3e50', - 'primaryTextColor': '#2d3e50' - } - } -}%% - - -graph LR - -%% Nodes - S1(Seeds) - - S2(Leaves) - - M1(RNA) - M2(protein) - M3(cDNA) - M4(RNASeq Libraries) - M5(SDS-gel) - M6(western blot) - - P1>plant growth] - P2>RNA extraction] - P3>protein extraction] - P4>cDNA synthesis] - P5>qRT-PCR] - P6>Library preparation] - P7>Next Generation Sequencing] - P8>SDS Page] - P9>taking a photo] - P10>Immunoblotting] - P11>mapping] - - D1("qRT results") - D2(fastq files) - D3(Image of \n SDS gel) - D4(reference \n genome) - D5(count table) - 
- -%% Links - -subgraph Studies - subgraph study:drought - S1 ---P1--drought\nstress--> S2 - end - - subgraph study:genome-ref - P12>Download] - x(Paper supplement) ---P12--> D4 - end - -end - - -subgraph Assays - - subgraph assay:qRT-PCR - S2 ---P2--> M1 - M1 ---P4--> M3 - M3 ---P5--> D1 - end - - subgraph assay:SDS-gel - S2 ---P3--> M2 - M2 ---P8--> M5 - M5 ---P9--> D3 - end - - subgraph assay:RNA-Seq - M1 ---P6--> M4 - M4 ---P7--> D2 - end - - subgraph assay:western Blot - M5 ---P10--> M6 - end - -end - -subgraph Worklows/Runs - - subgraph workflow:mapping - D2 --- P11 - D4 --- P11 - end - - subgraph run - P11 --> D5 - end - -end - - -%% Add legend -subgraph Legend - Sx(Sample) - Px>Process] - Mx(Material) - Dx(Data) -end - -%% Defining node styles - classDef S fill:#b4ce82, stroke:#333; - classDef M fill:#ffc000; - classDef D fill:#c21f3a,color:white; - classDef P stroke-width:0px; - -%% Assigning styles to nodes - class Sx,S1,S2,S4 S; - class Mx,M1,M2,M3,M4,M5,M6 M; - class Dx,D1,D2,D3,D4,D5 D; - class Px,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13 P; - -%% Box style -style Worklows/Runs fill:#fff, stroke-width:2px, stroke:#333; -style Studies fill:#fff, stroke-width:2px, stroke:#333; -style Assays fill:#fff, stroke-width:2px, stroke:#333; - -``` - - -### Add a new study and sample set - -```mermaid - -%%{ - init: { - 'theme': 'base', - 'themeVariables': { - 'background': '#fff', - 'lineColor': '#2d3e50', - 'primaryTextColor': '#2d3e50' - } - } -}%% - - -graph LR - -%% Nodes - S1(Seeds) - - S2(Leaves) - - M1(RNA) - M2(protein) - M3(cDNA) - M4(RNASeq Libraries) - M5(SDS-gel) - M6(western blot) - - P1>plant growth] - P2>RNA extraction] - P3>protein extraction] - P4>cDNA synthesis] - P5>qRT-PCR] - P6>Library preparation] - P7>Next Generation Sequencing] - P8>SDS Page] - P9>taking a photo] - P10>Immunoblotting] - P11>mapping] - - D1("qRT results") - D2(fastq files) - D3(Image of \n SDS gel) - D4(reference \n genome) - D5(count table) - - -%% Links - -subgraph 
Studies - subgraph study:drought - S1 ---P1--drought\nstress--> S2 - end - - subgraph study:heat - P13>plant growth] - - S1 ---P13--heat\nstress--> S4(Leaves) - end - - subgraph study:genome-ref - P12>Download] - x(Paper supplement) ---P12--> D4 - end - -end - - -subgraph Assays - - subgraph assay:Another Assay - P14>Process XY] - D6(Output XY) - - S4 ---P14--> D6 - end - - subgraph assay:qRT-PCR - S2 ---P2--> M1 - M1 ---P4--> M3 - M3 ---P5--> D1 - end - - subgraph assay:SDS-gel - S2 ---P3--> M2 - M2 ---P8--> M5 - M5 ---P9--> D3 - end - - subgraph assay:RNA-Seq - M1 ---P6--> M4 - M4 ---P7--> D2 - end - - subgraph assay:western Blot - M5 ---P10--> M6 - end - -end - -subgraph Worklows/Runs - - subgraph workflow:mapping - D2 --- P11 - D4 --- P11 - end - - subgraph run - P11 --> D5 - end - -end - - - - -%% Add legend -subgraph Legend - Sx(Sample) - Px>Process] - Mx(Material) - Dx(Data) -end - -%% Defining node styles - classDef S fill:#b4ce82, stroke:#333; - classDef M fill:#ffc000; - classDef D fill:#c21f3a,color:white; - classDef P stroke-width:0px; - -%% Assigning styles to nodes - class Sx,S1,S2,S4 S; - class Mx,M1,M2,M3,M4,M5,M6 M; - class Dx,D1,D2,D3,D4,D5,D6 D; - class Px,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14 P; - -%% Box style -style Worklows/Runs fill:#fff, stroke-width:2px, stroke:#333; -style Studies fill:#fff, stroke-width:2px, stroke:#333; -style Assays fill:#fff, stroke-width:2px, stroke:#333; - -``` - -:bulb: On a side note, the above is a very wet-lab heavy example. However, conceptually the same applies to computational workflows. Coders oftentimes design their scripts, workflows and pipelines in successive modules with defined inputs and outputs. 
- - -## Now action - -Once - - - - -## Work with identifiers - -The ARC and the ISA metadata model offer determined places to - -- `Input` and `Output` fields such as Source Name, Sample Name, Data File Names -- `Protocol REF` - - -### Every file (name) is an identifier diff --git a/src/content/docs/vault/arc-user-journey.mdx b/src/content/docs/vault/arc-user-journey.mdx deleted file mode 100644 index 67c48df56..000000000 --- a/src/content/docs/vault/arc-user-journey.mdx +++ /dev/null @@ -1,71 +0,0 @@ ---- -title: ARC User Journey -lastUpdated: 2022-08-05 -authors: - - martin-kuhl -status: published -pagefind: false ---- - -## About this guide - -In this guide we focus on explaining the ARC structure and its different components. - -## Viola's ARC - -Let's imagine a scenario where your project partner suggests at a conference to use this cool new Annotated Research Context (ARC) for your collaboration. Convinced by the versioning system and the single point of entry logic, you are motivated to set up your first own ARC after returning to the lab and fill it with your latest project results. Back home, however, you only remember the basic ARC structure and something about some isa.xlsx files. So how do you transfer your project into the empty ARC your project partner shared with you? - -import BaseArc from '@components/mdx/BaseARC.mdx' - - - -To answer this question, we will first take a look back at Viola's [metadata](/nfdi4plants.knowledgebase/fundamentals/metadata)example: - -> Viola investigates the effect of the plant circadian clock on sugar metabolism in *W. mirabilis*. For her PhD project, which is part of an EU-funded consortium in Prof. Beetroot's lab, she acquires seeds from a South-African Botanical Society. 
Viola grows the plants under different light regimes, harvests leaves from a two-day time series experiment, extracts polar metabolites as well as RNA and submits the samples to nearby core facilities for metabolomics and transcriptomics measurements, respectively. After a few weeks of iterative consultation with the facilities' heads as well as technicians and computational biologists involved, Viola receives back a wealth of raw and processed data. From the data she produces figures and wraps everything up to publish the results in the *Journal of Wonderful Plant Sciences*. - -The entire information given in this example can be stored within an ARC. To illustrate the [ARC specifications](/nfdi4plants.knowledgebase/core-concepts/arc/#arc-specification), we will highlight and explain every (sub)directory and ISA-file of the ARC with references to Viola's example. - -## isa.investigation.xlsx - -The ISA investigation workbook allows you to record administrative metadata of your project. In Viola's example, the title of the project, the contact persons, and related publications correspond to such metadata. Besides that, the workbook can also contain a short description of your project, but also lists included studies with respective design types, assays, protocols, etc.. Although we recommend to use the [ARC Commander](/nfdi4plants.knowledgebase/arc-commander) for adding these metadata, you can of course fill the workbook (and also the [isa.study.xlsx](#isastudyxlsx) and [isa.assay.xlsx](#isaassayxlsx)) manually. - -## Studies - -In the `studies` (sub)folders you can collect material and resources used within your studies. Corresponding information in Viola's project include the source of her seeds (South-African Botanical Society), how she grew the plants, and the design of the experiment (two-day time series, etc.). - -In case your investigation contains more than one study, each of these studies is placed in an individual subdirectory. 
The "resources" directory allows you to store material samples or external data as virtual sample files. You can use the protocol subdirectory to store free-text protocols that describe how the samples or materials were created. - -### isa.study.xlsx - -Every study contains one `isa.study.xlsx` file to specify the characteristics of all material and resources. Resources described in a study file can be the input for one or multiple assays or workflows. The workbook contains (at least) two worksheets: - -- "CircadianClock_Light regimes": One or more worksheets, depending on the number of used protocols, to annotate the properties of your source material following the ISA model. The sheet name is not obligatory to be the exact same as the "Study Identifier". While this can be done manually, we recommend using our ontology supported annotation tool [Swate](/nfdi4plants.knowledgebase/swate). -- "Study": Viola collected the administrative metadata of her study in this worksheet. This information can later be transferred into the `isa.investigation.xlsx` using the [ARC Commander](/nfdi4plants.knowledgebase/arc-commander). - -## Assays - -The `assays` folder allows you to store data and metadata from experimental processes or analytical measurements. Each assay is a collection of files stored in a single directory, including corresponding metadata files in form of an `isa.assay.xlsx`. Viola needs two subdirectories, one for her metabolomics and one for her transcriptomics dataset, respectively. Assay data files and free-text protocols are placed in individual subdirectories. Data files produced by an assay can be the input for one or multiple [workflows](#workflows). - -### isa.assay.xlsx - -Viola can annotate her experimental workflows of the metabolomics and transcriptomics assays with process parameters in the `isa.assay.xlsx` file, which needs to be present for every assay. 
The workbook contains two or more worksheets, depending on the number of used protocols: - - -- "MetaboliteExtraction": A worksheet to annotate the experimental workflow, in this case for extraction of metabolites. While this can be done manually, we recommend using our ontology supported annotation tool [Swate](/nfdi4plants.knowledgebase/swate). -> Note: Using the name of the protocol for the name of the worksheet can provide clarity. -- "MetaboliteMeasurement": A worksheet that describes the quantification of polar metabolites using gas-chromatography mass-spectrometry. -- "Assay": Viola collected the administrative metadata of her assay in this worksheet. This information can later be transferred into the `isa.investigation.xlsx` using the ARC Commander. - -## Workflows - -In an ARC `workflows` represent the processing steps used in computational analyses and other transformations of data originating from studies and assays. Typical examples include data cleaning and preprocessing, computational analysis, or visualization. The outcomes of these workflows ("run results") are stored in [runs](#runs). - -Viola received for her transcriptome and metabolome assays various processed data files, which she now can use to generate some nice plots. Additionally, the computational biologists sent her the code used for data processing, including an executable Common Workflow Language (CWL) file, which contains a standardized tool or workflow description. She stores these files in individual subdirectories for each workflow. - -## Runs - -After Viola generated her plots, she placed them in individual subdirectories, specific to the run they were generated with. In general, you can use the runs folder to store plots, tables, or similar result files that derive from computations on study, assay, external data or other runs. - -## Cheat sheet - -We hope that these examples nicely illustrated the ARC structure and that you are now ready to produce your own ARCs. 
Use the figure below as a cheat sheet to remember where to store which files. Or follow the [ARC Commander QuickStart](/nfdi4plants.knowledgebase/arc-commander/arc-commander-quick-start) to try it out yourself. diff --git a/src/content/docs/vault/cwl.md b/src/content/docs/vault/cwl.md deleted file mode 100644 index a439387c9..000000000 --- a/src/content/docs/vault/cwl.md +++ /dev/null @@ -1,114 +0,0 @@ ---- -title: Reproduce and reuse -lastUpdated: 2022-08-08 -authors: - - dominik-brilhaus -draft: true -hidden: true -pagefind: false ---- - -> Note: This is just a first collection of thoughts. -> Could be partitioned into fundamentals/implementation/tutorial - -## Fundamentals: (code / software) reproducibility - - -Reproducibility in science overall - -wet lab | dry lab ---- | --- -company RNA extraction kit with all buffers and most of materials and tools| established / (commercial) software; somewhat contained, isolated, self-sustained -"manual" protocol where you buy and mix buffers together yourself | script or combinations of scripts (pipeline) with varying inputs (reference data sets) and tool dependencies (code interpreters, packages, functions) -version, batch or LOT number | software / package version -laboratory environment | operating system - - -In the wet-lab many more factors affect reproducibility, making it close to impossible to reproduce the exact same outcomes (results, datasets) -- biological variance -- hands-on factor (more hands, bigger variance) -- environment (humidity, temperature), but also standard devices (growth chamber, centrifuge) - - - -- Reproducibility of computational analyses - - a) you can "reproduce" that exact same output (run result) using the exact same inputs - - b) you can apply the analysis onto other data to produce analogous outputs, that can be fed into other workflows (e.g. 
generate similar figures) - -- How we usually (learn to) work with scripts - - interactive, iterative - - adapt script to specific needs - - write (hard-code) inputs, outputs into script - -- Problem - - hand script to colleague - - script not working due to missing (software) dependencies, changed (absolute) paths to environments / inputs / other dependencies (e.g. database resources) - -- Example sources for scripts - - workshop / summer school - - colleagues - - manual / tutorial to a tool (downloaded and adapted from GitHub) - - copy/pasted from stack overflow - -- Software dependencies - - on multiple levels / in different shapes - - operating system (Linux, Windows, Mac) - - programming environment / interpreter (shell, python, r, julia, f#) - - packages / libraries within the programming environment - - version of one of above - - (use of) virtual environments - -- Towards solutions - - containers - - docker, singularity - - workflow languages - - CWL, snakemake, neftflow - - environment-agnostic - - formulate ins, outs, parameters - - - - -## Implementation: Make your ARC reproducible / executable with CWL - -1. add workflows / scripts to `workflows` -2. Make workflows CWL-executable, by adding (parallel to the workflow / in the same workflows subdir) a .cwl file that - - describes the expected inputs, outputs, and parameters -3. Execute the workflow - 1. "directly", calling the parameters via CLI - - ```bash - cwltool my_workflow.cwl -p1 parameter1 -p2 parameter2 - ``` - - 2. referencing to a YAML file, that collects the required parameters - ```bash - cwltool my_workflow.cwl my_workflow_parameters.yml - ``` - -- use of paths / working directories -- runs folder -- Workflow metadata: my_workflow_parameters.yml - -## Tutorial: CWL Generator quickstart - -### Install - -[gh-CWLgenerator][https://github.com/nfdi4plants/CWLGenerator] - -### Dependencies - -- Node.js (required for CWL Generator) -- cwltool / cwl-runner -- Docker (?) 
- -### Recommendations - -- VS code extension [CWL (Rabix/Benten)](https://marketplace.visualstudio.com/items?itemName=sbg-rabix.benten-cwl) - - -### Note / Typical errors - -- (re)moved a required input or output -- cwltool can neither resolve "~" nor $HOME ?! -- let recurrent variables (script name, outfolder, etc.) come first diff --git a/src/content/docs/vault/reproduce-reuse.md b/src/content/docs/vault/reproduce-reuse.md deleted file mode 100644 index be95aa100..000000000 --- a/src/content/docs/vault/reproduce-reuse.md +++ /dev/null @@ -1,69 +0,0 @@ ---- -title: Reproduce and reuse -lastUpdated: 2022-09-23 -authors: - - dominik-brilhaus -draft: true -hidden: true -pagefind: false ---- - -> This article is work-in-progress. - -Key aspects of the [FAIR principles](/nfdi4plants.knowledgebase/fundamentals/fair-data-principles) and driver for the development of good [RDM](/nfdi4plants.knowledgebase/fundamentals/research-data-management) are *reproducibility* and *re-usability* (FAI**R**) of scientific outputs as well as workflows leading to these outputs. Although here we focus more on data and the "computational side", we would like to emphasize some analogies between **Data** science and **PLANT** science. Especially as some requirements in both environments can at least in part be met with similar approaches. - -Consider our PhD Viola (see [metadata](/nfdi4plants.knowledgebase/fundamentals/metadata)). In the wet lab, she extracts RNA from her plant samples using a ready-to-use commercial extraction kit with all buffers and some required materials and tools included. Similarly in the dry lab she would use an established, commercial office software that is mostly contained/isolated, for small spread-sheet calculations. There is no commercial kit available to extract metabolites suitable with the special plant species Viola is interested in. 
So she uses a "manual" protocol established in her lab, for which she orders and prepares buffers and solutions herself and gathers the required devices, tubes and materials. Once she receives her RNA-Seq data, she sets up her own combinations of scripts (pipeline) with varying inputs (reference data sets) and tool dependencies (code interpreters, packages, functions). In the end, Viola's complete workspace, be it the laboratory environment or her computer's operating system, comes with its specific setup, tools, resources and limitations. And her research routine would likely differ if she were to pursue it in a different lab or using another computer. - -For both types of workflows, there are (clearly) defined inputs and outputs, e.g. the state of the or the data format. And Viola makes sure to document as much metadata as possible to make her workflows reproducible, including e.g. version, batch or LOT numbers of a kit or chemical and the versions of software and packages. Also trouble-shooting with a colleague, company, data steward or seeking help in online forums, is always easier if you share information about your setting. - - - -## On the shoulders of giants - - -"In real life" you can take a sample once and only once. You can take replicate samples – technical (same plant different leaf) or biological (different plant) –, but in the end this is a new and different sample. In the wet-lab many more factors affect reproducibility, making it close to impossible to reproduce the exact same outcome (results, datasets). These include biological variance, hands-on factors (more hands, bigger variance), the environment (humidity, temperature), but also deviations in standard devices (growth chamber, centrifuge). - - Still for other researchers to be able to re-use (i.e. build on) your findings, it will be helpful to document, metadata... - - -1. re-use an outcome (data or sample) -2. reproduce an outcome (peer-review) -3. 
re-use a workflow (lab protocol or analysis) - - - -- Reproducibility of computational analyses - - a) you can "reproduce" that exact same output (run result) using the exact same inputs - - b) you can apply the analysis onto other data to produce analogous outputs, that can be fed into other workflows (e.g. generate similar figures) - -- How we usually (learn to) work with scripts - - interactive, iterative - - adapt script to specific needs - - write (hard-code) inputs, outputs into script - -- Problem - - hand script to colleague - - script not working due to missing (software) dependencies, changed (absolute) paths to environments / inputs / other dependencies (e.g. database resources) - -- Example sources for scripts - - workshop / summer school - - colleagues - - manual / tutorial to a tool (downloaded and adapted from GitHub) - - copy/pasted from stack overflow - -- Software dependencies - - on multiple levels / in different shapes - - operating system (Linux, Windows, Mac) - - programming environment / interpreter (shell, python, r, julia, f#) - - packages / libraries within the programming environment - - version of one of above - - (use of) virtual environments - -- Towards solutions - - containers - - docker, singularity - - workflow languages - - CWL, snakemake, nextflow - - environment-agnostic - - formulate ins, outs, parameters - - workflow management systems - - galaxy
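
The "formulate ins, outs, parameters" point in the notes above can be sketched in a few lines. This is a hypothetical illustration, not part of the original notes: the `--input`, `--output`, and `--min-count` arguments, the `count` column, and the `filter_counts` helper are all invented for the example. It shows a script that receives its inputs, outputs, and parameters from the command line instead of hard-coding them.

```python
import argparse
import csv
from pathlib import Path


def filter_counts(rows, min_count):
    """Keep rows whose 'count' value is at least min_count (hypothetical helper)."""
    return [row for row in rows if int(row["count"]) >= min_count]


def build_parser():
    """Declare inputs, outputs, and parameters explicitly instead of hard-coding them."""
    parser = argparse.ArgumentParser(description="Filter a count table (illustrative only).")
    parser.add_argument("--input", type=Path, required=True, help="input CSV with a 'count' column")
    parser.add_argument("--output", type=Path, required=True, help="path for the filtered CSV")
    parser.add_argument("--min-count", type=int, default=10, help="minimum count to keep")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)

    # Read the table from the path given on the command line.
    with args.input.open(newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)

    kept = filter_counts(rows, args.min_count)

    # Write the result to the path given on the command line.
    with args.output.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)


if __name__ == "__main__":
    main()
```

Called as `python filter_counts.py --input counts.csv --output filtered.csv --min-count 5` (file names are placeholders), the script contains no absolute paths, so it is easier to hand to a colleague or to describe with a workflow language.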