Commit

Merge pull request #521 from nfdi4plants/clean-vault
Clean-vault
Brilator authored Nov 13, 2024
2 parents ba67651 + 5db7182 commit 2fb4e8a
Showing 8 changed files with 171 additions and 1,086 deletions.
80 changes: 80 additions & 0 deletions src/content/docs/fundamentals/reproduce-reuse.md
@@ -0,0 +1,80 @@
---
title: Reproduce and reuse
lastUpdated: 2022-09-23
authors:
- dominik-brilhaus
draft: true
pagefind: false
---

This guide outlines key principles and practical steps for achieving reproducibility in both wet-lab and computational (dry-lab) environments. It aims to help researchers in biological sciences understand and implement reproducibility in their work, ensuring that experimental outcomes, data, and analyses can be reliably repeated by others.

## Reproducibility in Science: Wet-Lab vs Dry-Lab

| Wet Lab | Dry Lab |
| ------- | ------- |
| Commercial RNA extraction kit that includes all buffers and most materials and tools | Established/commercial software; somewhat contained, isolated, self-sustained |
| "Manual" protocol where you mix buffers together yourself | Scripts or combinations of scripts (pipelines) with varying inputs (reference datasets) and tool dependencies (code interpreters, packages, functions) |
| Version, batch, or lot number of materials | Software/package version |
| Laboratory environment (humidity, temperature, equipment) | Operating system (Linux, Windows, Mac) |

## Challenges in Wet-Lab Reproducibility

In the wet-lab, many factors influence reproducibility, making it difficult to recreate the exact same results. These factors include:

- **Biological variance**: Even with the same protocols and conditions, biological systems often exhibit inherent variability.
- **Hands-on factors**: More individuals handling the experiment can introduce variability.
- **Environmental factors**: Humidity, temperature, and even the specific equipment used (e.g., centrifuges, growth chambers) can affect results.

## Reproducibility in Computational Analyses

Reproducibility in computational analyses generally focuses on two key aspects:

1. **Exact output reproduction**: Ensuring that the same input data will consistently yield the same result when the analysis is rerun.
2. **Flexible workflow application**: Ensuring that workflows and analysis pipelines can be applied to different datasets, producing analogous results that can be fed into other analyses or workflows (e.g., generating similar figures).
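The first aspect, exact output reproduction, can be checked mechanically. The following is a minimal sketch, assuming a toy analysis with a random component: `run_analysis`, `fingerprint`, and the example values are hypothetical and only illustrate the idea of fixing the seed and comparing a hash of the result across two runs.

```python
import hashlib
import random


def run_analysis(values, seed=42):
    """Toy analysis (hypothetical): bootstrap the input and average the means."""
    rng = random.Random(seed)  # local seeded generator, no hidden global state
    means = []
    for _ in range(100):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


def fingerprint(result):
    """Hash the textual form of a result so two runs can be compared exactly."""
    return hashlib.sha256(repr(result).encode("utf-8")).hexdigest()


data = [2.1, 3.4, 1.9, 4.0, 2.8]
first = fingerprint(run_analysis(data))
second = fingerprint(run_analysis(data))
print("identical output:", first == second)
```

Storing such a fingerprint next to the result gives a later rerun something concrete to compare against. It only covers the first aspect; whether a workflow transfers to other datasets still has to be judged on the results themselves.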

## How We Typically Work with Scripts in Computational Workflows

In computational biology, scripts are often:

- **Interactive and iterative**: Researchers frequently modify and rerun scripts in response to their data or research questions.
- **Adapted for specific needs**: Researchers often adapt generic scripts to their specific datasets, tweaking them as they go.
- **Hard-coded**: Inputs, outputs, and parameters are sometimes hard-coded directly into the script, which can lead to issues when sharing or transferring the script to others.
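One common way around hard-coded inputs, outputs, and parameters is to expose them as command-line arguments. The sketch below assumes a small, hypothetical filtering script; the option names (`--input`, `--output`, `--min-count`) and the tab-separated `name<TAB>count` format are illustrative choices, not a convention taken from this guide.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare inputs, outputs and parameters instead of hard-coding them."""
    parser = argparse.ArgumentParser(description="Filter a count table by a minimum count.")
    parser.add_argument("--input", type=Path, required=True, help="path to the input table")
    parser.add_argument("--output", type=Path, required=True, help="path for the filtered table")
    parser.add_argument("--min-count", type=int, default=10, help="rows below this count are dropped")
    return parser


def filter_counts(lines, min_count):
    """Keep 'name<TAB>count' rows whose count is at least min_count."""
    kept = []
    for line in lines:
        name, count = line.rstrip("\n").split("\t")
        if int(count) >= min_count:
            kept.append(f"{name}\t{count}")
    return kept


def main(argv=None):
    """Read the input table, filter it, write the result, return the row count."""
    args = build_parser().parse_args(argv)
    rows = args.input.read_text().splitlines()
    kept = filter_counts(rows, args.min_count)
    args.output.write_text("\n".join(kept) + "\n")
    return len(kept)
```

Saved as a hypothetical `filter_counts.py` and with `main()` called under the usual `if __name__ == "__main__":` guard, the script could be run as `python filter_counts.py --input counts.tsv --output filtered.tsv --min-count 5`. A colleague then changes paths and thresholds on the command line rather than in the code.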

## Common Problems with Reproducibility in Computational Workflows

One of the main challenges in reproducibility is sharing scripts with others:

- **Missing dependencies**: A script passed on to a colleague might not work because of missing software dependencies, different library versions, or changed file paths.
- **Environmental differences**: Different operating systems, system configurations, or setups may lead to issues in running the script as intended.

## Common Sources for Scripts

Researchers typically source scripts from:

- **Workshops or summer schools**: Scripts often come from educational events and are adapted for specific use.
- **Colleagues**: Researchers share their scripts with peers, who then modify them for their own needs.
- **Manuals or tutorials**: Many scripts are adapted from tutorials available online (e.g., from GitHub repositories).
- **Community forums**: Script snippets often come from community-driven sites like Stack Overflow.

## Software Dependencies and Environment Management

Reproducibility can break down due to the numerous dependencies and system requirements involved in computational workflows. These include:

- **Operating systems**: Different platforms (Linux, Windows, Mac) can affect how software runs.
- **Programming environments**: Variations in the programming language (e.g., Python, R, Julia) or the environment (e.g., Shell, Jupyter notebooks) can cause inconsistencies.
- **Package versions**: Even the same software package can behave differently between versions, leading to unexpected results.
- **Virtual environments**: Without using tools like virtual environments or containers, different users might have conflicting software setups.
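A lightweight first step is to record which operating system, interpreter, and package versions an analysis actually ran with. The following sketch uses only the Python standard library; the function name `describe_environment` and the package names `numpy` and `pandas` are placeholders for whatever your own analysis depends on.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, OS and installed versions for the given package names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "packages": versions,
    }


# Placeholder package names; replace with the dependencies of your analysis.
snapshot = describe_environment(["numpy", "pandas"])
print(json.dumps(snapshot, indent=2))
```

Saving this snapshot alongside the results does not make the environment reproducible by itself, but it documents what would need to be recreated in a virtual environment or container.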

## Solutions for Reproducibility

Several tools and approaches can help address these issues and improve reproducibility:

- **Containers**: Using Docker or Singularity allows you to package software, dependencies, and environments into a portable container that can be executed consistently across different systems.
- **Workflow languages**: Tools like **CWL** (Common Workflow Language), **Snakemake**, and **Nextflow** help create standardized workflows that are environment-agnostic, specifying input/output parameters and dependencies in a way that’s easy to share and reproduce.

## Towards a Reproducible Research Environment

Reproducibility is a critical principle in both biological and computational research. By carefully structuring your workflows, using version control, managing dependencies with containers, describing workflows in a language such as CWL, and applying the FAIR principles to your data, you make it far more likely that your research can be reliably reproduced and shared.

By adopting these practices, you’ll not only improve the robustness and transparency of your own work, but also make it easier for others to build upon your research in the future.
91 changes: 91 additions & 0 deletions src/content/docs/guides/arc-practical-entry.mdx
@@ -0,0 +1,91 @@
---
title: Creating an ARC for Your Project
lastUpdated: 2024-11-13
authors:
- dominik-brilhaus
sidebar:
  order: 0
  badge:
    text: new
    variant: tip
---

Did you follow Viola's steps in the [start here](/nfdi4plants.knowledgebase/start-here/) guide and now feel overwhelmed? That is understandable: a guide tailored to a demo dataset is a very different story from doing the same with your own complex data.
Here we provide recommendations and considerations for structuring an ARC based on **your current project and datasets**. Remember: creating an ARC is an ongoing process, and it’s meant to evolve over time.


## The "final" ARC does not exist – Immutable, Yet Evolving!

Think of your ARC as an evolving entity that adapts and improves as your project progresses.

- **Don't aim for perfection right away:** Your ARC doesn't need to be flawless at first, and nobody expects you to win an award for the best ARC. The goal is for it to be useful to **you**. As long as your ARC serves its purpose (organizing data, tracking workflows, or aiding reproducibility), that’s a win.
- **Priorities vary across researchers:** Different people may have different ideas about what should be made FAIR first and what can be polished later. Allow yourself to start with the basics and improve it step by step.

So, **don't stress** about making your ARC perfect from the get-go—focus on making it functional.

## Start Simple: Just Dump the Files Into Your ARC

An ARC’s core principle is that "everything is a file." It’s common to work with a collection of files and folders in your daily research. Why not just start by organizing them into an ARC?

- **Initial File Dump:** At first, don’t worry too much about the precise structure. Simply place your files into an “**additional payload**” folder within the ARC. This will help you get started without overthinking the details.
- **Version Control with Git:** By putting your files in the ARC, you instantly gain the benefit of [version control through Git](/nfdi4plants.knowledgebase/fundamentals/version-control-git). This helps you track changes and maintain a history of your files.
- **Safe Backup via [DataHUB](/nfdi4plants.knowledgebase/datahub):** Once you upload your ARC to the [DataHUB](/nfdi4plants.knowledgebase/datahub), you’ll also have a secure backup of your files.

:::tip
If you’re dealing with large files (e.g., raw sequencing data), you can initially store them anywhere. Just make sure they’re tracked with [Git LFS (Large File Storage)](/nfdi4plants.knowledgebase/git/git-lfs). This way, you can later move the LFS pointers into your ARC without dealing with the actual large files.
:::

## Add Metadata to Make Your ARC More Shareable and Citable

Next, enrich your ARC with some **basic metadata**:

- **Project and Creator Info:** Include metadata about your project and the researchers involved. This step makes your ARC more shareable and **citable** from the start.
- **Link to the Investigation:** Add this metadata to your `investigation` section. This is an easy way to ensure your work is discoverable and properly credited.

## Sketch Your Laboratory Workflows

A key goal of an ARC is to trace each finding or result back to its originating biological experiment. To achieve this, your ARC will need to link dataset files to individual samples through a series of **processes** (laboratory or computational steps) with defined **inputs** and **outputs**.

- **Map Out Your Lab Workflows:** Before diving into the structure of your ARC, take some time to **sketch** what you did in the lab. What experiments did you perform? What samples did you analyze? Which protocols did you follow? This sketch will help you understand how to organize your data and workflows later.

---

## Organize Your Files into `studies` and `assays`

Once you have a better understanding of your lab processes, you can begin organizing your ARC:

- **Define `studies` and `assays`:** Structure your data by moving files into relevant folders, such as `studies` and `assays`. This makes it clear where the raw data (`dataset`) is stored and which protocols were used to generate that data.
- **Reference Protocols:** As you organize, simply reference the **existing protocols** (stored as free-text documents) in your ARC. This ensures consistency without overwhelming you with unnecessary details at this stage.

## Simple First: Link `Input` and `Output` Nodes

Before delving into complex parameterization or detailed annotation tables, start simple:

- **Connect Inputs and Outputs:** Begin by connecting your `studies` and `assays` through **input** and **output** nodes. This allows you to trace the flow of data through your workflows without getting bogged down by excessive detail.
- **Re-draw Lab Workflows:** At this stage, you can essentially redraw your lab workflows as tables, mapping each process step to its inputs and outputs.
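The idea of linking process steps through input and output nodes can be sketched in a few lines of code. This is an illustration only, not the ARC data model or any ARC tool: the `Process` class, the function `trace_to_sources`, and all sample and file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """One lab or computational step with named input and output nodes."""
    name: str
    inputs: tuple
    outputs: tuple


def trace_to_sources(node, processes):
    """Walk backwards from a node to the nodes that no process produced."""
    producers = {out: proc for proc in processes for out in proc.outputs}
    sources, stack, seen = [], [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        proc = producers.get(current)
        if proc is None:
            sources.append(current)  # nothing produced it: an original source
        else:
            stack.extend(proc.inputs)
    return sorted(sources)


# Hypothetical workflow: one row per process, as in a simple table.
processes = [
    Process("plant growth", ("seed_batch_A",), ("leaf_sample_1", "leaf_sample_2")),
    Process("RNA extraction", ("leaf_sample_1",), ("rna_1",)),
    Process("sequencing", ("rna_1",), ("reads_1.fastq.gz",)),
]
print(trace_to_sources("reads_1.fastq.gz", processes))  # ['seed_batch_A']
```

Each `Process` corresponds to one row of a table with an input and an output column. Once every step is written down this way, a data file can be followed back to the biological material it came from.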

## Parameterize Your Protocols for Machine Readability

Once you have the basic structure in place, you can start making your data more **machine-readable** and **searchable**:

- **Parameterize Protocols:** To improve reproducibility, break down your protocols and workflows into structured annotation tables. This will allow you to capture the parameters used at each step of your research.
- **Make It Searchable:** Structured parameters can be searched and filtered, which makes your study more **discoverable** and your methods clearer and easier to reproduce.
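To make the difference between a free-text protocol and a parameterized one concrete, the sketch below stores two hypothetical extraction steps as table rows and then queries them. The column names, values, and helper functions (`to_tsv`, `find`) are illustrative assumptions and do not follow the actual annotation table headers used in an ARC.

```python
import csv
import io

# Illustrative column names, not the headers of a real ARC annotation table.
FIELDS = ["input", "incubation_temperature_celsius", "centrifugation_speed_g", "output"]

# Hypothetical parameters extracted from a free-text protocol.
rows = [
    {"input": "leaf_sample_1", "incubation_temperature_celsius": 56,
     "centrifugation_speed_g": 12000, "output": "rna_1"},
    {"input": "leaf_sample_2", "incubation_temperature_celsius": 56,
     "centrifugation_speed_g": 10000, "output": "rna_2"},
]


def to_tsv(rows, fields):
    """Serialize the rows as a tab-separated table with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def find(rows, **criteria):
    """Return the rows whose values match every key=value criterion."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


table = to_tsv(rows, FIELDS)
print(table)
print(find(rows, centrifugation_speed_g=12000))
```

A question such as "which samples were centrifuged at 12000 g?" becomes a simple lookup once the parameter has its own column, which is difficult to achieve with a free-text document.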

## Keep It Simple for Your Data Analysis Workflows

The same approach applies to your data analysis workflows:

- **Treat Data Analysis as Protocols:** Regardless of whether your data analysis involves graphical (point-and-click) software or custom code, treat it like a **protocol**. For now, just store the results in your `dataset` folder.
- **Iterate as You Go:** You don’t need to go into deep detail at first. Just focus on capturing the core analysis steps, and refine them later as your project progresses.

## Making Data Analysis More Reproducible: Use CWL, Containers, and Dependency Management

If you want to make your data analysis more **reproducible** and ensure that your workflows are **easily reusable**, consider wrapping your analysis tools in **CWL** (Common Workflow Language) and using **containers**:

- **CWL for Reproducibility:** Use CWL to describe your computational workflows in a standardized way. This ensures that others can run your analysis with the same inputs and parameters, regardless of their system.
- **Containerization:** Leverage Docker or Singularity containers to encapsulate all software dependencies. This makes it easier to share your workflows and ensures they run consistently across different environments.
- **Manage Dependencies:** Use tools like Conda or Docker to manage your software dependencies, avoiding issues with mismatched versions or missing libraries.

## Conclusion: The ARC is a Living FAIR Digital Object

The process of creating an ARC is **gradual** and **evolving**. Start simple, and focus on getting the basics in place. Over time, you can refine and enhance your ARC to improve its usefulness and functionality, making it a valuable tool for organizing, sharing, and reproducing your research.
209 changes: 0 additions & 209 deletions src/content/docs/vault/ARC-practical-entry-stepwise.md

This file was deleted.
