Merge pull request #33 from rki-mf1/dev

Merge dev
rki-mf1 · Apr 10, 2024 · bb65382
2 parents 322aa28 + 89aced6

Showing 44 changed files with 584 additions and 1,671 deletions.
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -9,21 +9,20 @@ on:
 # designed as in: https://github.com/marketplace/actions/setup-miniconda
 jobs:
   CI:
-    name: CI tests using linux
+    name: CI (Linux)
     runs-on: "ubuntu-latest"
     defaults:
       run:
         shell: bash -el {0}
     steps:
-      - uses: actions/checkout@v2
-      - uses: conda-incubator/setup-miniconda@v2
+      - uses: actions/checkout@v4
+      - uses: conda-incubator/setup-miniconda@v3
         with:
           miniconda-version: "latest"
-          python-version: "3.11.3"
-          activate-environment: snakemake7
-          environment-file: env/conda_snakemake7.yaml
+          activate-environment: nextflow
+          environment-file: env/conda_nxf.yml
           channels: conda-forge,bioconda,defaults
-          channel-priority: strict
+          channel-priority: true
           auto-activate-base: false
 
       - name: Test conda installation
@@ -33,29 +32,22 @@ jobs:
           conda config --show-sources
           conda config --show
 
-      - name: Test snakemake installation
+      - name: Test nextflow installation
         run: |
-          snakemake --version
+          nextflow -version
 
       - name : Download reference
         run: |
           wget https://www.ebi.ac.uk/ena/browser/api/fasta/MN908947.3
           sed 's/>ENA|MN908947|MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome./>MN908947.3/g' MN908947.3 > MN908947.3.fasta
-
-      - name: Test CIEVaD principal functionality
-        run: |
-          python cievad.py --help
+          mkdir -p reference/Sars-Cov-2/Wuhan-Hu-1/
+          mv MN908947.3.fasta reference/Sars-Cov-2/Wuhan-Hu-1/
       
       - name: Test haplotype simulation
         run: |
-          python cievad.py hap -n 3 -r MN908947.3.fasta
+          nextflow run hap.nf -profile local,conda
 
-      - name: Test NGS simulation
+      - name: Test callset evaluation
         run: |
-          python cievad.py ngs -n 3 -f 1000 
+          nextflow run eval.nf -profile local,conda --callsets_dir aux/ci_data/
 
-      - name: Test Nanopore simulation
-        run: |
-          python cievad.py nano -n 3 -r 100
-
-      
diff --git a/.gitignore b/.gitignore
@@ -18,4 +18,8 @@ results/
 
 aux/nanosim_model/human_NA12878_DNA_FAB49712_guppy.tar.gz
 
-*.pyc
+*.pyc
+
+.nextflow.log*
+.nextflow/
+work/
diff --git a/README.md b/README.md
@@ -1,29 +1,31 @@
 ![Static Badge](https://img.shields.io/badge/requires-conda-blue)
-![Static Badge](https://img.shields.io/badge/requires-snakemake-blue)
+![Static Badge](https://img.shields.io/badge/requires-nextflow-blue)
 
 # CIEVaD
-<ins>C</ins>ontinuous <ins>I</ins>ntegration and <ins>E</ins>valuation for <ins>Va</ins>riant <ins>D</ins>etection. This repository provides a tool suite for simple, streamlined and rapid creation and evaluation of genomic variant callsets. It is primarily designed for continuous integration of variant detection software and a plain containment check between sets of variants. The tools suite utilizes the _conda_ package management system and _Snakemake_ workflow language.
+<ins>C</ins>ontinuous <ins>I</ins>ntegration and <ins>E</ins>valuation for <ins>Va</ins>riant <ins>D</ins>etection. This repository provides a tool suite for simple, streamlined and rapid creation and evaluation of genomic variant callsets. It is primarily designed for continuous integration of variant detection software and a plain containment check between sets of variants. The tools suite utilizes the _conda_ package management system and _nextflow_ workflow language.
 
 ## Contents:
 1. [System requirements](#system-requirements)
 2. [Installation](#installation)
 3. [Usage](#usage)
-4. [Help](#help)
+4. [Output](#output)
+5. [Help](#help)
 
 
 ## System requirements:
 
-This tool suite was developed under Linux/UNIX and is the only officially supported operating system here.
-Having any derivative of the `conda` package management system installed is the only strict system requirement.
-Having a recent `snakemake` (≥6.0.0) and `python` (≥3.2) version installed is required too but both can be installed via conda (see [Installation](#installation)).
+This tool suite was developed for Linux, which is the only officially supported operating system.
+Having any derivative of the conda package management system installed is the only strict system requirement.
+A recent version (≥20.04.0) of nextflow is required to execute the workflows, but can easily be installed via conda.
+For instructions on installing nextflow via conda, see [Installation](#installation).
 
-<details><summary>🛠️ See tested setups: </summary>
+<details><summary>🛠️ See list of tested setups: </summary>
 
 | Requirement | Tested with |
 | --- | --- |
-| 64 bits operating system | Ubuntu 20.04.5 LTS |
-| [Conda](https://docs.conda.io/en/latest/) | vers. 23.5.0 |
-| [Snakemake](https://snakemake.readthedocs.io/en/stable/) | vers. 7.25.3 |
+| 64-bit Linux operating system | Ubuntu 20.04.5 LTS |
+| [Conda](https://docs.conda.io/en/latest/) | vers. 23.5.0, 24.1.2 |
+| [Nextflow](https://nextflow.io/) | vers. 20.04.0, 23.10.1 |
 
 </details>
 
@@ -32,43 +34,59 @@ Having a recent `snakemake` (≥6.0.0) and `python` (≥3.2) version installed i
 
 1. Download the repository:
 ```
-git clone https://github.com/rki-mf1/imsmp-variant-calling-benchmark.git
+git clone https://github.com/rki-mf1/cievad.git
 ```
 
-2. [Optional] Install Snakemake if not yet on your system. You can use the conda environment description file provided in this repository:
+2. [Optional] Install nextflow if it is not yet on your system. As good practice, use a new conda environment:
 ```
 conda deactivate
-conda env create -f env/conda_snakemake7.yaml
-conda activate snakemake7
+conda create -n cievad -c bioconda nextflow
+conda activate cievad
 ```
 
 
 ## Usage:
 
-This tool suite provides multiple workflows to generate synthetic sequencing data and evaluate sets of predicted variants (callsets).
-A full list of workflows, their respective modules in the python command line interface (CLI) and a detailed description of input and output files can be found in this [wiki](https://github.com/rki-mf1/imsmp-variant-calling-benchmark/wiki) page of the repository.
-The current list of principal functionality is:
-* Generating synthetic haplotypes from a given reference genome
-* Generating synthetic NGS reads from a given haplotype
-* Generating synthetic amplicon sequences from a given reference genome and generating synthetic NGS reads from the amplicons
-* Generating synthetic long-reads from a given haplotype
-* Evaluate compliance between sets of variants
-
-The repository provides a simple CLI for a convenient application-like user experience with the underlying Snakemake workflows.
-The CLI is started from the root directory via
+This tool suite provides multiple functional features to generate synthetic sequencing data, generate sets of ground truth variants (truthsets) and evaluate sets of predicted variants (callsets).
+There are two main workflows, `hap.nf` and `eval.nf`. 
+Both workflows are executed via the nextflow command line interface (CLI).
+The current list and roadmap of principal functionality is:
+* [x] Generating synthetic haplotypes from a given reference genome. This returns a haplotype sequence (FASTA) and its set of variants (VCF) with respect to the reference.
+* [x] Generating synthetic NGS reads from a given haplotype
+* [ ] Generating synthetic amplicon sequences from a given reference genome and generating synthetic reads from those amplicons
+* [ ] Generating synthetic long-reads from a given haplotype
+* [x] Evaluate compliance between sets of variants
+
+### Generating haplotype data
+The minimal command to generate haplotype data is
 ```
-python cievad.py --help
+nextflow run hap.nf -profile local,conda
 ```
-and each individual module provides another help page via its sub-command
+
+### Evaluating variant calls
+The minimal command to evaluate the agreement between a truthset (generated data) and a callset is
 ```
-python cievad.py <module> --help
+nextflow run eval.nf -profile local,conda --callsets_dir <path/to/callsets>
 ```
+where `--callsets_dir` is the parameter to specify a folder containing the callset VCF files.
+Currently, a callset within this folder has to follow the naming convention `callset_<X>.vcf[.gz]` where _\<X\>_ is the integer of the corresponding truthset.
+Callsets can optionally be _gzip_ compressed.
+
+🚧 For convenience, `eval.nf` will get an option to accept a sample sheet as an alternative input format in the future.
 
 <details><summary>⚠️ Run commands from the root directory </summary>
 Without further ado, please run the commands from a terminal at the top folder (root directory) of this repository.
 Otherwise relative paths within the workflows might be invalid.
 </details>
 
+### Tuning the workflows via CLI parameters
+\<TODO\>
+
+### Tuning the workflows via the config file
+\<TODO\>
+
+## Output
+\<TODO\>
 
 ## Help:
 

diff --git a/aux/ci_data/README.md b/aux/ci_data/README.md
@@ -0,0 +1,3 @@
+# CI Data
+
+(10.04.2024) The `callset_{1,2,3}.vcf.gz` files are the original `hap{1,2,3}.filtered.gt_adjust.filtered_indels.vcf.gz` VCF files under new names; they contain variants from the CovPipe2 workflow using default parameters.
diff --git a/aux/ci_data/callset_1.vcf.gz b/aux/ci_data/callset_1.vcf.gz
diff --git a/aux/ci_data/callset_2.vcf.gz b/aux/ci_data/callset_2.vcf.gz
diff --git a/aux/ci_data/callset_3.vcf.gz b/aux/ci_data/callset_3.vcf.gz
diff --git a/bin/SURVIVOR b/bin/SURVIVOR
diff --git a/bin/amplisim-v0_1_0-ubuntu_20_04 b/bin/amplisim-v0_1_0-ubuntu_20_04
diff --git a/bin/mason_simulator b/bin/mason_simulator
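
Note on the callset naming convention described in the README.md hunk above (`callset_<X>.vcf[.gz]`, where `<X>` is the integer of the corresponding truthset): a directory can be checked against that convention before it is passed to `--callsets_dir`. The following is a minimal sketch and not part of this repository. The helper names `index_callsets` and `index_callsets_dir` are hypothetical, and the regular expression assumes `<X>` is written as plain decimal digits, which the README does not state explicitly.

```python
import re
from pathlib import Path

# Assumed reading of the README convention: callset_<X>.vcf or callset_<X>.vcf.gz,
# with <X> written as decimal digits.
CALLSET_RE = re.compile(r"callset_(\d+)\.vcf(\.gz)?")


def index_callsets(names):
    """Split file names into ({truthset index: name}, [non-conforming names])."""
    matched, rejected = {}, []
    for name in names:
        m = CALLSET_RE.fullmatch(name)  # the whole name must match
        if m:
            matched[int(m.group(1))] = name
        else:
            rejected.append(name)
    return matched, rejected


def index_callsets_dir(callsets_dir):
    """Apply index_callsets to the regular files directly inside a directory."""
    names = sorted(p.name for p in Path(callsets_dir).iterdir() if p.is_file())
    return index_callsets(names)
```

Calling `index_callsets_dir("aux/ci_data/")` from the repository root would be one way to use it; any name returned in the second list does not follow the convention. How `eval.nf` itself treats such files is not shown in this diff.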