Merge pull request #33 from rki-mf1/dev
Merge dev
Krannich479 authored Apr 10, 2024
2 parents 322aa28 + 89aced6 commit bb65382
Showing 44 changed files with 584 additions and 1,671 deletions.
34 changes: 13 additions & 21 deletions .github/workflows/tests.yml
@@ -9,21 +9,20 @@ on:
# designed as in: https://github.com/marketplace/actions/setup-miniconda
jobs:
CI:
name: CI tests using linux
name: CI (Linux)
runs-on: "ubuntu-latest"
defaults:
run:
shell: bash -el {0}
steps:
- uses: actions/checkout@v2
- uses: conda-incubator/setup-miniconda@v2
- uses: actions/checkout@v4
- uses: conda-incubator/setup-miniconda@v3
with:
miniconda-version: "latest"
python-version: "3.11.3"
activate-environment: snakemake7
environment-file: env/conda_snakemake7.yaml
activate-environment: nextflow
environment-file: env/conda_nxf.yml
channels: conda-forge,bioconda,defaults
channel-priority: strict
channel-priority: true
auto-activate-base: false

- name: Test conda installation
@@ -33,29 +32,22 @@ jobs:
conda config --show-sources
conda config --show
- name: Test snakemake installation
- name: Test nextflow installation
run: |
snakemake --version
nextflow -version
- name : Download reference
run: |
wget https://www.ebi.ac.uk/ena/browser/api/fasta/MN908947.3
sed 's/>ENA|MN908947|MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome./>MN908947.3/g' MN908947.3 > MN908947.3.fasta
- name: Test CIEVaD principal functionality
run: |
python cievad.py --help
mkdir -p reference/Sars-Cov-2/Wuhan-Hu-1/
mv MN908947.3.fasta reference/Sars-Cov-2/Wuhan-Hu-1/
- name: Test haplotype simulation
run: |
python cievad.py hap -n 3 -r MN908947.3.fasta
nextflow run hap.nf -profile local,conda
- name: Test NGS simulation
- name: Test callset evaluation
run: |
python cievad.py ngs -n 3 -f 1000
nextflow run eval.nf -profile local,conda --callsets_dir aux/ci_data/
- name: Test Nanopore simulation
run: |
python cievad.py nano -n 3 -r 100
6 changes: 5 additions & 1 deletion .gitignore
@@ -18,4 +18,8 @@ results/

aux/nanosim_model/human_NA12878_DNA_FAB49712_guppy.tar.gz

*.pyc
*.pyc

.nextflow.log*
.nextflow/
work/
74 changes: 46 additions & 28 deletions README.md
@@ -1,29 +1,31 @@
![Static Badge](https://img.shields.io/badge/requires-conda-blue)
![Static Badge](https://img.shields.io/badge/requires-snakemake-blue)
![Static Badge](https://img.shields.io/badge/requires-nextflow-blue)

# CIEVaD
<ins>C</ins>ontinuous <ins>I</ins>ntegration and <ins>E</ins>valuation for <ins>Va</ins>riant <ins>D</ins>etection. This repository provides a tool suite for simple, streamlined and rapid creation and evaluation of genomic variant callsets. It is primarily designed for continuous integration of variant detection software and a plain containment check between sets of variants. The tools suite utilizes the _conda_ package management system and _Snakemake_ workflow language.
<ins>C</ins>ontinuous <ins>I</ins>ntegration and <ins>E</ins>valuation for <ins>Va</ins>riant <ins>D</ins>etection. This repository provides a tool suite for simple, streamlined and rapid creation and evaluation of genomic variant callsets. It is primarily designed for continuous integration of variant detection software and a plain containment check between sets of variants. The tool suite uses the _conda_ package management system and the _nextflow_ workflow language.

## Contents:
1. [System requirements](#system-requirements)
2. [Installation](#installation)
3. [Usage](#usage)
4. [Help](#help)
4. [Output](#output)
5. [Help](#help)


## System requirements:

This tool suite was developed under Linux/UNIX and is the only officially supported operating system here.
Having any derivative of the `conda` package management system installed is the only strict system requirement.
Having a recent `snakemake` (≥6.0.0) and `python` (≥3.2) version installed is required too but both can be installed via conda (see [Installation](#installation)).
This tool suite was developed for Linux, which is the only officially supported operating system.
Having any derivative of the conda package management system installed is the only strict system requirement.
A recent version (≥20.04.0) of nextflow is required to execute the workflows, but it can easily be installed via conda.
For instructions on installing nextflow via conda, see [Installation](#installation).
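
The minimum version stated above can be checked by comparing dotted version strings. The following is a minimal sketch, not part of CIEVaD: the helper names `parse_version` and `meets_minimum` are hypothetical, and it assumes plain numeric versions such as `23.10.1` rather than parsing the output of the nextflow CLI.

```python
# Hypothetical helper, not part of CIEVaD: compares dotted numeric version
# strings against the minimum nextflow version stated in the README.
MIN_NEXTFLOW = "20.04.0"


def parse_version(version: str) -> tuple:
    """Turn a string like '23.10.1' into a tuple of integers (23, 10, 1)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version: str, minimum: str = MIN_NEXTFLOW) -> bool:
    """Return True if `version` is at least `minimum` (component-wise compare)."""
    return parse_version(version) >= parse_version(minimum)
```

Comparing integer tuples instead of raw strings avoids ordering mistakes such as `"9.0.0" > "20.04.0"`.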

<details><summary>🛠️ See tested setups: </summary>
<details><summary>🛠️ See list of tested setups: </summary>

| Requirement | Tested with |
| --- | --- |
| 64 bits operating system | Ubuntu 20.04.5 LTS |
| [Conda](https://docs.conda.io/en/latest/) | vers. 23.5.0 |
| [Snakemake](https://snakemake.readthedocs.io/en/stable/) | vers. 7.25.3 |
| 64-bit Linux operating system | Ubuntu 20.04.5 LTS |
| [Conda](https://docs.conda.io/en/latest/) | vers. 23.5.0, 24.1.2 |
| [Nextflow](https://nextflow.io/) | vers. 20.04.0, 23.10.1 |

</details>

@@ -32,43 +34,59 @@ Having a recent `snakemake` (≥6.0.0) and `python` (≥3.2) version installed i

1. Download the repository:
```
git clone https://github.com/rki-mf1/imsmp-variant-calling-benchmark.git
git clone https://github.com/rki-mf1/cievad.git
```

2. [Optional] Install Snakemake if not yet on your system. You can use the conda environment description file provided in this repository:
2. [Optional] Install nextflow if not yet on your system. As good practice, use a new conda environment:
```
conda deactivate
conda env create -f env/conda_snakemake7.yaml
conda activate snakemake7
conda create -n cievad -c bioconda nextflow
conda activate cievad
```


## Usage:

This tool suite provides multiple workflows to generate synthetic sequencing data and evaluate sets of predicted variants (callsets).
A full list of workflows, their respective modules in the python command line interface (CLI) and a detailed description of input and output files can be found in this [wiki](https://github.com/rki-mf1/imsmp-variant-calling-benchmark/wiki) page of the repository.
The current list of principal functionality is:
* Generating synthetic haplotypes from a given reference genome
* Generating synthetic NGS reads from a given haplotype
* Generating synthetic amplicon sequences from a given reference genome and generating synthetic NGS reads from the amplicons
* Generating synthetic long-reads from a given haplotype
* Evaluate compliance between sets of variants

The repository provides a simple CLI for a convenient application-like user experience with the underlying Snakemake workflows.
The CLI is started from the root directory via
This tool suite provides workflows to generate synthetic sequencing data, generate sets of ground truth variants (truthsets) and evaluate sets of predicted variants (callsets).
There are two main workflows, `hap.nf` and `eval.nf`.
Both workflows are executed via the nextflow command line interface (CLI).
The current list and roadmap of principal functionality is:
* [x] Generating synthetic haplotypes from a given reference genome. This returns a haplotype sequence (FASTA) and its set of variants (VCF) with respect to the reference.
* [x] Generating synthetic NGS reads from a given haplotype
* [ ] Generating synthetic amplicon sequences from a given reference genome and generating synthetic reads from those amplicons
* [ ] Generating synthetic long-reads from a given haplotype
* [x] Evaluating compliance between sets of variants

### Generating haplotype data
The minimal command to generate haplotype data is
```
python cievad.py --help
nextflow run hap.nf -profile local,conda
```
and each individual module provides another help page via its sub-command

### Evaluating variant calls
The minimal command to evaluate the agreement between a truthset (generated data) and a callset is
```
python cievad.py <module> --help
nextflow run eval.nf -profile local,conda --callsets_dir <path/to/callsets>
```
where `--callsets_dir` is the parameter to specify a folder containing the callset VCF files.
Currently, a callset within this folder has to follow the naming convention `callset_<X>.vcf[.gz]` where _\<X\>_ is the integer index of the corresponding truthset.
Callsets can optionally be _gzip_ compressed.
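
The naming convention can be checked before launching `eval.nf`. The following is a minimal sketch, not part of CIEVaD: the function name `truthset_index` and the regular expression are assumptions derived only from the convention `callset_<X>.vcf[.gz]` stated above.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed pattern derived from the README convention callset_<X>.vcf[.gz];
# hypothetical helper, not code taken from the CIEVaD workflows.
CALLSET_PATTERN = re.compile(r"callset_([0-9]+)\.vcf(\.gz)?")


def truthset_index(filename: str) -> Optional[int]:
    """Return the truthset integer encoded in a callset file name.

    Only the final path component is checked; None is returned when the
    name does not follow the convention.
    """
    match = CALLSET_PATTERN.fullmatch(Path(filename).name)
    return int(match.group(1)) if match else None
```

For example, `truthset_index("aux/ci_data/callset_3.vcf.gz")` yields `3`, while a name such as `hap1.filtered.vcf.gz` yields `None` and would need to be renamed first.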

🚧 For convenience, `eval.nf` will in the future accept a sample sheet as an alternative input format.

<details><summary>⚠️ Run commands from the root directory </summary>
Run the commands from a terminal in the top folder (root directory) of this repository.
Otherwise, relative paths within the workflows might be invalid.
</details>

### Tuning the workflows via CLI parameters
\<TODO\>

### Tuning the workflows via the config file
\<TODO\>

## Output
\<TODO\>

## Help:

3 changes: 3 additions & 0 deletions aux/ci_data/README.md
@@ -0,0 +1,3 @@
# CI Data

(10.04.2024) The `callset_{1,2,3}.vcf.gz` files are renamed but otherwise original `hap{1,2,3}.filtered.gt_adjust.filtered_indels.vcf.gz` VCF files, containing variants from the CovPipe2 workflow run with default parameters.
Binary file added aux/ci_data/callset_1.vcf.gz
Binary file not shown.
Binary file added aux/ci_data/callset_2.vcf.gz
Binary file not shown.
Binary file added aux/ci_data/callset_3.vcf.gz
Binary file not shown.
Binary file removed bin/SURVIVOR
Binary file not shown.
Binary file removed bin/amplisim-v0_1_0-ubuntu_20_04
Binary file not shown.
Binary file removed bin/mason_simulator
Binary file not shown.