A pipeline to process Nanopore reads and transfer the results to the end users.
git clone git@github.com:maxplanck-ie/nanoporeReads_dataTransfer.git
cd nanoporeReads_dataTransfer
mamba env create -n ont -f env.yaml
mamba activate ont
pip install .
On Apple M1/M2 (arm64), many conda packages are not yet available. Use the following instead:
CONDA_SUBDIR=osx-64 mamba create -n ont -f env.yaml
The key functionality is implemented as snakemake workflows. From version 2.0.0, two different snakemake rule sets are supported, each centered around a different basecaller:
- rules_dorado: a dorado-based workflow.
A wrapper Python script (ont.py) implements:
- the continuous screening of the source directory,
- the generation of a flowcell-specific configuration file, and
- the communication with end users (emails etc.)
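The screening step can be pictured with a minimal sketch. It assumes that a flowcell counts as processed once its work directory contains the analysis.done flag; the function name and the directory names in the demo are hypothetical and are not taken from ont.py.

```python
import tempfile
from pathlib import Path


def find_unprocessed_flowcells(offload_dir: Path, output_dir: Path) -> list[str]:
    """Return the names of flowcell directories in offload_dir for which
    output_dir/<flowcell>/analysis.done does not exist (assumed convention)."""
    pending = []
    for flowcell in sorted(p for p in offload_dir.iterdir() if p.is_dir()):
        done_flag = output_dir / flowcell.name / "analysis.done"
        if not done_flag.exists():
            pending.append(flowcell.name)
    return pending


# Small demo on throw-away directories (hypothetical flowcell names).
with tempfile.TemporaryDirectory() as tmp:
    offload = Path(tmp) / "offload"
    output = Path(tmp) / "output"
    for name in ("flowcell_A", "flowcell_B"):
        (offload / name).mkdir(parents=True)
    # flowcell_A has already been processed
    (output / "flowcell_A").mkdir(parents=True)
    (output / "flowcell_A" / "analysis.done").touch()

    print(find_unprocessed_flowcells(offload, output))  # prints ['flowcell_B']
```

A continuous screen would call such a function in a loop with a sleep interval between iterations; how ont.py schedules this is not shown here.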
The main configuration file (config.yaml) specifies:
- the path of the rule set to be used (rulesPath: rules or rules_dorado),
- the overall directory structure (see below)
- organism-specific paths (e.g. genome and transcriptome locations)
- communication settings (email, Parkour LIMS, sambahost)
- generic parameters (basecalling, mapping)
Note that the generic configuration defined by this file is extended with project-specific entries for each incoming flowcell.
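One way to picture this extension is a recursive dictionary merge, sketched below. The function name, the nesting of the keys, and the example values are assumptions made for illustration; only rulesPath, offloadDir, outputDir and groupDir are names that appear in this README, and the pipeline's actual merge logic may differ.

```python
import copy


def expand_config(generic: dict, flowcell_entries: dict) -> dict:
    """Return a new config holding the generic settings plus the
    flowcell-specific entries. Nested dicts are merged recursively and
    flowcell-specific values win on conflicts. Inputs are not modified."""
    merged = copy.deepcopy(generic)
    for key, value in flowcell_entries.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = expand_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical generic settings, as they might be loaded from config.yaml
generic = {
    "rulesPath": "rules_dorado",
    "paths": {"offloadDir": "/data/offload", "outputDir": "/data/work"},
}
# Hypothetical entries derived from one incoming flowcell
flowcell_entries = {
    "paths": {"groupDir": "/data/groups"},
    "project": "Project_projectID_User_Group",
}

flowcell_config = expand_config(generic, flowcell_entries)
print(flowcell_config["paths"])
```

Returning a fresh dictionary keeps the generic configuration unchanged, so it can be reused for the next flowcell.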
Additional configuration files are:
- env.yaml (for conda installation of all dependencies)
- multiqc_config.yaml (to customize multiqc output)
ont -c config.yaml
The workflow connects and relies on three main data locations:
- A source directory (offloadDir) is screened for the arrival of new and unprocessed flowcells.
- A work directory (outputDir) is used for various processing steps (merging, basecalling, demultiplexing, alignment, quality controls).
- The target directory (groupDir) receives the analysis results in a project-wise manner.
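The hand-over from the work directory to the target directory can be sketched as a plain project-wise copy. This is an illustration under assumptions: the function name is hypothetical, the Project_* pattern is inferred from the example directory names in this README, and the pipeline itself may use a different transfer mechanism.

```python
import shutil
import tempfile
from pathlib import Path


def deliver_results(flowcell_out: Path, group_dir: Path) -> list[Path]:
    """Copy every Project_* directory found in <flowcell_out>/transfer
    into group_dir and return the list of created target directories."""
    delivered = []
    for project in sorted((flowcell_out / "transfer").glob("Project_*")):
        if project.is_dir():
            target = group_dir / project.name
            # dirs_exist_ok (Python >= 3.8) allows repeated deliveries
            shutil.copytree(project, target, dirs_exist_ok=True)
            delivered.append(target)
    return delivered


# Demo on throw-away directories with a hypothetical project layout.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work" / "flowcell"
    data = work / "transfer" / "Project_projectID_User_Group" / "Data"
    data.mkdir(parents=True)
    (data / "sample.bam").touch()

    group = Path(tmp) / "group" / "flowcell"
    group.mkdir(parents=True)
    for target in deliver_results(work, group):
        print(target.name)  # prints Project_projectID_User_Group
```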
The details are rule-set dependent. An annotated example for rules_dorado is given below.
The source directory (offloadDir) is generated by the sequencing machine, and its layout may change in response to technological developments.
../path/to/flowcell/
.
├── bam_pass # from fast basecalling
├── barcode_alignment_PAS33554_6b0029ab_a0fbcf5b.tsv
├── fastq_pass # from fast basecalling
├── final_summary_PAS33554_6b0029ab_a0fbcf5b.txt
├── other_reports
├── pod5_pass # pod5 format
├── pore_activity_PAS33554_6b0029ab_a0fbcf5b.csv
├── report_PAS33554_20230928_1016_6b0029ab.html
├── report_PAS33554_20230928_1016_6b0029ab.json
├── report_PAS33554_20230928_1016_6b0029ab.md
├── SampleSheet.csv # sample sheet information
├── sample_sheet_PAS33554_20230928_1016_6b0029ab.csv
├── sequencing_summary_PAS33554_6b0029ab_a0fbcf5b.txt
└── throughput_PAS33554_6b0029ab_a0fbcf5b.csv
../path/to/flowcell
.
├── analysis.done # flag to signal that this flowcell has been fully processed
├── bam # output from basecalling in bam format (including modification calls)
├── bam_demux # demultiplexed samples (empty if no barcoding)
├── benchmarks # benchmarks for each rule
├── benchmarks_combined.tsv # combined benchmark file
├── flags # directory with flags from snakemake rules
├── log # log files (rule-specific)
├── pipeline_config.yaml # configfile (snakemake & more)
├── pod5 # directory with merged pod5 file (from offloadDir)
├── reports # directory with reports and SampleSheet.csv (from offloadDir)
├── summary # summary files (DAG, disk status)
└── transfer # analysis output that will be transferred
transfer/
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna # analysis directory (exists only if genome is known)
    │   ├── 23L000329_WT_rep1.align.bam # alignment
    │   ├── 23L000329_WT_rep1.align.bam.bai # index
    │   └── 23L000329_WT_rep1.align.bed.gz # modification calls
    ├── Data
    │   ├── 23L000329_WT_rep1.bam # basecalled sequences
    │   ├── 23L000329_WT_rep1.fastq.gz # basecalled sequences (fastq - deprecated)
    │   ├── 23L000329_WT_rep1_porechop.fastq.gz # adaptors, barcodes trimmed
    │   └── 23L000329_WT_rep1.seqsum # sequencing summaries (for pycoQC etc.)
    └── QC
        ├── multiqc
        │   ├── multiqc_data
        │   └── multiqc_report.html # multiqc report
        ├── sample_names.tsv # dictionary sampleID-sampleName
        └── Samples # sample-wise quality controls
            ├── 23L000329_WT_rep1.align.flagstat
            ├── 23L000329_WT_rep1.align_pycoqc.html
            ├── 23L000329_WT_rep1.align_pycoqc.json
            ├── 23L000329_WT_rep1_fastqc.html
            ├── 23L000329_WT_rep1_fastqc.zip
            ├── 23L000329_WT_rep1_kraken.report
            ├── 23L000329_WT_rep1_porechop.info
            ├── 23L000329_WT_rep1_pycoqc.html
            ├── 23L000329_WT_rep1_pycoqc.json
            ├── all_porechop.best_end
            ├── all_porechop.best_start
            └── all_porechop.trimmed
../user_path/to/flowcell/ (identical to outputDir/transfer)
.
├── metadata.yaml
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna
    ├── Data
    └── QC