nanoporeReads_dataTransfer

A pipeline to process Nanopore reads and transfer the results to the end users.

Installation

git clone git@github.com:maxplanck-ie/nanoporeReads_dataTransfer.git
cd nanoporeReads_dataTransfer
mamba env create -n ont -f env.yaml 
mamba activate ont
pip install .

On Apple M1/M2 (arm64), many conda packages are not yet available. Use this instead:

CONDA_SUBDIR=osx-64 mamba create -n ont -f env.yaml

Implementation

The key functionality is implemented as snakemake workflows. From version 2.0.0, two snakemake rule sets are supported, each centered on a different basecaller:

  • rules_dorado: a dorado-based workflow.

A wrapper Python script (ont.py) implements:

  • the continuous screening of the source directory,
  • the generation of a flowcell-specific configuration file, and
  • the communication with the end user (emails etc.).
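
The screening step in the list above could look roughly like this. It is a minimal sketch, not the code in ont.py: it assumes that every subdirectory of the source directory is one flowcell, that the work directory mirrors the flowcell name, and that a flowcell counts as processed once an analysis.done flag exists there. The function name find_unprocessed_flowcells is hypothetical.

```python
from pathlib import Path


def find_unprocessed_flowcells(source_dir, work_dir):
    """Return the names of flowcell directories that have no analysis.done flag yet."""
    source_dir, work_dir = Path(source_dir), Path(work_dir)
    pending = []
    for entry in sorted(source_dir.iterdir()):
        if not entry.is_dir():
            continue  # ignore stray files in the source directory
        # Assumption: the work directory uses the same flowcell name as the source.
        done_flag = work_dir / entry.name / "analysis.done"
        if not done_flag.exists():
            pending.append(entry.name)
    return pending
```

A wrapper could call such a function periodically and start one snakemake run per returned flowcell.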

Configurations

The main configuration file (config.yaml) specifies:

  • the path of the rule set to be used (rulesPath: rules or rules_dorado),
  • the overall directory structure (see below),
  • organism-specific paths (e.g. genome and transcriptome locations),
  • communication settings (email, Parkour LIMS, sambahost), and
  • generic parameters (basecalling, mapping).

Note that the generic configuration defined by this file is extended with project-specific entries for each incoming flowcell.
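
One way to picture this extension step is a recursive merge of two dictionaries. The sketch below is an illustration only. It assumes that project-specific entries take precedence when a key occurs in both, and the keys mapping, threads, organism and flowcell are made-up examples rather than the pipeline's actual schema. Only rulesPath is taken from the description above.

```python
import copy


def merge_config(generic, project):
    """Return a new dict: generic settings overlaid with project-specific entries."""
    merged = copy.deepcopy(generic)
    for key, value in project.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Assumption: the project-specific value wins on conflict.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical values, for illustration only.
generic = {"rulesPath": "rules_dorado", "mapping": {"threads": 8}}
project = {"flowcell": "PAS33554", "mapping": {"organism": "mouse"}}
flowcell_config = merge_config(generic, project)
```

The generic dictionary is left untouched, so the same base configuration can be reused for the next flowcell.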

Additional configuration files are:

  • env.yaml (for conda installation of all dependencies)
  • multiqc_config.yaml (to customize multiqc output)

Usage

ont -c config.yaml

Directory structures

The workflow connects and relies on three main data locations:

  1. A source directory (offloadDir) is screened for the arrival of new, unprocessed flowcells.
  2. A work directory (outputDir) is used for the processing steps (merging, basecalling, demultiplexing, alignment, quality control).
  3. A target directory (groupDir) receives the analysis results, organized by project.

The details depend on the rule set. Annotated examples for rules_dorado are given below.

Example input path (offloadDir)

This directory is generated by the sequencing machine and may change in response to technological developments.

../path/to/flowcell/
.
├── bam_pass            # from fast basecalling
├── barcode_alignment_PAS33554_6b0029ab_a0fbcf5b.tsv
├── fastq_pass          # from fast basecalling
├── final_summary_PAS33554_6b0029ab_a0fbcf5b.txt
├── other_reports
├── pod5_pass           # pod5 format
├── pore_activity_PAS33554_6b0029ab_a0fbcf5b.csv
├── report_PAS33554_20230928_1016_6b0029ab.html
├── report_PAS33554_20230928_1016_6b0029ab.json
├── report_PAS33554_20230928_1016_6b0029ab.md
├── SampleSheet.csv     # sample sheet information
├── sample_sheet_PAS33554_20230928_1016_6b0029ab.csv
├── sequencing_summary_PAS33554_6b0029ab_a0fbcf5b.txt
└── throughput_PAS33554_6b0029ab_a0fbcf5b.csv
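
Because this layout may change, it can help to parse run information from the file names defensively. The sketch below reads the fields of a report file name such as the ones listed above. The interpretation of the fields (flowcell ID, date, time, run ID prefix) is an assumption based on the example names, and parse_report_name is a hypothetical helper, not part of the pipeline.

```python
import re
from datetime import datetime

# Assumed pattern: report_<flowcell>_<YYYYMMDD>_<HHMM>_<run id prefix>.<ext>
REPORT_NAME = re.compile(
    r"^report_(?P<flowcell>[A-Za-z0-9]+)_(?P<date>\d{8})_(?P<time>\d{4})"
    r"_(?P<run>[0-9a-f]+)\.(?:html|json|md)$"
)


def parse_report_name(filename):
    """Return the fields encoded in a report file name, or None if it does not match."""
    match = REPORT_NAME.match(filename)
    if match is None:
        return None
    timestamp = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M")
    return {"flowcell": match["flowcell"], "timestamp": timestamp, "run": match["run"]}
```

Returning None for unknown names lets a caller skip files that follow a newer naming scheme instead of failing.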

Example output path during processing (outputDir)

../path/to/flowcell
.
├── analysis.done            # flag to signal that this flowcell has been fully processed
├── bam                      # output from basecalling in bam format (including modification calls)
├── bam_demux                # demultiplexed samples (empty if no barcoding)
├── benchmarks               # benchmarks for each rule
├── benchmarks_combined.tsv  # combined benchmark file
├── flags                    # directory with flags from snakemake rules
├── log                      # log files (rule-specific)
├── pipeline_config.yaml     # configfile (snakemake & more)
├── pod5                     # directory with merged pod5 file (from offloadDir)
├── reports                  # directory with reports and SampleSheet.csv (from offloadDir)
├── summary                  # summary files (DAG, disk status)
└── transfer                 # analysis output that will be transferred

transfer/
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna                    # analysis directory (exists only if genome is known)
    │   ├── 23L000329_WT_rep1.align.bam       # alignment
    │   ├── 23L000329_WT_rep1.align.bam.bai   # index
    │   └── 23L000329_WT_rep1.align.bed.gz    # modification calls
    ├── Data
    │   ├── 23L000329_WT_rep1.bam             # basecalled sequences
    │   ├── 23L000329_WT_rep1.fastq.gz        # basecalled sequences (fastq - deprecated)
    │   ├── 23L000329_WT_rep1_porechop.fastq.gz # adapters and barcodes trimmed
    │   └── 23L000329_WT_rep1.seqsum            # sequencing summaries (for pycoQC etc.)
    └── QC
        ├── multiqc
        │   ├── multiqc_data
        │   └── multiqc_report.html            # multiqc report
        ├── sample_names.tsv                   # dictionary sampleID-sampleName
        └── Samples                            # sample-wise quality controls
            ├── 23L000329_WT_rep1.align.flagstat
            ├── 23L000329_WT_rep1.align_pycoqc.html
            ├── 23L000329_WT_rep1.align_pycoqc.json
            ├── 23L000329_WT_rep1_fastqc.html
            ├── 23L000329_WT_rep1_fastqc.zip
            ├── 23L000329_WT_rep1_kraken.report
            ├── 23L000329_WT_rep1_porechop.info
            ├── 23L000329_WT_rep1_pycoqc.html
            ├── 23L000329_WT_rep1_pycoqc.json
            ├── all_porechop.best_end
            ├── all_porechop.best_start
            └── all_porechop.trimmed
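
The layout above can be checked with a few lines of Python. The following sketch summarizes one project directory. It assumes that Data and QC are always present, that analysis directories are named Analysis_*, and that each sample has one <sample>.bam file in Data. The function describe_project is hypothetical.

```python
from pathlib import Path


def describe_project(project_dir):
    """Summarize a Project_* directory: missing subdirectories, analyses, samples."""
    project_dir = Path(project_dir)
    missing = [name for name in ("Data", "QC") if not (project_dir / name).is_dir()]
    # Analysis directories exist only if the genome is known, so none is not an error.
    analyses = sorted(p.name for p in project_dir.glob("Analysis_*") if p.is_dir())
    data_dir = project_dir / "Data"
    samples = []
    if data_dir.is_dir():
        # Assumption: one basecalled <sample>.bam per sample.
        samples = sorted(p.name[: -len(".bam")] for p in data_dir.glob("*.bam"))
    return {"missing": missing, "analyses": analyses, "samples": samples}
```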

Example output path for an end user (groupDir)

../user_path/to/flowcell/  (identical to outputDir/transfer)
.
├── metadata.yaml
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna
    ├── Data
    └── QC
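
The hand-over from outputDir/transfer to groupDir can be sketched as a plain directory copy. This is an assumption about the mechanism, not the pipeline's actual transfer code, and it does not write metadata.yaml. The function name transfer_projects is hypothetical, and the dirs_exist_ok argument requires Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def transfer_projects(transfer_dir, target_dir):
    """Copy every Project_* directory from transfer_dir into target_dir."""
    transfer_dir, target_dir = Path(transfer_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for project in sorted(transfer_dir.glob("Project_*")):
        if not project.is_dir():
            continue  # only project directories are handed over
        # dirs_exist_ok=True makes a repeated transfer overwrite files in place.
        shutil.copytree(project, target_dir / project.name, dirs_exist_ok=True)
        copied.append(project.name)
    return copied
```

Copying by project keeps the result organized in the same project-wise manner as the transfer directory.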