This repository contains a Nextflow pipeline for de novo genome assembly of long ONT reads. The pipeline was built to support genome assembly and analysis for the 1KSA project, a pilot project aimed at sequencing and assembling indigenous South African species. Detailed step-by-step instructions for running the pipeline on the CHPC can be found under files, in Instructions.txt.
This workflow uses the following tools:
- Dorado for basecalling
- Samtools for converting BAM files to FASTQ files
- NanoPlot for quality checks
- NanoFilt for filtering and trimming
- Flye for genome assembly
- Racon for first round assembly polishing
- Medaka for second round assembly polishing
- BUSCO for assembly quality assessment
- QUAST for assembly quality assessment
- KMC for counting of k-mers in DNA
If basecalling was done on the sequencing machine, the script can be used to run genome assembly once the FASTQ files have been concatenated.
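As a minimal sketch of the concatenation step, assuming the sequencer wrote several uncompressed `.fastq` files into one directory, a small shell function could merge them. The function name `concat_fastq` and both paths are hypothetical and are not part of this repository.

```shell
# Sketch only: merge every *.fastq file in a directory into a single file.
# concat_fastq, in_dir and out_file are illustrative names, not pipeline options.
concat_fastq() {
    local in_dir="$1"    # directory holding the per-chunk FASTQ files
    local out_file="$2"  # combined FASTQ file (keep it outside in_dir)
    local f
    : > "$out_file"      # create or empty the output file
    for f in "$in_dir"/*.fastq; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing matches
        cat "$f" >> "$out_file"
    done
}
```

For example, `concat_fastq fastq_pass combined.fastq` would merge a directory named `fastq_pass` (a placeholder name) into `combined.fastq`. Gzip-compressed FASTQ files can usually be merged the same way with `cat`, since concatenated gzip streams remain valid gzip.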
The following commands need to be run on the CHPC to load the required modules before running the pipeline:
- module purge
- module load chpc/BIOMODULES
- module load dorado
- module load samtools/1.9
- module load nanoplot
- module load nanofilt
- module load flye/2.9
- module load minimap2
- module load racon/1.5.0
- module load medaka/1.11.3
- module load quast/4.6.3
- module load busco/5.4.5
- module load bbmap/38.95
- module load metaeuk
- module load python
- module load R
- module load KMC
- module load nextflow/23.10.0-all
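After loading the modules, it can help to confirm that the tools are actually on the PATH before starting a long run. The helper below is a hedged sketch: `missing_tools` is a hypothetical function, and the executable names in the usage note are assumptions about what each module provides, so adjust them to match the CHPC installation.

```shell
# Sketch only: print the name of every command in the argument list
# that cannot be found on the current PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

A call such as `missing_tools dorado samtools NanoPlot NanoFilt flye minimap2 racon medaka_consensus quast.py busco nextflow` prints nothing when every listed command is available, and otherwise prints one missing name per line.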
The following models and databases need to be downloaded before running the pipeline:
- Dorado: dorado download --model [email protected]
- BUSCO: busco --download eukaryota_odb10
With Nextflow installed, users can obtain the workflow and see its options by running:
- nextflow run --help
The primary outputs of the pipeline include:
- A FASTQ quality control report
- Three assembled FASTA files (from Flye, Racon and Medaka)
- A BUSCO report
- A QUAST quality report