This pipeline is designed for the de novo genome assembly and analysis of high-quality marine vertebrate genomes as part of the Minderoo OceanOmics Ocean Genomes Project. It processes raw HiFi and Hi-C data, performs assembly, scaffolding and decontamination, generates key assembly statistics, and prepares the genome for manual curation using Pretext maps.
- Filter and convert BAM files to FASTQ files (HiFiAdapterFilt)
- PacBio read QC (FastQC)
- Count k-mers (Meryl)
- Estimate genome size (GenomeScope2)
- Assemble HiFi data (Hifiasm)
- Assembly stats on HiFi data (Gfastats)
- Illumina read QC (FastQC)
- Assemble PacBio & Illumina reads (Hifiasm)
- Assembly stats (Gfastats)
- Gene assembly QC (BUSCO)
- K-mer assembly QC (Merqury)
- Create index (Samtools)
- Index assembly and align Hi-C reads (BWA)
- Map pairs (Pairtools)
- Sort and index (Samtools)
- Create scaffold (YAHS)
- Create decontamination report (fcs-gx)
- Create decontamination report (Tiara)
- Filter decontaminated scaffolds (BBMap)
- Scaffold stats (Gfastats)
- Scaffold QC (BUSCO)
- Scaffold QC (Merqury)
- Generate coverage tracks (minimap2)
- Generate coverage tracks (bedtools)
- Predict telomere locations (tidk)
- Align reads to scaffolds (BWA)
- Align reads to scaffolds (Pairtools)
- Generate pretext maps (PretextMap)
- Inject coverage tracks into pretext map (PretextGraph)
- Present QC for raw reads (MultiQC)
Note
If you are new to Nextflow, please refer to this page on how to set up Nextflow.
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv:
sample,hifi_dir,hic_dir,version,date,tolid,taxid
OG88,hifi_bams/OG89,hic_fastqs/OG89,hic1,v240101,OG88,163129
OG89,hifi_bams/OG89,,hifi1,v240202,OG88,163129
OG90,hifi_fastqs/OG90,hic_fastqs/OG90,hic1,v240303,OG88,163129
Each row represents a sample. The hifi_dir column must point to a directory that contains BAM or FASTQ files. The hic_dir column can point to a directory containing FASTQ files, or it can be left blank if there is no Hi-C data for the sample. The taxid column is the NCBI taxon ID for the sample.
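Before launching a run, it can help to sanity-check the samplesheet. The sketch below is not part of the pipeline; it is a minimal, hypothetical pre-flight check (the name `check_samplesheet` is made up for illustration). It assumes the header must match the example above exactly, that hic_dir is the only column allowed to be blank, and that taxid is a plain integer, as NCBI taxon IDs are. The pipeline's own validation may be stricter or looser than this.

```python
import csv
import io

# Column names taken from the example samplesheet header above.
EXPECTED_COLUMNS = ["sample", "hifi_dir", "hic_dir", "version", "date", "tolid", "taxid"]


def check_samplesheet(text):
    """Return a list of human-readable problems found in samplesheet text."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Assumption: hic_dir is the only column that may be left blank.
        for column in EXPECTED_COLUMNS:
            if column != "hic_dir" and not (row[column] or "").strip():
                problems.append(f"line {line_no}: '{column}' is empty")
        taxid = (row["taxid"] or "").strip()
        if taxid and not taxid.isdigit():
            problems.append(f"line {line_no}: taxid '{taxid}' is not numeric")
    return problems


if __name__ == "__main__":
    with open("samplesheet.csv", newline="") as handle:
        for problem in check_samplesheet(handle.read()):
            print(problem)
```

An empty result means no problems were found by these checks; it does not confirm that the directories exist or contain the expected files.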
Now, you can run the pipeline using:
nextflow run Computational-Biology-OceanOmics/OceanGenomes-refgenomes \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR> \
--buscodb /path/to/buscodb \
--gxdb /path/to/gxdb \
--binddir /scratch \
--tempdir <tempdir> \
-c pawsey_profile.config \
-resume \
-with-report
This repository contains a custom config file to run the pipeline on the Pawsey supercomputer with Slurm.
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
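As an illustration of the -params-file route, the following sketch writes the pipeline parameters from the run command above to a JSON file, one of the formats Nextflow accepts for -params-file. The parameter names mirror the CLI flags shown earlier; the values for outdir and tempdir, the helper name `render_params`, and the file name params.json are placeholders chosen for this example.

```python
import json

# Parameter names mirror the CLI flags in the run command above;
# every value is a placeholder to replace with your own paths.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",        # hypothetical output directory
    "buscodb": "/path/to/buscodb",
    "gxdb": "/path/to/gxdb",
    "binddir": "/scratch",
    "tempdir": "/scratch/tmp",  # hypothetical temp directory
}


def render_params(values):
    """Serialise pipeline parameters as JSON text for a Nextflow params file."""
    return json.dumps(values, indent=2) + "\n"


if __name__ == "__main__":
    with open("params.json", "w") as handle:
        handle.write(render_params(params))
```

The resulting file can then replace the individual double-dash parameter flags, for example by passing -params-file params.json alongside -profile and -c pawsey_profile.config.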
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
For details about the output files and reports, please refer to the output documentation.
Computational-Biology-OceanOmics/OceanOmics-OceanGenomes-ref-genomes was originally adapted from the Vertebrate Genomes Project (VGP) Galaxy pipeline (https://galaxyproject.org/projects/vgp/) by Emma de Jong and was converted to Nextflow by Adam Bennett and Lauren Huet. This version was built on top of the nf-core template.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.