LaurenHuet/OceanOmics-OceanGenomes-ref-genomes

Nextflow run with conda run with docker run with singularity Launch on Seqera Platform

Introduction

This pipeline performs de novo genome assembly and analysis of high-quality marine vertebrate genomes as part of the Minderoo OceanOmics Ocean Genomes Project. It processes raw HiFi and Hi-C data, performs assembly, scaffolding, and decontamination, generates key assembly statistics, and prepares the genome for manual curation using Pretext maps.

OceanOmics Reference Genome Pipeline Overview

  1. Filter and convert bam files to fastq files (HiFiAdapterFilt)
  2. PacBio Read QC (FastQC)
  3. Count k-mers (Meryl)
  4. Estimate genome size (GenomeScope2)
  5. Assemble HiFi data (Hifiasm)
  6. Assembly stats on HiFi data (Gfastats)
  7. Illumina Read QC (FastQC)
  8. Assemble PacBio & Illumina reads (Hifiasm)
  9. Assembly stats (Gfastats)
  10. Gene assembly QC (BUSCO)
  11. K-mer assembly QC (Merqury)
  12. Create index (Samtools)
  13. Index assembly and align Hi-C reads (BWA)
  14. Map pairs (Pairtools)
  15. Sort and index (Samtools)
  16. Create scaffold (YAHS)
  17. Create decontamination report (fcs-gx)
  18. Create decontamination report (Tiara)
  19. Filter decontaminated scaffolds (BBMap)
  20. Scaffold stats (Gfastats)
  21. Scaffold QC (BUSCO)
  22. Scaffold QC (Merqury)
  23. Generate coverage tracks (minimap2)
  24. Generate coverage tracks (bedtools)
  25. Predict telomere locations (tidk)
  26. Align reads to scaffolds (BWA)
  27. Align reads to scaffolds (Pairtools)
  28. Generate pretext maps (PretextMap)
  29. Inject coverage tracks into pretext map (PretextGraph)
  30. Present QC for raw reads (MultiQC)

Usage

Note

If you are new to Nextflow, please refer to this page on how to set up Nextflow.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,hifi_dir,hic_dir,version,date,tolid,taxid
OG88,hifi_bams/OG88,hic_fastqs/OG88,hic1,v240101,OG88,163129
OG89,hifi_bams/OG89,,hifi1,v240202,OG88,163129
OG90,hifi_fastqs/OG90,hic_fastqs/OG90,hic1,v240303,OG88,163129

Each row represents a sample. The hifi_dir column must point to a directory that contains bam files or fastq files. The hic_dir column can point to a directory containing fastq files; leave it blank if there is no Hi-C data for the sample. The taxid column is the NCBI taxon ID for that sample.
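
Before launching, it can help to sanity-check the samplesheet layout. The following is a minimal sketch, not part of the pipeline: check_samplesheet is a hypothetical helper that only checks the header, the column count, that sample and hifi_dir are filled in, and that taxid is numeric. It assumes a plain comma-separated file with Unix line endings and no quoted fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline): checks only the layout
# described above, and does not replace the pipeline's own input validation.
check_samplesheet() {
  awk -F',' '
    NR == 1 {
      # The header must match the documented column order exactly.
      if ($0 != "sample,hifi_dir,hic_dir,version,date,tolid,taxid") {
        print "line 1: unexpected header: " $0; bad = 1
      }
      next
    }
    /^[ \t]*$/       { next }   # ignore blank lines
    NF != 7          { print "line " NR ": expected 7 columns, found " NF; bad = 1; next }
    $1 == ""         { print "line " NR ": sample is empty"; bad = 1 }
    $2 == ""         { print "line " NR ": hifi_dir is empty"; bad = 1 }
    $7 !~ /^[0-9]+$/ { print "line " NR ": taxid is not a number: " $7; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Example usage:
#   check_samplesheet samplesheet.csv && echo "samplesheet layout looks OK"
```

An empty hic_dir is accepted, since that column may be left blank. The helper does not check that the directories exist.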

Now, you can run the pipeline using:

nextflow run Computational-Biology-OceanOmics/OceanGenomes-refgenomes \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR> \
   --buscodb /path/to/buscodb \
   --gxdb /path/to/gxdb \
   --binddir /scratch \
   --tempdir <tempdir> \
   -c pawsey_profile.config \
   -resume \
   -with-report

This repository contains a custom config file (pawsey_profile.config, passed with -c in the command above) for running the pipeline on the Pawsey supercomputer with Slurm.

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs. For more details and further functionality, please refer to the usage documentation and the parameter documentation.
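
As an illustration of the -params-file route, the sketch below writes the parameters from the run command above into a YAML file. The parameter names come from that command. The file name params.yaml and the values "results" and "/path/to/tempdir" are placeholders assumed for this example.

```shell
# Write the pipeline parameters to a YAML file.
# All values are placeholders; replace them with paths for your system.
cat > params.yaml <<'EOF'
input: "samplesheet.csv"
outdir: "results"
buscodb: "/path/to/buscodb"
gxdb: "/path/to/gxdb"
binddir: "/scratch"
tempdir: "/path/to/tempdir"
EOF

# The file can then replace the individual -- parameters, for example:
#   nextflow run Computational-Biology-OceanOmics/OceanGenomes-refgenomes \
#     -profile singularity \
#     -params-file params.yaml \
#     -c pawsey_profile.config \
#     -resume
```

Only pipeline parameters go in the params file. Executor and resource settings stay in the config file passed with -c.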

Pipeline output

For details about the output files and reports, please refer to the output documentation.

Credits

Computational-Biology-OceanOmics/OceanOmics-OceanGenomes-ref-genomes was originally adapted from the Vertebrate Genomes Project (VGP) Galaxy pipeline (https://galaxyproject.org/projects/vgp/) by Emma de Jong and was converted to Nextflow by Adam Bennett and Lauren Huet. This version was built on top of the nf-core template.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.