Introduction

EMC/MetaMicrobes is a bioinformatics pipeline that analyzes microbial signatures and AMR genes in metagenomic or metatranscriptomic data.

As input, it requires a samplesheet with paths to paired-end, short-read, compressed FASTQ files. The pipeline performs quality control and trimming on the reads, filters out reads that map to a specified host reference genome, and taxonomically classifies the remaining reads. It also detects antimicrobial resistance (AMR) genes using two different approaches. As output, you receive all intermediate files, a BIOM file with the classifications, and a MultiQC report covering the QC metrics and the tools used.

An overview of the steps implemented in MetaMicrobes is shown in the figure below:

[Figure: metro chart overview of the MetaMicrobes pipeline]

The pipeline includes the following steps:

  1. Quality control (Fastp and FastQC)
  2. Filter out reads mapping to a reference genome (BWA-MEM2 and Samtools)
  3. Summarize mapping statistics (Samtools)
  4. Convert SAM file to FASTQ (Samtools)
  5. Detect AMR genes based on Hidden Markov Models (fARGene)
  6. Taxonomic classification (Kraken2)
  7. Visualize Kraken2 output with Krona (KrakenTools and Krona)
  8. Re-estimation of microbial abundances (Bracken)
  9. Convert Kraken2 and Bracken outputs to BIOM (Kraken-biom)
  10. Decontaminate based on a blacklist and whitelist (QIIME2)
  11. Visualize microbial profiles with barcharts and heatmaps (QIIME2)
  12. Assess microbial alpha and beta diversity (QIIME2)
  13. Generate report with quality metrics and used tools (MultiQC)

Usage

To use MetaMicrobes on your machine, follow the steps below:

  1. Make sure you have correctly set up Nextflow and its dependencies

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow.

  2. Clone this GitHub repository
  3. Prepare a samplesheet (samplesheet.csv) like the example below:

    sample,fastq_1,fastq_2
    CONTROL_1,BR_PVP_0705_R1.fastq.gz,BR_PVP_0705_R2.fastq.gz
    
    Each row represents a pair of FASTQ files.

Tip

If you don't have data available yet, or you want to test the pipeline on a small dataset first, use the data that comes with this repo. This data is subsampled from three RNA-seq samples with varying host content, created by Marques et al.

Tip

You can use the "samplesheeter.py" script that comes with this repo, a small command line tool that prepares the samplesheet for you based on a supplied data directory.
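
To make the samplesheet format concrete, here is a minimal shell sketch that prints one from a directory of FASTQ files. It is not the repo's "samplesheeter.py", and the function name `make_samplesheet` is hypothetical. It assumes mates are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz` and takes the sample name from the file name, which may not match your own naming scheme.

```bash
# Hypothetical helper: print a samplesheet for every <name>_R1.fastq.gz
# in the given directory that has a matching <name>_R2.fastq.gz.
make_samplesheet() {
    local data_dir="$1"
    local r1 r2 sample
    echo "sample,fastq_1,fastq_2"
    for r1 in "$data_dir"/*_R1.fastq.gz; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "warning: no R2 mate for $r1, skipping" >&2
            continue
        fi
        # Assumption: the sample name is the file name without the R1 suffix.
        sample="$(basename "$r1" _R1.fastq.gz)"
        echo "${sample},${r1},${r2}"
    done
}
```

For example, `make_samplesheet /path/to/fastq_dir > samplesheet.csv` would write one row per complete pair and warn on stderr about any R1 file without a mate.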

  4. Download a FASTA file containing the reference genome you want to use for host depletion, for example GRCh38.

    Optionally, create a BWA-MEM2 index of this reference file and build your preferred Kraken2/Bracken database. If you don't supply these to the pipeline, MetaMicrobes will index your reference genome for you and build the Kraken2/Bracken standard database.

  5. Now, you can run the MetaMicrobes pipeline using:

    nextflow run <path/to/EMC-MetaMicrobes/directory/> \
       -profile <docker/singularity/conda/.../institute> \
       --input samplesheet.csv \
       --outdir <OUTDIR> \
       --fasta <path/to/reference_genome_fasta>

Tip

Save time by changing the default "null" values in "nextflow.config" to the paths you use most often. Values in this file are overridden by any values you specify on the command line.
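
As an alternative to editing "nextflow.config", Nextflow can also read parameters from a YAML file passed with `-params-file`. The sketch below writes such a file for the three parameters used in the basic command above. The function name `write_params` and the file name `params.yaml` are hypothetical and not part of this repo.

```bash
# Hypothetical helper: write the input, outdir and fasta parameters
# to a YAML file that can be passed to Nextflow with -params-file.
write_params() {
    local out_file="$1" input="$2" outdir="$3" fasta="$4"
    cat > "$out_file" <<EOF
input: "${input}"
outdir: "${outdir}"
fasta: "${fasta}"
EOF
}
```

After `write_params params.yaml samplesheet.csv <OUTDIR> <path/to/reference_genome_fasta>`, the run command shortens to `nextflow run <path/to/EMC-MetaMicrobes/directory/> -profile <docker/singularity/conda/.../institute> -params-file params.yaml`. Parameters given directly on the command line still take precedence over those in the file.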

If you have a pre-built BWA-MEM2 index or Kraken2/Bracken database, use a command like this:

    nextflow run <path/to/EMC-MetaMicrobes/directory/> \
       -profile <docker/singularity/conda/.../institute> \
       --input samplesheet.csv \
       --outdir <OUTDIR> \
       --fasta <path/to/reference_genome_fasta> \
       --bwamem2_index <path/to/bwa_mem2_index> \
       --kraken2_db <path/to/kraken2_db> \
       --bracken_db <path/to/bracken_db>
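
If you do not have these resources yet and want to build them yourself, the sketch below wraps the standard build commands of each tool. The function names and the `DRY_RUN` switch are hypothetical. The default of 4 threads, the read length of 150, and the k-mer length of 35 (the Kraken2 default) are assumptions to adjust for your data. How MetaMicrobes expects the resulting paths to be laid out is not covered here, so check the pipeline's parameter documentation before relying on this.

```bash
# Run a command, or only print it when DRY_RUN=1 (useful for checking first).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build a BWA-MEM2 index; the index files are written next to the FASTA.
# $1 = reference FASTA
build_bwamem2_index() {
    run bwa-mem2 index "$1"
}

# Download and build the Kraken2 standard database.
# $1 = database directory, $2 = threads (assumed default: 4)
build_kraken2_standard_db() {
    run kraken2-build --standard --threads "${2:-4}" --db "$1"
}

# Add Bracken k-mer distribution files to an existing Kraken2 database.
# $1 = Kraken2 database directory, $2 = read length (assumed default: 150),
# $3 = threads (assumed default: 4)
build_bracken_db() {
    run bracken-build -d "$1" -t "${3:-4}" -k 35 -l "${2:-150}"
}
```

Setting `DRY_RUN=1` before calling, for example, `build_bracken_db <path/to/kraken2_db> 100` prints the command instead of running it. Building the Kraken2 standard database can take substantial time, memory and disk space.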

If you want to change anything related to the QIIME2 downstream analysis or fARGene, use a command like this:

```bash
nextflow run <path/to/EMC-MetaMicrobes/directory/> \
   -profile <docker/singularity/conda/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR> \
   --fasta <path/to/reference_genome_fasta> \
   --bwamem2_index <path/to/bwa_mem2_index> \
   --kraken2_db <path/to/kraken2_db> \
   --bracken_db <path/to/bracken_db> \
   --whitelist <path/to/custom_whitelist> \
   --blacklist <path/to/custom_blacklist> \
   --sampling_dept 1000 \
   --metadata <path/to/metadata> \
   --fargene_hmmmodel "class_b_1_2"
```

Credits

EMC/MetaMicrobes was originally written by Birgit Rijvers.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

A list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.