EMC/MetaMicrobes is a bioinformatics pipeline that analyzes microbial signatures and antimicrobial resistance (AMR) genes in metagenomic or metatranscriptomic data.
As input, it requires a samplesheet with paths to paired-end, short-read, compressed FASTQ files. The pipeline performs quality control and trimming on the reads, filters out reads mapping to a specified host reference genome, and taxonomically classifies the remaining reads. It also detects AMR genes with two different approaches. As output, you receive all intermediate files as well as a BIOM file with the classifications and a MultiQC report of the QC metrics and tools used.
An overview of the steps implemented in MetaMicrobes is shown in the figure below:
The pipeline includes the following steps:
- Quality control (Fastp and FastQC)
- Filter out reads mapping to a reference genome (BWA-MEM2 and Samtools)
- Summarize mapping statistics (Samtools)
- Convert SAM file to FASTQ (Samtools)
- Detect AMR genes based on Hidden Markov Models (fARGene)
- Taxonomic classification (Kraken2)
- Visualize Kraken2 output with Krona (KrakenTools and Krona)
- Re-estimation of microbial abundances (Bracken)
- Convert Kraken2 and Bracken outputs to BIOM (Kraken-biom)
- Decontaminate based on a blacklist and whitelist (QIIME2)
- Visualize microbial profiles with barcharts and heatmaps (QIIME2)
- Assess microbial alpha and beta diversity (QIIME2)
- Generate report with quality metrics and used tools (MultiQC)
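To make the host-depletion steps in the list above more concrete (map against the host, keep unmapped pairs, convert back to FASTQ), here is a minimal sketch of how they are commonly chained outside a pipeline. The function name `host_deplete`, its argument order, and the choice of flags are assumptions for illustration; the exact options MetaMicrobes passes to BWA-MEM2 and Samtools are not stated here and may differ.

```bash
# Sketch only, not the pipeline's own code.
# Keeps read pairs where neither mate maps to the host reference.
# Arguments: <bwa-mem2 index prefix> <R1.fastq.gz> <R2.fastq.gz> <output prefix> [threads]
host_deplete() {
    local index="$1" r1="$2" r2="$3" out="$4" threads="${5:-4}"
    # -f 12  : read unmapped AND mate unmapped
    # -F 256 : drop secondary alignments
    bwa-mem2 mem -t "$threads" "$index" "$r1" "$r2" \
        | samtools view -b -f 12 -F 256 - \
        | samtools fastq -1 "${out}_R1.fastq.gz" -2 "${out}_R2.fastq.gz" \
            -0 /dev/null -s /dev/null -n -
}

# Usage (requires bwa-mem2 and samtools on PATH):
#   host_deplete <path/to/bwa_mem2_index> sample_R1.fastq.gz sample_R2.fastq.gz sample_nonhost
```

Requiring both mates to be unmapped is the stricter of the usual choices; it discards pairs in which only one mate hits the host.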
To use MetaMicrobes on your machine, follow the steps below:
- Make sure you have correctly set up Nextflow and its dependencies
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow.
- Clone this GitHub repository
- Prepare a samplesheet like the example below:
samplesheet.csv:

    sample,fastq_1,fastq_2
    CONTROL_1,BR_PVP_0705_R1.fastq.gz,BR_PVP_0705_R2.fastq.gz

Each row represents a pair of FASTQ files.
Tip
If you don't have data available yet, or you want to test the pipeline on a small dataset first, use the data that comes with this repo. This data is subsampled from three RNA-seq samples with varying host content, created by Marques et al.
Tip
You can use the "samplesheeter.py" script that comes with this repo, a small command line tool that prepares the samplesheet for you based on a supplied data directory.
- Download a FASTA file containing the reference genome you want to use for host depletion, for example GRCh38.
  Optionally, create a BWA-MEM2 index of this reference file and build your preferred Kraken2/Bracken database. If you don't supply these to the pipeline, MetaMicrobes will index your reference genome for you and build the Kraken2/Bracken standard database.
- Now you can run the MetaMicrobes pipeline using:

    nextflow run <path/to/EMC-MetaMicrobes/directory/> \
      -profile <docker/singularity/conda/.../institute> \
      --input samplesheet.csv \
      --outdir <OUTDIR> \
      --fasta <path/to/reference_genome_fasta>
Tip
Save time by changing the default "null" values in "nextflow.config" to the paths you will use most often. Values in this file are overridden by the values specified on the command line.
If you have a pre-built bwa-mem2 index or Kraken2/Bracken database, use a command like this:
    nextflow run <path/to/EMC-MetaMicrobes/directory/> \
      -profile <docker/singularity/conda/.../institute> \
      --input samplesheet.csv \
      --outdir <OUTDIR> \
      --fasta <path/to/reference_genome_fasta> \
      --bwamem2_index <path/to/bwa_mem2_index> \
      --kraken2_db <path/to/kraken2_db> \
      --bracken_db <path/to/bracken_db>
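If you want to pre-build the BWA-MEM2 index and the Kraken2/Bracken database yourself, the sketch below shows one way to do it with the tools' own build commands. The helper names `run` and `build_references`, the `DRY_RUN` switch, and the default read length and thread count are assumptions made for this example, not part of MetaMicrobes. It also assumes Bracken's files are written into the Kraken2 database directory, which is what `bracken-build` does; whether the pipeline expects `--bracken_db` to point at that same directory is not stated here.

```bash
# run CMD...: print the command, then execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Sketch only. Arguments:
#   <reference.fasta> <bwa-mem2 index prefix> <kraken2 db dir> [read length] [threads]
build_references() {
    local fasta="$1" index_prefix="$2" kraken_db="$3"
    local read_len="${4:-100}" threads="${5:-8}"

    # BWA-MEM2 index of the host reference
    run bwa-mem2 index -p "$index_prefix" "$fasta"

    # Kraken2 standard database (large download, needs a lot of disk and RAM)
    run kraken2-build --standard --db "$kraken_db" --threads "$threads"

    # Bracken k-mer distribution; -k 35 matches Kraken2's default k-mer
    # length, -l should be set to your read length
    run bracken-build -d "$kraken_db" -t "$threads" -k 35 -l "$read_len"
}

# Usage: set DRY_RUN=1 first to preview the commands without running them, e.g.
#   build_references <path/to/reference_genome_fasta> <path/to/bwa_mem2_index> <path/to/kraken2_db> 150 16
```

Previewing with `DRY_RUN=1` is worthwhile because building the standard Kraken2 database takes a long time.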
If you want to change anything related to the QIIME2 downstream analysis or fARGene, use a command like this:
```bash
nextflow run <path/to/EMC-MetaMicrobes/directory/> \
-profile <docker/singularity/conda/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR> \
--fasta <path/to/reference_genome_fasta> \
--bwamem2_index <path/to/bwa_mem2_index> \
--kraken2_db <path/to/kraken2_db> \
--bracken_db <path/to/bracken_db> \
--whitelist <path/to/custom_whitelist> \
--blacklist <path/to/custom_blacklist> \
--sampling_dept 1000 \
--metadata <path/to/metadata> \
--fargene_hmmmodel "class_b_1_2"
```
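For the samplesheet step earlier in this guide, the following is a minimal sketch of building the file from a directory of FASTQ files in plain shell. It is not the repository's "samplesheeter.py" script and makes no claim about how that script works. The function name `make_samplesheet` is hypothetical, and the sketch assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, with the sample name taken from the file name.

```bash
# Hypothetical helper: print a samplesheet (sample,fastq_1,fastq_2) for every
# *_R1.fastq.gz file in a directory that has a matching *_R2.fastq.gz mate.
make_samplesheet() {
    local data_dir="$1" r1 r2 sample
    echo "sample,fastq_1,fastq_2"
    for r1 in "$data_dir"/*_R1.fastq.gz; do
        # If nothing matches, the glob stays literal; skip it
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "warning: no R2 mate for $r1, skipping" >&2
            continue
        fi
        sample="$(basename "$r1" _R1.fastq.gz)"
        echo "$sample,$r1,$r2"
    done
}

# Usage:
#   make_samplesheet <path/to/fastq_dir> > samplesheet.csv
```

Files without a mate are reported on stderr and left out, so the resulting sheet only contains complete pairs.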
EMC/metamicrobes was originally written by Birgit Rijvers.
If you would like to contribute to this pipeline, please see the contributing guidelines.
A list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.