Metagenomic taxonomic classification benchmarking
This package contains snakemake workflows, a jupyter notebook and a python package. This
directory can be used as the snakemake workflow directory (the easiest setup), but you can also
symlink in the `Snakefile`, `rules/` and `config.yaml` paths to keep the workflow separate from
the data.
- The `fastq/` folder should contain input fastq datasets compressed in `.bz2` format.
- The `db/` folder contains metagenomic databases, which can be downloaded from `gs://metax-bakeoff-2019` using `gsutil`.
- The `src/` folder contains code for classifiers that don't have a conda package available. `src/local/` is a local installation prefix akin to `/usr/local/`, and contains compiled code (compatible with Ubuntu 16.04). Separate packages are installed under `src/local/stow` if they are not installed to that prefix directly. Packages installed here may need to be recompiled on a different OS. All of the remaining packages are installed via `conda`. There is one conda environment, `metax`, based on python3 that serves as the default environment, while the `metax_py2` environment is based on python2 and is only used to run the specific classifiers that need it.
- The `envs/` directory contains environment yaml files that fully specify these two conda environments, while the `conda_environment*.txt` files are their human-editable counterparts, similar to `requirements.txt` files, with only a subset of critical packages version-pinned and no transitive dependencies included.
- The `rules/` directory contains snakemake rules for each method.
- The `tmp/` folder is the default local folder for storing temporary files needed by various methods.
- The `log/` folder contains stdout/stderr logs.
- The `time/` folder contains `/usr/bin/time -v` logs.
- The `benchmark/` folder contains snakemake benchmarking tsvs.
- The `data/` folder is intended for storing the raw outputs of classifiers, typically with a filename like `<sample_name>.<classifier>.<database>.tsv.gz`. These raw outputs are typically individual read classifications, so even compressed they are O(n) in the size of the input fastqs.
- The `reports/` folder is intended for storing the taxonomy reports, which typically contain one line per taxon with its abundance or number of classified reads per sample. These files are much smaller, O(1) in size relative to the inputs, and follow a filename format similar to `data/`.
- The `info/` and `summary/` folders contain output summary data for datasets, such as the `*.total_reads.txt` files for the number of reads in the input fastqs, as well as files like `summary/classified_count.tsv` for the number of classified reads when they are not determinable from output reports.
- The `plotting/` folder contains `compiled_reports.tsv`, the master tsv of all classification results generated by `metax compile`, as well as a few other summary files. It also contains the jupyter notebook that produces the final plots.
A default `config.yaml` exists in `config/config.yaml`. This config assumes that everything in the
workflow will be placed relative to this git directory root, including all of the databases,
inputs, outputs, code and non-conda software. The config also specifies the locations of the
various databases, so pre-existing databases don't have to be downloaded from Google Cloud Storage,
but this also means that downloading databases and running classifiers are separate snakemake
invocations. A practical setting to change in the config is `TMPDIR`, which is `tmp/` by default
but can be pointed at another physical drive to improve performance.
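For example, a minimal override in `config.yaml` might look like the following sketch; `TMPDIR` is the key named above, while the path value is purely illustrative:

```yaml
# Point temporary files at a faster physical drive instead of the default tmp/.
TMPDIR: /mnt/fast-ssd/metax-tmp
```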
- Install miniconda (https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html). Install gsutil (https://cloud.google.com/storage/docs/gsutil_install) for downloading the already hosted databases.
- Create two conda environments from the `envs/` directory: `conda env create -n metax --file envs/py3.yaml` and `conda env create -n metax_py2 --file envs/py2.yaml`.
- Activate the `metax` conda environment.
- Install this package itself as a python package: `pip install .`.
- Install remaining software into `src/` using `snakemake download_src`. These packages may need to be recompiled.
- Download any desired prerequisite databases using a snakemake rule. To download all databases, use `snakemake download_db_all`. To download a specific database, use the general form `snakemake db/[default|refseqc]/[method]`, such as `snakemake db/refseqc/kraken` or `snakemake db/default/kraken`.
- Download existing fastq datasets using `snakemake download_fastq_all`.
- Run snakemake rules to classify samples using rules such as `snakemake kraken_default_all`. Alternatively, all classifiers can be run on all samples by running `snakemake all`, but note the caveats.[^1]
- Run the snakemake rule to aggregate reports, read counts, and classified counts: `snakemake compile_reports`. This will generate the `plotting/compiled_reports.tsv` file with all of the samples and classifications.
- Run the jupyter notebook `plotting/Metagenomics Bench.ipynb` to generate all plots.
To add a new classifier to compare against the existing results, the easiest way is to append
the new method's classification results to the master `plotting/compiled_reports.tsv` file and
then rerun the jupyter notebook. Note that this will not add the compute time/resource
benchmarks. The columns in this file are:
- `sample`: Sample name.
- `classifier`: Classifier name.
- `database`: Name of the database, usually 'default' or 'refseqc'.
- `taxid`: NCBI Taxonomy ID identified.
- `cum_abundance`: Cumulative abundance (out of 1) of classified reads assigned to this taxid and its children.
- `cls_cum_abundance`: Cumulative abundance (out of total classified abundance).
- `unique_abundance`: Specific abundance classified to this taxid, not including children.
- `cls_unique_abundance`: Specific abundance (out of total classified abundance).
- `classrank`: The taxonomy rank profile level setting of the classifier: 'species', 'genus', or 'all' if the classifier assigns abundances to all levels of the taxonomy simultaneously.
- `rank`: Taxonomy rank, e.g. 'species', 'genus', etc.
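As a sketch, rows of `compiled_reports.tsv` can be read with the standard library. The values in this excerpt are invented, but the columns match the list above:

```python
import csv
import io

# Hypothetical one-row excerpt of plotting/compiled_reports.tsv;
# the values are made up, the columns match the list above.
tsv_text = (
    "sample\tclassifier\tdatabase\ttaxid\tcum_abundance\t"
    "cls_cum_abundance\tunique_abundance\tcls_unique_abundance\tclassrank\trank\n"
    "sim1\tkraken\tdefault\t562\t0.12\t0.15\t0.10\t0.13\tall\tspecies\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
row = rows[0]
# Filter to species-level assignments, the granularity most plots use.
species = [r for r in rows if r["rank"] == "species"]
print(row["classifier"], float(row["cum_abundance"]))  # → kraken 0.12
```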
Because the NCBI taxonomy tree and IDs are constantly changing, you should use the same taxonomy
database used for all other methods, which can be acquired via `snakemake db/taxonomy`.
To add additional classifier methods, the best way is to mimic an existing method. Create a
new snakemake file `rules/newmethod.smk` and add `include: "rules/newmethod.smk"` to the master
`Snakefile`. For examples, see the other method snakemake files like `kaiju.smk`. Ideally you will
create snakemake rules for `snakemake newmethod_default_all` and `snakemake newmethod_refseqc_all`.
The parser for the raw classified reports is part of the `metax` python package, in the command
`metax compile`. This command processes all of the heterogeneous report files in the `reports/`
folder and works by matching filenames to parser classes. This allows custom report-format
parsing code for different classifiers, but ultimately each parser class has a `parse()` method
that yields python dicts for each taxon identified in a fastq file. These dicts contain mandatory
keys like `classifier`, `rank`, `taxid`, etc., and are eventually turned into tsv rows, so that all
of the rows (taxa × samples) are aggregated into a single harmonized output tsv for ingestion by
the jupyter notebook.
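The parser interface itself isn't reproduced in this README; as a hypothetical sketch, a minimal parser honoring the `parse()` contract described above (yield one dict per taxon with the mandatory keys) could look like this. The class name, constructor, and input format are illustrative, not the actual `metax` API:

```python
# Hypothetical minimal parser: yields one dict per taxon with mandatory keys.
class TsvParser:
    CLASSIFIER = 'newmethod'

    def __init__(self, lines):
        self.lines = lines  # iterable of "taxid<TAB>rank<TAB>reads" rows

    def parse(self):
        for line in self.lines:
            taxid, rank, reads = line.rstrip('\n').split('\t')
            yield {
                'classifier': self.CLASSIFIER,
                'taxid': int(taxid),
                'rank': rank,
                'reads': int(reads),
            }

rows = list(TsvParser(['562\tspecies\t100']).parse())
```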
In `metax/reporter.py`, create an additional class `NewmethodParser()`. The easiest approach is to
subclass an existing parser if your report output has a similar enough format to an existing
method, such as Kraken-like outputs and simple tsvs. For more complicated formats you will need
specialized parsing code, which may require a full read of the report if things like abundance
cannot be calculated via streaming. For example, here is the `MegablastParser` and the
corresponding line where it is registered with the `Matchers` object.
```python
class MegablastParser(KrakenParser):
    MATCH_GLOB = '*.megablast.*.txt'
    MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.megablast\.(?P<database>[^.]*)\.txt')
    CLASSIFIER = 'megablast'
    is_paired_count = False

class Matchers:
    @classmethod
    def create(cls, processor):
        ...
        m.add('megablast', MegablastParser(processor))
```
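The filename matching can be checked directly. Using the `MATCH_RE` pattern from `MegablastParser` above against a report filename following the `<sample_name>.<classifier>.<database>` convention:

```python
import re

# The same pattern as MegablastParser.MATCH_RE above.
MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.megablast\.(?P<database>[^.]*)\.txt')

m = MATCH_RE.match('sim1.megablast.refseqc.txt')
print(m.group('fn_prefix'), m.group('database'))  # → sim1 refseqc
```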
After the `snakemake compile_reports` step, the `Metagenomics Bench.ipynb` notebook also needs to
be modified to incorporate the additional method. The `classifiers` global variable in the
notebook needs an additional `<classifier_codename>: <formal_classifier_name>` entry for your
method. Similarly, it needs to be added to `refseqc_classifiers` if it has an appropriate refseqc
database, and kept in `main_classifiers` if it is to be included on all plots.
The source fastas for creating the refseqc databases are located at
`gs://metax-bakeoff-2019/refseqc/fasta/genomic.fna.gz` and
`gs://metax-bakeoff-2019/refseqc/fasta/protein.faa.gz` for DNA and protein databases respectively.
Precomputed mappings of accessions to taxonomy IDs are located at
`gs://metax-bakeoff-2019/refseqc/fasta/genomic.taxids` and
`gs://metax-bakeoff-2019/refseqc/fasta/protein.taxids`.
Compress and place the fastqs as `<sample_name>_1.fastq.bz2` and `<sample_name>_2.fastq.bz2` for
the two paired ends and add the sample name to the `config.yaml` under `samples_pe`. If the file
is single-ended, place it as `<sample_name>.fastq.bz2` instead and add the sample name under
`samples_se`. Also add an entry to `config/compile_config.yaml` under `samples`, mapping the
`<sample_name>` to a dict of metadata. The only required keys are `sample:`, indicating the
desired output name in `compiled_reports.tsv`, and `paired: True` if the sample is paired-end;
otherwise it is assumed to be single-ended.
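As a sketch, the two config entries for a hypothetical paired-end sample named `mysample` might look like:

```yaml
# config.yaml — sample lists (the sample name is illustrative)
samples_pe:
  - mysample   # expects fastq/mysample_1.fastq.bz2 and fastq/mysample_2.fastq.bz2

# config/compile_config.yaml — per-sample metadata
samples:
  mysample:
    sample: My Sample   # output name used in compiled_reports.tsv
    paired: True
```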
Footnotes

[^1]: Snakemake doesn't easily manage batching of jobs that share a common resource, such as a large metagenomic database whose dependent jobs should all run serially so the database doesn't have to be reloaded into memory. To alleviate this and improve performance, run separate executions of snakemake for each method, such as `snakemake kraken_default_all; snakemake kraken_refseqc_all; snakemake diamond_default_all`, etc.