metax

Metagenomic taxonomic classification benchmarking

Overview

This package contains snakemake workflows, a jupyter notebook, and a python package. This directory can be used directly as the snakemake workflow directory (the easiest setup), but you can also symlink the Snakefile, rules/ and config.yaml paths into a separate directory to keep the workflow apart from the data.

Organization

  • The fastq/ folder should contain input fastq datasets compressed in .bz2 format.
  • The db/ folder contains metagenomic databases, which can be downloaded from gs://metax-bakeoff-2019 using gsutil.
  • The src/ folder contains code for classifiers that don't have a conda package available. src/local/ is a local install prefix akin to /usr/local/ and contains compiled code (compatible with Ubuntu 16.04). Packages that are not installed directly to that prefix are installed under src/local/stow. Packages installed here may need to be recompiled on a different OS. All of the remaining packages are installed via conda. There are two conda environments: metax, based on python3, serves as the default environment, while metax_py2, based on python2, is needed by some packages and is only used to run specific classifiers.
  • The envs/ folder contains environment yaml files that fully specify these two conda environments, while the conda_environment*.txt files are their human-editable counterparts, similar to requirements.txt files: only a subset of critical packages are version-pinned, and transitive dependencies are not included.
  • The rules/ directory contains snakemake rules for each method.
  • The tmp/ folder is the default local folder for storing temporary files needed by various methods.
  • The log/ folder contains stdout/stderr logs.
  • The time/ folder contains /usr/bin/time -v logs.
  • The benchmark/ folder contains snakemake benchmarking tsvs.
  • The data/ folder is intended for storing the raw outputs of classifiers, typically with a filename like <sample_name>.<classifier>.<database>.tsv.gz. These raw outputs are typically individual read classifications, so they are O(n) in the size of the input fastqs and stored compressed.
  • The reports/ folder is intended for storing the taxonomy reports, which typically contain one line per taxon with its abundance or number of classified reads per sample. These files are much smaller, roughly O(1) in size compared to the inputs, and have a filename format similar to data/.
  • The info/ and summary/ folders contain output summary data for datasets, such as the *.total_reads.txt files for number of reads in the input fastqs, as well as things like summary/classified_count.tsv for number of classified reads if they are not determinable from output reports.
  • The plotting/ folder contains the compiled_reports.tsv master tsv of all classification results generated by metax compile as well as a few other summary files. It also contains the jupyter notebook that will produce the final plots.
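
As a quick illustration of the data/ and reports/ filename convention described above, the fields can be recovered with a little Python (the sample filename here is hypothetical):

```python
import re

# Filename convention used in data/ and reports/:
# <sample_name>.<classifier>.<database>.tsv.gz
NAME_RE = re.compile(
    r'(?P<sample>[^.]+)\.(?P<classifier>[^.]+)\.(?P<database>[^.]+)\.tsv\.gz$'
)

def parse_output_name(filename):
    """Split an output filename into its sample/classifier/database fields."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f'unrecognized output filename: {filename}')
    return m.groupdict()

fields = parse_output_name('stool_1.kraken.default.tsv.gz')
print(fields)  # {'sample': 'stool_1', 'classifier': 'kraken', 'database': 'default'}
```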

Configuration

A default config.yaml exists in config/config.yaml. This config assumes that everything in the workflow is placed relative to this git directory root, including all of the databases, inputs, outputs, code and non-conda software. The config also specifies the locations of the various databases, so databases that are already present locally don't have to be downloaded from Google Cloud Storage; this also means that downloading databases and running classifiers are separate snakemake invocations. A practical setting to change is TMPDIR, which is tmp/ by default but can be pointed at another physical drive to improve performance.
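
As a rough sketch, the kind of setting involved looks like the following (TMPDIR is described above; the path shown is only an example — see config/config.yaml for the real schema):

```yaml
# Illustrative fragment -- see config/config.yaml for the real schema.
TMPDIR: /mnt/ssd/tmp   # default is tmp/; point at a fast physical drive for performance
```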

Install

  1. Install miniconda https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html. Install gsutil https://cloud.google.com/storage/docs/gsutil_install for downloading the hosted databases.
  2. Create two conda environments from the envs/ directory: conda env create -n metax --file envs/py3.yaml and conda env create -n metax_py2 --file envs/py2.yaml.
  3. Activate the metax conda environment.
  4. Install this package itself as a python package: pip install .
  5. Install remaining software into src/ using snakemake download_src. These packages may need to be recompiled.

Execution

  1. Download any desired prerequisite databases using a snakemake rule. To download all databases, use snakemake download_db_all. To download a specific database, use the general form snakemake db/[default|refseqc]/[method], such as snakemake db/refseqc/kraken or snakemake db/default/kraken.
  2. Download existing fastq datasets using snakemake download_fastq_all.
  3. Run snakemake rules to classify samples using rules such as snakemake kraken_default_all. Alternatively, all classifiers can be run on all samples by running snakemake all, but note the caveat in footnote 1.
  4. Run the snakemake rule to aggregate reports, read counts, and classified counts snakemake compile_reports. This will generate the plotting/compiled_reports.tsv file with all of the samples and classifications.
  5. Run the jupyter notebook plotting/Metagenomics Bench.ipynb to generate all plots.

Adding Classifiers

To add a new classifier to compare against the existing results, the easiest way would be to append the new method's classification results to the master plotting/compiled_reports.tsv file, and then simply run the jupyter notebook. Note that this would not add the compute time/resources benchmark. The columns in this file are:

  • sample: Sample name
  • classifier: Classifier name
  • database: Name of the database, usually 'default' or 'refseqc'.
  • taxid: NCBI Taxonomy ID identified.
  • cum_abundance: Cumulative abundance (out of 1) of classified reads assigned to this taxid and its children.
  • cls_cum_abundance: Cumulative abundance (out of total classified abundance).
  • unique_abundance: Specific abundance classified to this taxid, not including children.
  • cls_unique_abundance: Specific abundance (out of total classified abundance).
  • classrank: The taxonomy rank the classifier was configured to profile at. Either 'species', 'genus', or 'all' if the classifier assigns abundances to all levels of the taxonomy simultaneously.
  • rank: Taxonomy rank e.g. 'species', 'genus' etc.
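
For example, a new method's rows can be appended to the master tsv with standard Python (the file path and all values below are illustrative):

```python
import csv

# Columns of plotting/compiled_reports.tsv as described above.
COLUMNS = [
    'sample', 'classifier', 'database', 'taxid',
    'cum_abundance', 'cls_cum_abundance',
    'unique_abundance', 'cls_unique_abundance',
    'classrank', 'rank',
]

def append_rows(path, rows):
    """Append result rows for a new classifier to the master tsv."""
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter='\t')
        writer.writerows(rows)

# Hypothetical row for a new method:
append_rows('compiled_reports.tsv', [{
    'sample': 'stool_1', 'classifier': 'newmethod', 'database': 'default',
    'taxid': 562, 'cum_abundance': 0.12, 'cls_cum_abundance': 0.15,
    'unique_abundance': 0.10, 'cls_unique_abundance': 0.13,
    'classrank': 'all', 'rank': 'species',
}])
```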

Because the NCBI taxonomy tree and ids are constantly changing, you should use the same taxonomy database used for all other methods, which can be acquired via snakemake db/taxonomy.

To add additional classifier methods, the best way is to mimic an existing method. Create a new snakemake file rules/newmethod.smk and add include: 'rules/newmethod.smk' to the master Snakefile. For examples see the other method Snakefiles like kaiju.smk. Ideally you will create snakemake rules for snakemake newmethod_default_all and snakemake newmethod_refseqc_all.
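
A skeleton for rules/newmethod.smk might look roughly like the following; the rule and path names are illustrative, so mirror an existing file such as kaiju.smk for the real conventions:

```
rule newmethod_classify:
    input:
        fastq='fastq/{sample}.fastq.bz2',
        db='db/{database}/newmethod',
    output:
        'data/{sample}.newmethod.{database}.tsv.gz',
    # shell/run body for the actual classifier invocation goes here

rule newmethod_default_all:
    input:
        expand('data/{sample}.newmethod.default.tsv.gz',
               sample=config['samples_pe']),
```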

Parsing Taxonomic Reports

The parser for the raw classified reports is part of the metax python package, in the command metax compile. This command processes all of the heterogeneous report files in the reports/ folder and works by matching each filename to a parser class. This allows custom report format parsing code for different classifiers, but ultimately, each parser class has a parse() method that yields python dicts for each taxon identified in a fastq file. These dicts contain mandatory keys like classifier, rank, taxid etc., and are eventually turned into tsv rows, so that all of the rows (taxa × sample) are aggregated into a single harmonized output tsv for ingestion by a jupyter notebook.
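
In outline, a parser is just an object whose parse() method yields one dict per taxon. A minimal standalone sketch of that contract (not the real metax base classes; the column layout is invented for illustration):

```python
import csv
import io

class SimpleTsvParser:
    """Sketch of a report parser: yields one dict per taxon.

    The real parsers live in metax/reporter.py; this standalone version
    only illustrates the parse() contract described above.
    """
    CLASSIFIER = 'newmethod'

    def parse(self, fileobj):
        # Hypothetical three-column report: taxid, rank, read count.
        for taxid, rank, reads in csv.reader(fileobj, delimiter='\t'):
            yield {
                'classifier': self.CLASSIFIER,
                'taxid': int(taxid),
                'rank': rank,
                'reads': int(reads),
            }

report = io.StringIO('562\tspecies\t1200\n561\tgenus\t1500\n')
rows = list(SimpleTsvParser().parse(report))
print(rows[0]['taxid'])  # 562
```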

In metax/reporter.py, create an additional class NewmethodParser(). The easiest approach is to subclass an existing parser if your report output has a similar enough format, such as Kraken-like outputs or simple tsvs. For more complicated formats you will need specialized parsing code, which may require a full read of the report if things like abundance cannot be calculated via streaming. For example, here is the MegablastParser and the corresponding line where it is registered to the Matchers object.

class MegablastParser(KrakenParser):
    MATCH_GLOB = '*.megablast.*.txt'
    MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.megablast\.(?P<database>[^.]*)\.txt')
    CLASSIFIER = 'megablast'
    is_paired_count = False

class Matchers:
    @classmethod
    def create(cls, processor):
        ...
        m.add('megablast', MegablastParser(processor))
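
The MATCH_RE above is what routes files in reports/ to this parser; its behavior can be checked in isolation (the filename is a made-up example following the reports/ naming convention):

```python
import re

# Same pattern as MegablastParser.MATCH_RE above.
MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.megablast\.(?P<database>[^.]*)\.txt')

m = MATCH_RE.match('stool_1.megablast.refseqc.txt')
print(m.group('fn_prefix'), m.group('database'))  # stool_1 refseqc
```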

Jupyter Notebook Plots

After the snakemake compile_reports step, the Metagenomics Bench.ipynb notebook also needs to be modified to incorporate additional methods. The classifiers global variable in the notebook needs an additional entry mapping your <classifier_codename> to its <formal_classifier_name>. Similarly, the method needs to be added to refseqc_classifiers if it has an appropriate refseqc database, and to main_classifiers if it is to be included on all plots.
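
Schematically, those notebook globals amount to plain dict and set entries; the codename/formal-name pairs below are only examples:

```python
# Schematic of the notebook globals described above.
classifiers = {
    'kraken': 'Kraken',
    'kaiju': 'Kaiju',
}

# Register a hypothetical new method.
classifiers['newmethod'] = 'NewMethod'
refseqc_classifiers = {'kraken', 'newmethod'}          # methods with a refseqc database
main_classifiers = {'kraken', 'kaiju', 'newmethod'}    # methods included on all plots

print(classifiers['newmethod'])  # NewMethod
```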


Creating refseqc database

The source fastas for creating the refseqc databases are located at gs://metax-bakeoff-2019/refseqc/fasta/genomic.fna.gz and gs://metax-bakeoff-2019/refseqc/fasta/protein.faa.gz for DNA and protein databases respectively. Precomputed mappings of the accession to taxonomy IDs are located at gs://metax-bakeoff-2019/refseqc/fasta/genomic.taxids and gs://metax-bakeoff-2019/refseqc/fasta/protein.taxids.

Adding FASTQ datasets

Compress and place the fastq as <sample_name>_1.fastq.bz2 and <sample_name>_2.fastq.bz2 for the two paired ends and add the sample name to config.yaml under samples_pe. If the file is single-end, place it as <sample_name>.fastq.bz2 instead and add the sample name under samples_se. Also add an entry to config/compile_config.yaml under samples, mapping the <sample_name> to a dict of metadata. The only required keys are sample:, indicating the desired output name in compiled_reports.tsv, and paired: True if the sample is paired-end; otherwise it is assumed to be single-end.
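
Putting the two config entries together, the additions look roughly like this (the sample name and output name are illustrative):

```yaml
# config.yaml
samples_pe:
  - stool_1        # expects fastq/stool_1_1.fastq.bz2 and fastq/stool_1_2.fastq.bz2

# config/compile_config.yaml
samples:
  stool_1:
    sample: Stool 1  # output name used in compiled_reports.tsv
    paired: True
```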

Footnotes

  1. Snakemake doesn't easily manage batching when jobs share a common resource, such as a large metagenomic database whose dependent jobs should all run serially so the database doesn't have to be reloaded into memory. To alleviate this and improve performance, run separate executions of snakemake for each method, such as snakemake kraken_default_all; snakemake kraken_refseqc_all; snakemake diamond_default_all etc.