microbiomedata/mg_annotation

Workflow for Metagenome annotation

This workflow takes assembled metagenomes and generates structural and functional annotations. It is based on the JGI/IMG annotation pipeline ([more details](https://journals.asm.org/doi/10.1128/msystems.00804-20)) and uses a number of open-source tools and databases to generate these annotations.

The input assembly is first split into 10 MB splits, which can be processed in parallel depending on the workflow engine configuration. Each split is first structurally annotated, and those results are then used for the functional annotation. The structural annotation uses tRNAscan-SE, Rfam, CRT, Prodigal and GeneMarkS, and their results are merged to create a consensus structural annotation. The resulting GFF is the input for functional annotation, which uses multiple protein family databases (SMART, COG, TIGRFAM, SUPERFAMILY, Pfam and Cath-FunFam) along with custom HMM models. The functional predictions are generated using LAST and HMM searches. These annotations are also merged into a consensus GFF file. Finally, the per-split annotations are merged to produce a single structural annotation file and a single functional annotation file. In addition, several summary files are generated in TSV format.
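The splitting step described above can be sketched as follows. This is only an illustration, not the pipeline's actual splitter: it assumes that split size is counted in sequence characters and that a contig is never divided across splits, neither of which is stated in this README. The function names `parse_fasta` and `split_records` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(
    records: Iterable[Record], max_bases: int = 10_000_000
) -> Iterator[List[Record]]:
    """Group records into batches of at most max_bases sequence characters.

    Assumption: contigs are kept whole, so a single contig longer than
    max_bases ends up alone in its own oversized batch.
    """
    batch: List[Record] = []
    size = 0
    for header, seq in records:
        if batch and size + len(seq) > max_bases:
            yield batch
            batch, size = [], 0
        batch.append((header, seq))
        size += len(seq)
    if batch:
        yield batch
```

Each batch yielded by `split_records` could then be written to its own FASTA file and annotated independently before the per-split results are merged.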

Running Workflow in Cromwell

Description of the files:

  • .wdl file: the WDL file for workflow definition
  • .json file: the example input for the workflow
  • .conf file: the configuration file for running Cromwell
  • .sh file: the shell script for running the example workflow
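As a rough illustration of how these files fit together, the sketch below assembles a Cromwell `run` command from the WDL, JSON, and conf files. The file names and the location of the Cromwell jar are placeholders chosen for this example; the repository's own .sh script should be treated as the authoritative invocation.

```python
import shlex
from typing import List


def cromwell_command(
    wdl: str, inputs: str, conf: str, jar: str = "cromwell.jar"
) -> List[str]:
    """Build the argument list for running a WDL workflow with Cromwell.

    All paths are caller-supplied placeholders; nothing is executed here.
    """
    return [
        "java",
        f"-Dconfig.file={conf}",  # Cromwell backend configuration (.conf file)
        "-jar",
        jar,
        "run",
        wdl,                      # workflow definition (.wdl file)
        "--inputs",
        inputs,                   # workflow inputs (.json file)
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = cromwell_command("annotation.wdl", "inputs.json", "cromwell.conf")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.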

The Docker image used by the workflow is:

microbiomedata/img-omics:5.2.0

Input files

A JSON file containing the following:

  1. The path to the assembled contigs fasta file
  2. The ID to associate with the result products (e.g. sample ID)
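A minimal sketch of producing such an input file is shown below. The key names `input_fasta` and `project_id`, and the workflow name `annotation` used as their prefix, are hypothetical; the real keys should be copied from the example .json file shipped with the repository.

```python
import json


def build_inputs(
    contigs_fasta: str, project_id: str, workflow: str = "annotation"
) -> str:
    """Return a JSON document holding the two workflow inputs.

    Key names are placeholders, not the workflow's documented inputs.
    """
    if not contigs_fasta or not project_id:
        raise ValueError("both the contigs path and the ID are required")
    inputs = {
        f"{workflow}.input_fasta": contigs_fasta,  # path to assembled contigs
        f"{workflow}.project_id": project_id,      # ID attached to the results
    }
    return json.dumps(inputs, indent=2)


if __name__ == "__main__":
    # Hypothetical path and sample ID, for illustration only.
    print(build_inputs("/data/sample_01/contigs.fna", "sample_01"))
```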

Requirements for Execution (recommendations are in parentheses):

  • WDL-capable Workflow Execution Tool (Cromwell)
  • Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)

Third party software used (+ their licenses)

Databases used (+ their licenses):

  • Rfam (public domain/CC0 1.0; more info)
  • KEGG (paid subscription, getting KOs/ECs indirectly via IMG NR; more info)
  • SMART (restrictive license/custom; more info)
  • COG (copyright/unlicensed; more info)
  • TIGRFAM (copyleft/LGPL 2.0 or later; more info)
  • SUPERFAMILY (permissive/custom; more info)
  • Pfam (public domain/CC0 1.0; more info)
  • Cath-FunFam (permissive/CC BY 4.0; more info)