Use Pre Trained HMMs

Eduard Kerkhoven edited this page Sep 25, 2024 · 9 revisions

Use Pre-Trained Hidden Markov Models

For de novo reconstruction of genome-scale metabolic models from KEGG, the RAVEN function getKEGGModelForOrganism can use KEGG Orthology (KO)-specific hidden Markov models (HMMs) for the homology search. This approach is particularly suitable when the target species is not in the KEGG species list. It does not require a KEGG FTP subscription and is recommended for most users. Across all RAVEN versions, two different pipelines have been used to generate the KO-specific HMM sets:

  • The current pipeline: CD-HIT is used to obtain non-redundant, representative protein sets for each KEGG Orthology. CD-HIT clusters proteins by comparing each sequence against the longest protein in a cluster, using a defined sequence identity threshold. The non-redundant protein sets are then aligned with MAFFT. Finally, the multiple sequence alignments are used as input to HMMER to train the KEGG Orthology-specific HMMs. The HMM archives contain only pre-trained HMMs.
  • The classic pipeline: No longer used since RAVEN 1.9.1 and only relevant for KEGG Release 58.1. No protein clustering was performed before the multiple sequence alignment. The alignment was performed with ClustalW2, and the HMMs were trained with HMMER. The HMM archives contain both pre-trained HMMs and multiple sequence alignment data.
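The three steps of the current pipeline can be sketched as a sequence of command-line calls assembled in MATLAB. This is only an illustration under stated assumptions: the file names and the KO identifier are hypothetical, and the options shown (`-i`, `-o`, `-c` for CD-HIT, `--auto` for MAFFT) are common basic settings, not necessarily the ones used to build the published HMM sets.

```matlab
% Sketch of the CD-HIT -> MAFFT -> HMMER steps for a single KO.
% All file names below are hypothetical placeholders.
koId     = 'K00001';          % example KO identifier (assumption)
identity = 0.9;               % CD-HIT identity threshold (90%)

inFasta  = [koId '.fa'];      % all protein sequences assigned to this KO
nrFasta  = [koId '_nr.fa'];   % non-redundant representatives from CD-HIT
alnFasta = [koId '_aln.fa'];  % multiple sequence alignment from MAFFT
hmmFile  = [koId '.hmm'];     % trained HMM from hmmbuild

cmds = {
    sprintf('cd-hit -i %s -o %s -c %.2f', inFasta, nrFasta, identity)
    sprintf('mafft --auto %s > %s', nrFasta, alnFasta)
    sprintf('hmmbuild %s %s', hmmFile, alnFasta)
};

for i = 1:numel(cmds)
    disp(cmds{i});
    % To actually run a step, the tools must be installed and on the PATH:
    % status = system(cmds{i});
end
```

The commands are only printed here; running them requires the three programs to be installed locally. Users of the pre-trained HMM sets do not need to perform any of these steps.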

The HMM sets are downloaded automatically during model reconstruction from KEGG when the dataDir parameter of getKEGGModelForOrganism is set to one of the dataset names listed below. Alternatively, the download links are provided below. The following HMM sets are available:

| KEGG Release | RAVEN Releases | Dataset/dataDir | Phylogeny | Software Used | CD-HIT Identity |
|---|---|---|---|---|---|
| 105.0 | RAVEN 2.8.0+ | euk90_kegg105 | Eukaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 90% |
| 105.0 | RAVEN 2.8.0+ | prok90_kegg105 | Prokaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 90% |
| 102.0 | RAVEN 2.7.4 - 2.7.9 | euk90_kegg102 | Eukaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 90% |
| 102.0 | RAVEN 2.7.4 - 2.7.9 | prok90_kegg102 | Prokaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 90% |
| 100.0 | RAVEN 2.6.0 - 2.7.3 | euk90_kegg100 | Eukaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 90% |
| 100.0 | RAVEN 2.6.0 - 2.7.3 | prok90_kegg100 | Prokaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 90% |
| 94.0 | RAVEN 2.4.0 - 2.5.3 | euk100_kegg94 | Eukaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 100% |
| 94.0 | RAVEN 2.4.0 - 2.5.3 | euk90_kegg94 | Eukaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 90% |
| 94.0 | RAVEN 2.4.0 - 2.5.3 | euk50_kegg94 | Eukaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 50% |
| 94.0 | RAVEN 2.4.0 - 2.5.3 | prok100_kegg94 | Prokaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 100% |
| 94.0 | RAVEN 2.4.0 - 2.5.3 | prok90_kegg94 | Prokaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 90% |
| 94.0 | RAVEN 2.4.0 - 2.5.3 | prok50_kegg94 | Prokaryote | cd-hit-v4.8.1, mafft-7.490, hmmer-3.3.2 | 50% |
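The dataset names in the table follow a regular pattern: a phylogeny prefix (euk or prok), the CD-HIT identity in percent, and the KEGG release. A minimal sketch of assembling such a name is shown below; the variable names are illustrative and not part of RAVEN, and only the combinations listed in the table actually exist as downloadable sets.

```matlab
% Assemble a dataDir name following the pattern seen in the table.
% Variable names are hypothetical; check the result against the table.
phylogeny   = 'eukaryote';  % 'eukaryote' or 'prokaryote'
identityPct = 90;           % CD-HIT identity in percent
keggRelease = 105;          % KEGG release number

if strcmpi(phylogeny, 'eukaryote')
    prefix = 'euk';
elseif strcmpi(phylogeny, 'prokaryote')
    prefix = 'prok';
else
    error('Unknown phylogeny: %s', phylogeny);
end

dataDir = sprintf('%s%d_kegg%d', prefix, identityPct, keggRelease);
disp(dataDir);
```

With the values above, the resulting name is euk90_kegg105, which matches the eukaryotic set for RAVEN 2.8.0+.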

The model can be reconstructed by running the following command:

model=getKEGGModelForOrganism('abc','inputFasta.fa','euk90_kegg105','outputDirectory',true,true,true,true,10^-50,0.8,0.3,-1,inf,1);

NOTE: Unlike in model reconstruction based on a three- or four-letter KEGG organism code, the first input parameter here is only used for model.id and does not influence the homology search in any way, so any string can be used.
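The long positional argument list in the command above can be easier to read when each value is given a name first. The sketch below does this with the same values; the variable names are assumptions based on the usual argument order of the function and should be verified with `help getKEGGModelForOrganism` for the installed RAVEN version. The call itself is guarded so that the script also runs where RAVEN is not on the MATLAB path.

```matlab
% Named values for the positional arguments of the reconstruction call.
% The names are assumptions; verify them with: help getKEGGModelForOrganism
organismID          = 'abc';              % only used for model.id here
fastaFile           = 'inputFasta.fa';    % protein FASTA of the target species
dataDir             = 'euk90_kegg105';    % any dataset listed in the table
outDir              = 'outputDirectory';  % output directory
keepSpontaneous     = true;
keepUndefinedStoich = true;
keepIncomplete      = true;
keepGeneral         = true;
cutOff              = 10^-50;
minScoreRatioKO     = 0.8;
minScoreRatioG      = 0.3;
maxPhylDist         = -1;
nSequences          = inf;
seqIdentity         = 1;

args = {organismID, fastaFile, dataDir, outDir, ...
        keepSpontaneous, keepUndefinedStoich, keepIncomplete, keepGeneral, ...
        cutOff, minScoreRatioKO, minScoreRatioG, ...
        maxPhylDist, nSequences, seqIdentity};

% Only call RAVEN if the function is available on the path.
if exist('getKEGGModelForOrganism', 'file')
    model = getKEGGModelForOrganism(args{:});
else
    disp('RAVEN not found on the path; skipping reconstruction.');
end
```

Collecting the values in a cell array and expanding it with `args{:}` passes them in the same order as the one-line command, so the two forms are intended to be equivalent.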