-
Notifications
You must be signed in to change notification settings - Fork 44
Substitution Models
layout: userdoc title: "Substitution Models" author: AUTHOR date: DATE docid: 10 icon: book doctype: manual tags:
- manual description: All common substitution models and usages. sections:
- name: DNA models url: dna-models
- name: Protein models url: protein-models
- name: Codon models url: codon-models
- name: Binary, morphological models url: binary-and-morphological-models
- name: Ascertainment bias correction url: ascertainment-bias-correction
- name: Rate heterogeneity across sites url: rate-heterogeneity-across-sites
All common substitution models and usages.
IQ-TREE supports a wide range of substitution models, including advanced partition and mixture models. This guide gives a detailed information of all available models.
TIP: If you do not know which model to use, simply run IQ-TREE with the standard model selection (
-m TEST
option) or the new ModelFinder (-m MFP
). It automatically determines best-fit model for your data. {: .tip}
IQ-TREE includes all common DNA models (ordered by complexity):
Model | df | Explanation | Code |
---|---|---|---|
JC or JC69 | 0 | Equal substitution rates and equal base frequencies (Jukes and Cantor, 1969). | 000000 |
F81 | 3 | Equal rates but unequal base freq. (Felsenstein, 1981). | 000000 |
K80 or K2P | 1 | Unequal transition/transversion rates and equal base freq. (Kimura, 1980). | 010010 |
HKY or HKY85 | 4 | Unequal transition/transversion rates and unequal base freq. (Hasegawa, Kishino and Yano, 1985). | 010010 |
TN or TN93 | 5 | Like HKY but unequal purine/pyrimidine rates (Tamura and Nei, 1993). |
010020 |
TNe | 2 | Like TN but equal base freq. |
010020 |
K81 or K3P | 2 | Three substitution types model and equal base freq. (Kimura, 1981). | 012210 |
K81u | 5 | Like K81 but unequal base freq. |
012210 |
TPM2 | 2 | AC=AT, AG=CT, CG=GT and equal base freq. | 010212 |
TPM2u | 5 | Like TPM2 but unequal base freq. |
010212 |
TPM3 | 2 | AC=CG, AG=CT, AT=GT and equal base freq. | 012012 |
TPM3u | 5 | Like TPM3 but unequal base freq. |
012012 |
TIM | 6 | Transition model, AC=GT, AT=CG and unequal base freq. | 012230 |
TIMe | 3 | Like TIM but equal base freq. |
012230 |
TIM2 | 6 | AC=AT, CG=GT and unequal base freq. | 010232 |
TIM2e | 3 | Like TIM2 but equal base freq. |
010232 |
TIM3 | 6 | AC=CG, AT=GT and unequal base freq. | 012032 |
TIM3e | 3 | Like TIM3 but equal base freq. |
012032 |
TVM | 7 | Transversion model, AG=CT and unequal base freq. | 012314 |
TVMe | 4 | Like TVM but equal base freq. |
012314 |
SYM | 5 | Symmetric model with unequal rates but equal base freq. (Zharkikh, 1994). | 012345 |
GTR | 8 | General time reversible model with unequal rates and unequal base freq. (Tavare, 1986). | 012345 |
The last column Code
is a 6-digit code defining the equality constraints for 6 relative substitution rates: A-C, A-G, A-T, C-G, C-T and G-T. 010010
means that A-G rate is equal to C-T rate (corresponding to 1
in the code) and the remaining four substitution rates are equal (corresponding to 0
in the code). Thus, 010010
is equivalent to K80 or HKY model (depending on whether base frequencies are equal or not). 012345
is equivalent to GTR or SYM model as there is no restriction defined by such 6-digit code.
Moreover, IQ-TREE supports arbitrarily restricted DNA model via a 6-digit code, e.g. with option -m 012012+G
.
NOTE: The digits in the codes do not necessarily have to have the same order as above. That means '101101' describes the same matrix as '010010'. The last rate, which corresponds to G-T, (and all rates with the same digit) is always set equal to 1.0 for convenience because the rates are relative.
If users want to fix model parameters, append the model name with a curly bracket {
, followed by the comma-separated rate parameters, and a closing curly bracket }
. For example, GTR{1.0,2.0,1.5,3.7,2.8}
specifies 6 substitution rates A-C=1.0, A-G=2.0, A-T=1.5, C-G=3.7, C-T=2.8 and G-T=1.0.
Another example is for model TIM2
that has the 6-digit code 010232
. Thus, TIM2{4.39,5.30,12.1}
means that A-C=A-T=4.39 (coded 0
), A-G=5.30 (coded 1
), C-T=12.1 (coded 3
) and C-G=G-T=1.0 (coded 2
). This is, in turn, equivalent to specifying GTR{4.39,5.30,4.39,1.0,12.1}
.
Users can specify three different kinds of base frequencies:
FreqType | Explanation |
---|---|
+F | Empirical base frequencies. This is the default if the model has unequal base freq. |
+FQ | Equal base frequencies. |
+FO | Optimized base frequencies by maximum-likelihood. |
For example, GTR+FO
optimizes base frequencies by ML whereas GTR+F
(default) counts base frequencies directly from the alignment.
Finally, users can fix base frequencies with e.g. GTR+F{0.1,0.2,0.3,0.4}
to fix the corresponding frequencies of A, C, G and T (must sum up to 1.0).
Starting with version 1.6, IQ-TREE supports a series of Lie Markov models (Woodhams et al., 2015), many of which are non-reversible models. Lie Markov models have a consistent property, which is lacking in other common models such as GTR. The following table shows the list of all Lie Markov models (the number before .
in the name shows the number of parameters of the model):
Model | Rev? | Freq | Note |
---|---|---|---|
1.1 | Yes | 0 | equiv. to JC |
2.2b | Yes | 0 | equiv. to K2P |
3.3a | Yes | 0 | equiv. to K3P |
3.3b | No | 0 | |
3.3c | Yes | 0 | equiv. to TNe |
3.4 | Yes | 1 | |
4.4a | Yes | 3 | equiv. to F81 |
4.4b | Yes | 1 | |
4.5a | No | 1 | |
4.5b | No | 1 | |
5.6a | No | 0 | |
5.6b | No | 3 | |
5.7a | No | 2 | |
5.7b | No | 0 | |
5.7c | No | 0 | |
5.11a | No | 2 | |
5.11b | No | 0 | |
5.11c | No | 0 | |
5.16 | No | 1 | |
6.6 | No | 1 | equiv. to STRSYM (strand symmetric model) |
6.7a | No | 3 | F81+K3P |
6.7b | No | 3 | |
6.8a | No | 3 | |
6.8b | No | 1 | |
6.17a | No | 1 | |
6.17b | No | 1 | |
8.8 | No | 3 | |
8.10a | No | 3 | |
8.10b | No | 1 | |
8.16 | No | 3 | |
8.17 | No | 3 | |
8.18 | No | 3 | |
9.20a | No | 2 | |
9.20b | No | 0 | Doubly stochastic |
10.12 | No | 3 | |
10.34 | No | 3 | |
12.12 | No | 3 | equiv. to UNREST (unrestricted model) |
Column Rev? shows whether the model is reversible or not. Column Freq shows the number of free base frequencies. 0 means equal base frequency; 1 means f(A)=f(G) and f(C)=f(T); 2 means f(A)+f(G)=0.5=f(C)+f(T); 3 means unconstrained frequencies.
All Lie Markov models can have one of the following prefices:
Prefix | Meaning |
---|---|
RY | purine-pyrimidine pairing (default) |
WS | weak-strong pairing |
MK | aMino-Keto pairing |
IQ-TREE supports all common empirical amino-acid exchange rate matrices (alphabetical order):
Model | Region | Explanation |
---|---|---|
Blosum62 | nuclear | BLOcks SUbstitution Matrix (Henikoff and Henikoff, 1992). Note that BLOSUM62 is not recommended for phylogenetic analysis as it was designed mainly for sequence alignments. |
cpREV | chloroplast | chloroplast matrix (Adachi et al., 2000). |
Dayhoff | nuclear | General matrix (Dayhoff et al., 1978). |
DCMut | nuclear | Revised Dayhoff matrix (Kosiol and Goldman, 2005). |
FLU | viral | Influenza virus (Dang et al., 2010). |
HIVb | viral | HIV between-patient matrix HIV-Bm (Nickle et al., 2007). |
HIVw | viral | HIV within-patient matrix HIV-Wm (Nickle et al., 2007). |
JTT | nuclear | General matrix (Jones et al., 1992). |
JTTDCMut | nuclear | Revised JTT matrix (Kosiol and Goldman, 2005). |
LG | nuclear | General matrix (Le and Gascuel, 2008). |
mtART | mitochondrial | Mitochondrial Arthropoda (Abascal et al., 2007). |
mtMAM | mitochondrial | Mitochondrial Mammalia (Yang et al., 1998). |
mtREV | mitochondrial | Mitochondrial Vertebrate (Adachi and Hasegawa, 1996). |
mtZOA | mitochondrial | Mitochondrial Metazoa (Animals) (Rota-Stabelli et al., 2009). |
mtMet | mitochondrial | Mitochondrial Metazoa (Vinh et al., 2017). |
mtVer | mitochondrial | Mitochondrial Vertebrate (Vinh et al., 2017). |
mtInv | mitochondrial | Mitochondrial Inverterbrate (Vinh et al., 2017). |
Poisson | none | Equal amino-acid exchange rates and frequencies. |
PMB | nuclear | Probability Matrix from Blocks, revised BLOSUM matrix (Veerassamy et al., 2004). |
rtREV | viral | Retrovirus (Dimmic et al., 2002). |
VT | nuclear | General 'Variable Time' matrix (Mueller and Vingron, 2000). |
WAG | nuclear | General matrix (Whelan and Goldman, 2001). |
GTR20 | general | General time reversible models with 190 rate parameters. WARNING: Be careful when using this parameter-rich model as parameter estimates might not be stable, especially when not having enough phylogenetic information (e.g. not long enough alignments). |
IQ-TREE also supports a series of protein mixture models:
Model | Explanation |
---|---|
C10 to C60 | 10, 20, 30, 40, 50, 60-profile mixture models (Le et al., 2008a) as variants of the CAT model (Lartillot and Philippe, 2004) for ML. Note that these models assume Poisson AA replacement and implicitly include a Gamma rate heterogeneity among sites. |
EX2 | Two-matrix model for exposed/buried AA sites (Le et al., 2008b). |
EX3 | Three-matrix model for highly exposed/intermediate/buried AA sites (Le et al., 2008b). |
EHO | Three-matrix model for extended/helix/other sites (Le et al., 2008b). |
UL2, UL3 | Unsupervised-learning variants of EX2 and EX3 , respectively. |
EX_EHO | Six-matrix model combining EX2 and EHO (Le and Gascuel, 2010). |
LG4M | Four-matrix model fused with Gamma rate heterogeneity (Le et al., 2012). |
LG4X | Four-matrix model fused with FreeRate heterogeneity (Le et al., 2012). |
CF4 | Five-profile mixture model (Wang et al., 2008). |
One can even combine a protein matrix with a profile mixture model like:
-
LG+C20
: ApplyingLG
matrix instead ofPoisson
for all 20 classes of AA profiles and a Gamma rate heterogeneity. -
LG+C20+F
: ApplyingLG
matrix for 20 classes plus the 21th class of empirical AA profile (counted from the current data) and Gamma rate heterogeneity. -
JTT+CF4+G
: ApplyingJTT
matrix for all 5 classes of AA profiles and Gamma rate heteorogeneity.
Moreover, one can override the Gamma rate by FreeRate heterogeneity:
-
LG+C20+R4
: LikeLG+C20
but replace Gamma by FreeRate heterogeneity.
If the matrix name does not match any of the above listed models, IQ-TREE assumes that it is a file containing AA exchange rates and frequencies in PAML format. It contains the lower diagonal part of the matrix and 20 AA frequencies, e.g.:
0.425093
0.276818 0.751878
0.395144 0.123954 5.076149
2.489084 0.534551 0.528768 0.062556
0.969894 2.807908 1.695752 0.523386 0.084808
1.038545 0.363970 0.541712 5.243870 0.003499 4.128591
2.066040 0.390192 1.437645 0.844926 0.569265 0.267959 0.348847
0.358858 2.426601 4.509238 0.927114 0.640543 4.813505 0.423881 0.311484
0.149830 0.126991 0.191503 0.010690 0.320627 0.072854 0.044265 0.008705 0.108882
0.395337 0.301848 0.068427 0.015076 0.594007 0.582457 0.069673 0.044261 0.366317 4.145067
0.536518 6.326067 2.145078 0.282959 0.013266 3.234294 1.807177 0.296636 0.697264 0.159069 0.137500
1.124035 0.484133 0.371004 0.025548 0.893680 1.672569 0.173735 0.139538 0.442472 4.273607 6.312358 0.656604
0.253701 0.052722 0.089525 0.017416 1.105251 0.035855 0.018811 0.089586 0.682139 1.112727 2.592692 0.023918 1.798853
1.177651 0.332533 0.161787 0.394456 0.075382 0.624294 0.419409 0.196961 0.508851 0.078281 0.249060 0.390322 0.099849 0.094464
4.727182 0.858151 4.008358 1.240275 2.784478 1.223828 0.611973 1.739990 0.990012 0.064105 0.182287 0.748683 0.346960 0.361819 1.338132
2.139501 0.578987 2.000679 0.425860 1.143480 1.080136 0.604545 0.129836 0.584262 1.033739 0.302936 1.136863 2.020366 0.165001 0.571468 6.472279
0.180717 0.593607 0.045376 0.029890 0.670128 0.236199 0.077852 0.268491 0.597054 0.111660 0.619632 0.049906 0.696175 2.457121 0.095131 0.248862 0.140825
0.218959 0.314440 0.612025 0.135107 1.165532 0.257336 0.120037 0.054679 5.306834 0.232523 0.299648 0.131932 0.481306 7.803902 0.089613 0.400547 0.245841 3.151815
2.547870 0.170887 0.083688 0.037967 1.959291 0.210332 0.245034 0.076701 0.119013 10.649107 1.702745 0.185202 1.898718 0.654683 0.296501 0.098369 2.188158 0.189510 0.249313
0.079066 0.055941 0.041977 0.053052 0.012937 0.040767 0.071586 0.057337 0.022355 0.062157 0.099081 0.064600 0.022951 0.042302 0.044040 0.061197 0.053287 0.012066 0.034155 0.069147
(This is an example of an LG matrix taken from PAML package). Note that the amino-acid order in this file is:
A R N D C Q E G H I L K M F P S T W Y V
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
By default, AA frequencies are given by the model. Users can change this with:
FreqType | Explanation |
---|---|
+F | empirical AA frequencies from the data. |
+FO | ML optimized AA frequencies from the data. |
+FQ | Equal AA frequencies. |
Users can also specify AA frequencies with, e.g.:
+F{0.079066,0.055941,0.041977,0.053052,0.012937,0.040767,0.071586,0.057337,0.022355,0.062157,0.099081,0.064600,0.022951,0.042302,0.044040,0.061197,0.053287,0.012066,0.034155,0.069147}
(Example corresponds to the AA frequencies of the LG matrix).
To apply a codon model one should use the option -st CODON
to tell IQ-TREE that the alignment contains protein coding sequences (otherwise, IQ-TREE thinks that it contains DNA sequences and will apply DNA models). This implicitly applies the standard genetic code. You can change to an other genetic code by appending the appropriate ID to the CODON
keyword:
Code | Genetic code meaning |
---|---|
CODON1 | The Standard Code (same as -st CODON ) |
CODON2 | The Vertebrate Mitochondrial Code |
CODON3 | The Yeast Mitochondrial Code |
CODON4 | The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code |
CODON5 | The Invertebrate Mitochondrial Code |
CODON6 | The Ciliate, Dasycladacean and Hexamita Nuclear Code |
CODON9 | The Echinoderm and Flatworm Mitochondrial Code |
CODON10 | The Euplotid Nuclear Code |
CODON11 | The Bacterial, Archaeal and Plant Plastid Code |
CODON12 | The Alternative Yeast Nuclear Code |
CODON13 | The Ascidian Mitochondrial Code |
CODON14 | The Alternative Flatworm Mitochondrial Code |
CODON16 | Chlorophycean Mitochondrial Code |
CODON21 | Trematode Mitochondrial Code |
CODON22 | Scenedesmus obliquus Mitochondrial Code |
CODON23 | Thraustochytrium Mitochondrial Code |
CODON24 | Pterobranchia Mitochondrial Code |
CODON25 | Candidate Division SR1 and Gracilibacteria Code |
(The IDs follow the specification at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).
IQ-TREE supports several codon models:
Model | Explanation |
---|---|
MG | Nonsynonymous/synonymous (dn/ds) rate ratio (Muse and Gaut, 1994). |
MGK | Like MG with additional transition/transversion (ts/tv) rate ratio. |
MG1KTS or MGKAP2 | Like MG with a transition rate (Kosiol et al., 2007). |
MG1KTV or MGKAP3 | Like MG with a transversion rate (Kosiol et al., 2007). |
MG2K or MGKAP4 | Like MG with a transition rate and a transversion rate (Kosiol et al., 2007). |
GY | Nonsynonymous/synonymous and transition/transversion rate ratios (Goldman and Yang, 1994). |
GY1KTS or GYKAP2 | Like GY with a transition rate (Kosiol et al., 2007). |
GY1KTV or GYKAP3 | Like GY with a transversion rate (Kosiol et al., 2007). |
GY2K or GYKAP4 | Like GY with a transition rate and a transversion rate (Kosiol et al., 2007). |
ECMK07 or KOSI07 | Empirical codon model (Kosiol et al., 2007). |
ECMrest | Restricted version of ECMK07 that allows only one nucleotide exchange. |
ECMS05 or SCHN05 | Empirical codon model (Schneider et al., 2005). |
The last three models (ECMK07
, ECMrest
or ECMS05
) are called empirical codon models, whereas the others are called mechanistic codon models.
Moreover, IQ-TREE supports combined empirical-mechanistic codon models using an underscore separator (_
). For example:
-
ECMK07_GY2K
: The combinedECMK07
andGY2K
model, with the rate entries being multiplication of the two corresponding rate matrices.
Thus, there can be many such combinations.
Starting with version 1.5.6: If the model name does not match any of the above listed models, IQ-TREE assumes that it is a file containing lower diagonal part of non-stop codon exchange rate matrix, non-stop codon frequencies and a list of non-stop codons. The rest of the file will be ignored. Example files (ECMrest.dat and ECMunrest.dat) can be downloaded from the supplementary material (Kosiol et al., 2007).
NOTE: Branch lengths under codon models are interpreted as number of nucleotide substitutions per codon site. Thus, they are typically 3 times longer than under DNA models.
IQ-TREE supports the following codon frequencies:
FreqType | Explanation |
---|---|
+F | Empirical codon frequencies counted from the data. |
+FQ | Equal codon frequencies. |
+F1X4 | Unequal nucleotide frequencies but equal nt frequencies over three codon positions. |
+F3X4 | Unequal nucleotide frequencies and unequal nt frequencies over three codon positions. |
If not specified, the default codon frequency will be +F3X4
for MG
-type models, +F
for GY
-type models and given by the model for empirical codon models.
The binary alignments should contain state 0
and 1
, whereas for morphological data, the valid states are 0
to 9
and A
to Z
.
Model | Explanation |
---|---|
JC2 | Jukes-Cantor type model for binary data. |
GTR2 | General time reversible model for binary data. |
MK | Jukes-Cantor type model for morphological data. |
ORDERED | Allowing exchange of neighboring states only. |
Except for GTR2
that has unequal state frequencies, all other models have equal state frequencies.
TIP: If morphological alignments do not contain constant sites (typically the case), then an ascertainment bias correction model (
+ASC
) should be applied to correct the branch lengths for the absence of constant sites. {: .tip}
An ascertainment bias correction (+ASC
) model (Lewis, 2001) should be applied if the alignment does not contain constant sites (such as morphological or SNPs data). For example:
-
MK+ASC
: For morphological data. -
GTR+ASC
: For SNPs data.
+ASC
will correct the likelihood conditioned on variable sites. Without +ASC
, the branch lengths might be overestimated.
IQ-TREE supports all common rate heterogeneity across sites models:
RateType | Explanation |
---|---|
+I | allowing for a proportion of invariable sites. |
+G | discrete Gamma model (Yang, 1994) with default 4 rate categories. The number of categories can be changed with e.g. +G8 . |
+I+G | invariable site plus discrete Gamma model (Gu et al., 1995). |
+R | FreeRate model (Yang, 1995; Soubrier et al., 2012) that generalizes the +G model by relaxing the assumption of Gamma-distributed rates. The number of categories can be specified with e.g. +R6 (default 4 categories if not specified). The FreeRate model typically fits data better than the +G model and is recommended for analysis of large data sets. |
+I+R | invariable site plus FreeRate model. |
TIP: The new ModelFinder (
-m MFP
option) tests the FreeRate model, whereas the standard procedure (-m TEST
) does not. {: .tip}
Users can fix the parameters of the model. For example, +I{0.2}
will fix the proportion of invariable sites (pinvar) to 0.2; +G{0.9}
will fix the Gamma shape parameter (alpha) to 0.9; +I{0.2}+G{0.9}
will fix both pinvar and alpha. To fix the FreeRate model parameters, use the syntax +Rk{w1,r1,...,wk,rk}
(replacing k
with the number of categories). Here, w1, ..., wk
are the weights and r1, ..., rk
the rates for each category.
NOTE: For the
+G
model IQ-TREE implements the mean approximation approach (Yang, 1994). The same is done in RAxML and PhyML. However, some programs like TREE-PUZZLE implement the median approximation approach, which makes the resulting log-likelihood not comparable. IQ-TREE can change to this approach via the-gmedian
option.
Copyright (c) 2010-2016 IQ-TREE development team.
- First example
- Model selection
- New model selection
- Codon models
- Binary, Morphological, SNPs
- Ultrafast bootstrap
- Nonparametric bootstrap
- Single branch tests
- Partitioned analysis
- Partitioning with mixed data
- Partition scheme selection
- Bootstrapping partition model
- Utilizing multi-core CPUs
- Tree topology tests
- User-defined models
- Consensus construction and bootstrap value assignment
- Computing Robinson-Foulds distance
- Generating random trees
- DNA models
- Protein models
- Codon models
- Binary, morphological models
- Ascertainment bias correction
- Rate heterogeneity
- Counts files
- First running example
- Substitution models
- Virtual population size
- Sampling method
- Bootstrap branch support
- Interpretation of branch lengths