diff --git a/src/guides/bioinformatics/index.rst b/src/guides/bioinformatics/index.rst
index b5d4b751..3eb84899 100644
--- a/src/guides/bioinformatics/index.rst
+++ b/src/guides/bioinformatics/index.rst
@@ -14,6 +14,7 @@ How-to guides for performing bioinformatic anaylses with Nextstrain.
    augur_snakemake
    missing-sequence-data
    translate_ref
+   root-and-ref-seqs
    import-beast
    defining-clades
    colors
diff --git a/src/guides/bioinformatics/root-and-ref-seqs.rst b/src/guides/bioinformatics/root-and-ref-seqs.rst
new file mode 100644
index 00000000..cfdf42d6
--- /dev/null
+++ b/src/guides/bioinformatics/root-and-ref-seqs.rst
@@ -0,0 +1,41 @@
+==========================================
+Understanding root and reference sequences
+==========================================
+
+“Root” and “reference” sequences are components of Nextstrain trees that are sometimes confused because they play similar roles in phylogenetic analyses. The root node sequence is the sequence at the root node of the phylogeny, and it represents the most recent common ancestor of all samples in the tree (Fig. 1a). This sequence is inferred during phylogenetic analysis, and is not a sample in the dataset. In contrast, the reference sequence is the sequence against which all other samples in the tree are compared for genome alignment and annotation, and it is `chosen by the user `_ prior to phylogenetic analysis. Since the reference sequence is used for genome alignment and annotation, it will ideally be a high-quality genome assembly with accurate annotations. For the same reason, it does not have to be included as a tip in the tree, although it can be.
+
+.. image:: ../../images/root-and-parent-node-trees.png
+   :alt: Phylogenetic trees illustrating root node and parent node
+
+One reason the terms “root” and “reference” can be confused stems from the common practice in phylogenetics of defining an outgroup to root a phylogeny. Since defining an outgroup and defining a reference both require the user to choose a particular genome sequence, the outgroup and reference could be misconstrued as being the same component of a tree, both related to rooting a tree. However, as stated above, the primary function of the reference is to provide a high-quality sequence for alignment; it is usually not the outgroup, and often is not a tip in the tree at all.
+
+Another reason the terms “root” and “reference” can be confused arises from the unique way in which reference sequences are used in Nextclade datasets, which are a particular feature of Nextstrain. One of the main functions of Nextclade datasets is to report mutations (SNPs and amino acid changes) that are present in new sequences provided by the user. Those mutations need to be reported relative to a particular sequence, which could potentially be the inferred root node sequence. However, the inferred root node sequence does not have carefully curated annotations, and in fact could have stop codons or amino acids that do not occur in real life. Thus, for Nextclade datasets, it is preferable to define mutations against a high-quality, annotated genome assembly, such as the reference sequence. However, reporting mutations relative to the reference sequence presents a programmatic challenge, since the reference sequence could be anywhere in the tree, or completely absent from the tree. Nextstrain addresses this challenge by introducing the idea of a “parent node” that is represented by the reference sequence and is placed on the tree as an ancestor of the root node (Fig. 1b). The placement of this parent node does not represent a real biological relationship, but serves a practical function: it enables programmatic encoding of the mutational differences between the reference sequence and the root node, thereby enabling mutation-calling against the reference sequence (rather than the root node sequence) for each sample sequence in the tree.
+
+In some sense, adding the reference sequence as a parent node to the tree makes the reference sequence “become” the root sequence, and that terminology is sometimes used in Nextstrain. Importantly, the parent node itself is not encoded in the tree; only the mutational differences between the root and parent nodes are encoded. Therefore the parent node will not be present when the tree is visualized in Auspice; instead, the mutational differences between the parent and root nodes will be reported as characteristics of the root node.
+
+What do I need to know about root and reference sequences if I'm making a Nextclade dataset?
+============================================================================================
+
+If you're making a standard `Nextclade dataset `_, you will need to add a reference sequence as a parent node to the tree you plan to use for the dataset. This can be accomplished by providing the reference sequence to ``augur ancestral`` using the ``--root-sequence`` option when creating the tree. This will effectively encode the mutational differences between the reference sequence and the root node in the tree structure. Notably, the parent node cannot be added after the Auspice JSON has been created for the tree. For example, if you use a tree that was created by someone else as an 'ad hoc' Nextclade dataset, you cannot add a new sequence as a parent node.
+
+If you're making a tree that you would like to use as an 'ad hoc' Nextclade dataset, adding a reference sequence as a parent node to the tree is optional. If you don't add a parent node, mutations in new sequences that are placed on the tree will be defined against the inferred root node sequence. However, it's important to note that an extra step is needed to provide the root (or parent) node sequence to the Auspice JSON of an 'ad hoc' Nextclade dataset, since this sequence is not included by default in Auspice JSON files. This sequence can be provided in one of the following ways:
+
+* Use the ``--include-root-sequence-inline`` option in ``augur export`` to add the root node sequence (or parent node sequence, if one was defined in ``augur ancestral``) to the end of the Auspice JSON. Alternatively, the ``--include-root-sequence`` option will output the same root (or parent) node sequence to a sidecar JSON file.
+
+* Provide the root (or parent) node sequence on the command line using the ``--input-ref`` option in ``nextclade run``. Note that this option only works on the command line and is not available with the drag-and-drop option for the Nextclade dataset. Make sure to provide the correct sequence here: if the tree has a parent node, then the parent node sequence should be provided, whereas if the tree does not have a parent node, then the inferred root node sequence should be provided. The appropriate sequence can be obtained as described in the previous bullet point.
+
+
+What can go wrong?
+==================
+
+When creating standard Nextclade datasets, a common mistake is forgetting to add the reference sequence to the tree as a parent node using the ``--root-sequence`` option in ``augur ancestral``. If you forget to do this, the mutational differences between the parent and root node sequences will not be encoded in the tree. Since Nextclade assumes that the reference sequence in your FASTA file is the parent node (which is effectively the root node of the encoded tree), the program will notice that the mutation calls on the branches don't match up with the reference sequence. This is because mutation calls include specific information about the reference sequence; they are defined in the format “T36G”, in which “T” is expected to be the nucleotide in the reference sequence at position 36. If those mutation calls don't match up with the reference sequence (e.g., the reference sequence does not have a “T” at position 36), Nextclade will produce an error message::
+
+    "Please check that your reference tree is consistent with your reference sequence."
+
+When creating 'ad hoc' Nextclade datasets, a common mistake is forgetting to add the root node sequence (or the parent node sequence, if one was added to the tree) to the end of the Auspice JSON using the ``--include-root-sequence-inline`` option in ``augur export`` as described above. This will result in the following error message when you try to use the tree as a Nextclade dataset::
+
+    "The current tree isn't usable as a Nextclade dataset as this dataset doesn't have a root-sequence."
+
+You will also get an error message if you try to define the root or parent node sequence on the command line, but mistakenly use the wrong sequence (for example, if you use the reference sequence when a parent node has not been defined, or you use the inferred root node sequence when a parent node has been defined). This will produce the same consistency error shown above::
+
+    "Please check that your reference tree is consistent with your reference sequence."
diff --git a/src/images/root-and-parent-node-trees.png b/src/images/root-and-parent-node-trees.png
new file mode 100644
index 00000000..536bd734
Binary files /dev/null and b/src/images/root-and-parent-node-trees.png differ