Add an explanation of root and reference sequences #227

Merged 1 commit on Sep 16, 2024
1 change: 1 addition & 0 deletions src/guides/bioinformatics/index.rst
@@ -14,6 +14,7 @@ How-to guides for performing bioinformatic anaylses with Nextstrain.
augur_snakemake
missing-sequence-data
translate_ref
root-and-ref-seqs
import-beast
defining-clades
colors
41 changes: 41 additions & 0 deletions src/guides/bioinformatics/root-and-ref-seqs.rst
@@ -0,0 +1,41 @@
==========================================
Understanding root and reference sequences
==========================================

“Root” and “reference” sequences are components of Nextstrain trees that are sometimes confused because they play similar roles in phylogenetic analyses. The root sequence is the sequence at the root node of the phylogeny, and it represents the most recent common ancestor of all samples in the tree (Fig. 1a). This sequence is inferred during phylogenetic analysis and is not a sample in the dataset. In contrast, the reference sequence is the sequence against which all other samples in the tree are compared for genome alignment and annotation, and it is `chosen by the user <https://docs.nextstrain.org/en/latest/guides/bioinformatics/translate_ref.html>`_ prior to phylogenetic analysis. Because of this role, the reference sequence should ideally be a high-quality genome assembly with accurate annotations. It does not have to be included as a tip in the tree, although it can be.

.. image:: ../../images/root-and-parent-node-trees.png
    :alt: Phylogenetic trees illustrating root node and parent node

One reason the terms “root” and “reference” can be confused stems from the common practice in phylogenetics of defining an outgroup to root a phylogeny. Since defining an outgroup and defining a reference both require the user to choose a particular genome sequence, the two could be misconstrued as the same component of a tree, both related to rooting it. However, as stated above, the primary function of the reference is to provide a high-quality sequence for alignment. It is usually not the outgroup, and often it is not a tip in the tree at all.

Another reason the terms root and reference can be confused arises from the way reference sequences are used in Nextclade datasets, which are a particular feature of Nextstrain. One of the main functions of a Nextclade dataset is to report the mutations (SNPs and amino acid changes) present in new sequences provided by the user. Those mutations need to be reported relative to a particular sequence, which could in principle be the inferred root node sequence. However, the inferred root node sequence does not have carefully curated annotations, and it may contain stop codons or amino acids that do not occur in real life. For Nextclade datasets, it is therefore preferable to define mutations against a high-quality, annotated genome assembly, such as the reference sequence. Doing so presents a programmatic challenge, since the reference sequence could be anywhere in the tree or absent from it entirely. Nextstrain addresses this challenge by introducing a “parent node” that is represented by the reference sequence and is placed on the tree as an ancestor of the root node (Fig. 1b). The placement of this parent node does not represent a real biological relationship; it serves a practical function. It allows the mutational differences between the reference sequence and the root node to be encoded, so that mutations for each sample in the tree can be called against the reference sequence rather than the root node sequence.

In some sense, adding the reference sequence as a parent node makes the reference sequence “become” the root sequence, and this terminology is sometimes used in Nextstrain. Importantly, the parent node itself is not encoded in the tree; only the mutational differences between the root and parent nodes are encoded. The parent node will therefore not appear when the tree is visualized in Auspice; instead, the mutational differences between the parent and root nodes are reported as characteristics of the root node.
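
To make "mutational differences between the parent and root nodes" concrete, the following is a minimal Python sketch, not Augur's implementation. It assumes the two sequences are already aligned to the same length and only considers substitutions. The function name ``mutations_between`` and the toy sequences are hypothetical.

.. code-block:: python

    def mutations_between(parent_seq, root_seq):
        """List substitutions that turn parent_seq into root_seq, e.g. 'T36G'.

        Positions are 1-based. Both sequences must be aligned and of equal
        length; this is an illustrative simplification.
        """
        if len(parent_seq) != len(root_seq):
            raise ValueError("sequences must be aligned to the same length")
        return [
            f"{parent_base}{position}{root_base}"
            for position, (parent_base, root_base)
            in enumerate(zip(parent_seq, root_seq), start=1)
            if parent_base != root_base
        ]

    reference = "ATGCTA"      # hypothetical reference (parent node) sequence
    inferred_root = "ATGGTC"  # hypothetical inferred root node sequence
    print(mutations_between(reference, inferred_root))  # ['C4G', 'A6C']

In this picture, the list printed above is what would be attached to the root node, while the parent node itself never appears in the tree.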

What do I need to know about root and reference sequences if I'm making a Nextclade dataset?
============================================================================================

If you're making a standard `Nextclade dataset <https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html>`_, you will need to add a reference sequence as a parent node to the tree you plan to use for the dataset. This can be accomplished by providing the reference sequence to ``augur ancestral`` using the ``--root-sequence`` option when creating the tree. This effectively encodes the mutational differences between the reference sequence and the root node in the tree structure. Notably, the parent node cannot be added after the Auspice JSON has been created for the tree. For example, if you use a tree that was created by someone else as an 'ad hoc' Nextclade dataset, you cannot add a new sequence as a parent node.
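
As a rough illustration of this step, the sketch below assembles an ``augur ancestral`` invocation that passes the reference via ``--root-sequence``. The helper name ``ancestral_command`` and all file names are placeholders, and a real workflow will likely need further options, so treat it as a starting point rather than a complete recipe.

.. code-block:: python

    import shlex

    def ancestral_command(tree, alignment, reference, node_data):
        """Build an ``augur ancestral`` argument list that supplies the
        reference sequence as the parent node (file names are placeholders)."""
        return [
            "augur", "ancestral",
            "--tree", tree,
            "--alignment", alignment,
            "--root-sequence", reference,
            "--output-node-data", node_data,
        ]

    cmd = ancestral_command(
        "tree.nwk", "aligned.fasta", "reference.fasta", "nt_muts.json"
    )
    print(shlex.join(cmd))
    # With Augur installed, the command could be run with:
    #   subprocess.run(cmd, check=True)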

If you're making a tree that you would like to use as an 'ad hoc' Nextclade dataset, adding a reference sequence as a parent node to the tree is optional. If you don't add a parent node, mutations in new sequences that are placed on the tree will be defined against the inferred root node sequence. However, it's important to note that an extra step is needed to provide the root (or parent) node sequence to the Auspice JSON of an 'ad hoc' Nextclade dataset, since this sequence is not included by default in Auspice JSON files. This sequence can be provided in one of the following ways:

* Use the ``--include-root-sequence-inline`` option in ``augur export`` to add the root node sequence (or parent node sequence, if one was defined in ``augur ancestral``) to the end of the Auspice JSON. Alternatively, the ``--include-root-sequence`` option will output the same root (or parent) node sequence to a sidecar JSON file.

* Provide the root (or parent) node sequence on the command line using the ``--input-ref`` option in ``nextclade run``. Note that this option only works on the command line and is not available for the drag-and-drop option for the Nextclade dataset. Make sure to provide the correct sequence here: if the tree has a parent node, provide the parent node sequence, whereas if the tree does not have a parent node, provide the inferred root node sequence. The appropriate sequence can be obtained as described in the previous bullet point.
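
Before using a tree as an 'ad hoc' dataset, it can help to check whether a root (or parent) node sequence is actually available. The Python sketch below does this under two assumptions that should be verified against your own files: that the inline sequence is stored under a top-level ``root_sequence`` key of the Auspice JSON, and that both the inline and sidecar forms keep the nucleotide sequence under ``nuc``. The function name ``find_root_sequence`` is hypothetical.

.. code-block:: python

    import json

    def find_root_sequence(auspice_json_text, sidecar_json_text=None):
        """Return the nucleotide root sequence, or None if none is available.

        Assumes an inline "root_sequence" object in the Auspice JSON and a
        "nuc" entry in both the inline and sidecar forms.
        """
        dataset = json.loads(auspice_json_text)
        inline = dataset.get("root_sequence")
        if inline and "nuc" in inline:
            return inline["nuc"]
        if sidecar_json_text is not None:
            return json.loads(sidecar_json_text).get("nuc")
        return None

    # Toy inputs standing in for real files.
    with_inline = json.dumps({"version": "v2", "root_sequence": {"nuc": "ATGC"}})
    without = json.dumps({"version": "v2"})
    sidecar = json.dumps({"nuc": "ATGC"})

    print(find_root_sequence(with_inline))       # ATGC
    print(find_root_sequence(without))           # None
    print(find_root_sequence(without, sidecar))  # ATGC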


What can go wrong?
==================

When creating standard Nextclade datasets, a common mistake is forgetting to add the reference sequence to the tree as a parent node using the ``--root-sequence`` option in ``augur ancestral``. If you forget to do this, the mutational differences between the parent and root node sequences will not be encoded in the tree. Since Nextclade assumes that the reference sequence in your FASTA file is the parent node (which is effectively the root node of the encoded tree), the program will notice that the mutation calls on the branches don't match up with the reference sequence. This is because mutation calls include specific information about the reference sequence; they are defined in the format “T36G”, in which “T” is expected to be the nucleotide in the reference sequence at position 36. If those mutation calls don't match up with the reference sequence (e.g., the reference sequence does not have a “T” at position 36), Nextclade will produce an error message::

    "Please check that your reference tree is consistent with your reference sequence."

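The kind of mismatch behind this error can be sketched in a few lines of Python. This is a simplified illustration, not Nextclade's actual check: it ignores tree structure and assumes every mutation passed in is the first change at its position when walking down from the root, so that its "from" base should equal the reference. The names ``MUTATION`` and ``inconsistent_mutations`` and the toy data are hypothetical.

.. code-block:: python

    import re

    # A mutation such as "T36G": original base, 1-based position, new base.
    MUTATION = re.compile(r"^([A-Z-])(\d+)([A-Z-])$")

    def inconsistent_mutations(mutations, reference):
        """Return the mutations whose original base disagrees with the reference."""
        problems = []
        for mutation in mutations:
            match = MUTATION.match(mutation)
            if match is None:
                raise ValueError(f"unrecognized mutation: {mutation!r}")
            ref_base, position = match.group(1), int(match.group(2))
            out_of_range = position < 1 or position > len(reference)
            if out_of_range or reference[position - 1] != ref_base:
                problems.append(mutation)
        return problems

    reference = "ACGT"  # hypothetical reference sequence
    print(inconsistent_mutations(["A1G", "T2C"], reference))  # ['T2C']

Here “T2C” is flagged because the toy reference has “C”, not “T”, at position 2.
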
When creating 'ad hoc' Nextclade datasets, a common mistake is forgetting to add the root node sequence (or the parent node sequence, if one was added to the tree) to the end of the Auspice JSON using the ``--include-root-sequence-inline`` option in ``augur export`` as described above. This will result in the following error message when you try to use the tree as a Nextclade dataset::

    "The current tree isn't usable as a Nextclade dataset as this dataset doesn't have a root-sequence."

You will also get an error message if you try to define the root or parent node sequence on the command line, but mistakenly use the wrong sequence (for example, if you use the reference sequence when a parent node has not been defined, or you use the inferred root node sequence when a parent node has been defined). This will produce an error message::

    "Please check that your reference tree is consistent with your reference sequence."
Binary file added src/images/root-and-parent-node-trees.png