Consider how to load in BEAST trees with compact descriptions of a history #42
---

I just had a look, and noticed that the newicks exported by dendropy have commas in the extended newick field (for example, in

---

I've run into this before... Unfortunately, the ete newick parser treats all commas in a newick string as node separators, and has no concept of context for commas appearing in the extended newick field. The workaround is to convert these fields to a string format that's compatible with ete before the dendropy export. For example, if the node attribute

However, if you load the newick containing this annotation back into ete, the node attribute

I think the easiest thing would be to add dendropy tree loading to historydag.
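As a rough sketch of that workaround (my own illustration, not the exact conversion used here; the replacement delimiter is an assumption), one could rewrite the commas inside the annotation value before exporting:

```python
# Hypothetical sketch: replace the commas inside a history_all annotation value
# with another delimiter so ete's newick parser won't treat them as node
# separators when the tree is written out and read back.
def ete_safe(history_all_value):
    # e.g. "{{12,1.1,A,G},{1,0.99,A,C}}" -> "{{12|1.1|A|G}|{1|0.99|A|C}}"
    return str(history_all_value).replace(",", "|")
```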
---

K, thanks for looking into this. We can put it on ice for the time being.
---

#47 provides a generalized `history_dag_from_trees`:

```python
dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in treelist],
    [],
    label_functions={'sequence': lambda node: node.annotations.get_value('history_all')},
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)
```

However, it looks to me like these are mutations relative to the parent node, not relative to a fixed reference? If so, we're going to need the reference sequence, and probably also the complete leaf sequences, if they contain any ambiguities. I'll look into this more next week.
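To make the relative-to-parent point concrete, here is a toy illustration in plain Python (my own example with made-up sequences, not the historydag API):

```python
# Toy example: a per-branch mutation string like "A2G" records a change
# relative to the *parent's* sequence, so mutations must be composed along
# the path from the root rather than applied to a fixed reference.
def apply_mut(seq, mut):
    from_base, idx, to_base = mut[0], int(mut[1:-1]), mut[-1]
    assert seq[idx - 1] == from_base  # sites are 1-based
    return seq[: idx - 1] + to_base + seq[idx:]

root = "AAAA"
child = apply_mut(root, "A2G")        # "AGAA"
grandchild = apply_mut(child, "G2C")  # "ACAA"; "G2C" only makes sense relative to the parent
```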
---

GREAT!

These are indeed relative to parent.
---

Here's a working script to load these trees:

```python
import historydag as hdag
from functools import lru_cache
import dendropy

# dendropy doesn't parse nested lists correctly in metadata, so we load the
# trees with raw comment strings using `extract_comment_metadata`
dp_trees = dendropy.TreeList.get(
    path="clade_13.GTR.history.trees", schema="nexus", extract_comment_metadata=False
)

# This should be replaced with the true reference genome.
reference_genome = "A" * 30000
reference_cg = hdag.compact_genome.CompactGenome({}, reference_genome)


def comment_parser(node_comments):
    if len(node_comments) == 0:
        yield from ()
        return
    elif len(node_comments) == 1:
        comment_string = node_comments[0]
    else:
        raise ValueError("node_comments has more than one element" + str(node_comments))
    if "history_all=" not in comment_string:
        yield from ()
        return
    else:
        mutations_string = comment_string.split("history_all=")[-1]
        stripped_mutations_list = mutations_string[2:-2]
        if stripped_mutations_list:
            mutations_list = stripped_mutations_list.split("},{")
            for mut in mutations_list:
                try:
                    idx_str, _, from_base, to_base = mut.split(",")
                except ValueError:
                    raise ValueError("comment_parser failed on: " + str(node_comments))
                assert to_base in 'AGCT'
                yield from_base + idx_str + to_base
        else:
            yield from ()
            return


_comment_parser_test_data = [
    (["&rate=1.0,mutations=0.0"], []),
    (["&rate=1.0,mutations=0.0,history_all={}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{}}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G}}"], ["A12G"]),
    (
        ["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G},{1,0.99,A,C}}"],
        ["A12G", "A1C"],
    ),
]

for in_data, expected in _comment_parser_test_data:
    parsed_data = list(comment_parser(in_data))
    if parsed_data != expected:
        raise RuntimeError(
            f"comment_parser: unexpected output {parsed_data} from input: {in_data}. Expected {expected}"
        )


@lru_cache(maxsize=(2 * len(dp_trees[0].nodes())))
def compute_cg(node):
    if node.parent_node is None:
        # base case: node is a root node
        parent_cg_mut = reference_cg
    else:
        parent_cg_mut = compute_cg(node.parent_node)
    return parent_cg_mut.apply_muts(comment_parser(node.comments))


dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in dp_trees],
    [],
    label_functions={
        "compact_genome": compute_cg,
    },
    attr_func=lambda n: {
        "name": (n.taxon.label if n.is_leaf() else "internal"),
    },
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)
dag = hdag.mutation_annotated_dag.CGHistoryDag.from_history_dag(dag)
dag.summary()
```

The output from `dag.summary()`:
I'll open a dendropy issue to get correct parsing of these newicks figured out. Note that just counting the mutations along all the branches of the BEAST trees gives wildly different (larger) parsimony scores, because there are lots of mutation reversions on single branches.
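For a toy illustration of how reversions inflate raw per-branch mutation counts (my own example, using the mutation-string format produced by `comment_parser` above and simplified to a single site):

```python
# Two recorded events on one branch that revert each other contribute 2 to a
# raw mutation count, but 0 net changes at that site.
branch_events = ["A12G", "G12A"]   # events recorded along a single branch
raw_count = len(branch_events)     # 2
net_count = 0 if branch_events[0][0] == branch_events[-1][-1] else 1
assert (raw_count, net_count) == (2, 0)
```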
---

Issue opened: jeetsukumaran/DendroPy#145
---

Here's a modified script that fixes the reference sequence to be the sequence associated to the leaf `England/CAMC-F00E8F/2021|2021-01-07`:

```python
import historydag as hdag
from functools import lru_cache
import dendropy
import xml.etree.ElementTree as ET

# get alignment from xml:
_etree = ET.parse('clade_13.GTR.xml')
_alignment = _etree.getroot().find('alignment')
fasta = {a[0].attrib['idref'].strip(): a[0].tail.strip() for a in _alignment}

# dendropy doesn't parse nested lists correctly in metadata, so we load the
# trees with raw comment strings using `extract_comment_metadata`
dp_trees = dendropy.TreeList.get(
    path="clade_13.GTR.history.trees", schema="nexus", extract_comment_metadata=False
)

# This should be replaced with the true reference genome.
reference_leaf = 'England/CAMC-F00E8F/2021|2021-01-07'
reference_leaf_sequence = fasta[reference_leaf]
reference_leaf_cg = hdag.compact_genome.CompactGenome({}, reference_leaf_sequence)


# ### Comment parser and testing
def comment_parser(node_comments):
    if len(node_comments) == 0:
        yield from ()
        return
    elif len(node_comments) == 1:
        comment_string = node_comments[0]
    else:
        raise ValueError("node_comments has more than one element" + str(node_comments))
    if "history_all=" not in comment_string:
        yield from ()
        return
    else:
        mutations_string = comment_string.split("history_all=")[-1]
        stripped_mutations_list = mutations_string[2:-2]
        if stripped_mutations_list:
            mutations_list = stripped_mutations_list.split("},{")
            for mut in mutations_list:
                try:
                    idx_str, _, from_base, to_base = mut.split(",")
                except ValueError:
                    raise ValueError("comment_parser failed on: " + str(node_comments))
                assert to_base in 'AGCT'
                yield from_base + idx_str + to_base
        else:
            yield from ()
            return


_comment_parser_test_data = [
    (["&rate=1.0,mutations=0.0"], []),
    (["&rate=1.0,mutations=0.0,history_all={}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{}}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G}}"], ["A12G"]),
    (
        ["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G},{1,0.99,A,C}}"],
        ["A12G", "A1C"],
    ),
]

for in_data, expected in _comment_parser_test_data:
    parsed_data = list(comment_parser(in_data))
    if parsed_data != expected:
        raise RuntimeError(
            f"comment_parser: unexpected output {parsed_data} from input: {in_data}. Expected {expected}"
        )
# ### End comment parser testing


def compute_cgs(tree):
    ref_node = tree.find_node_with_taxon_label(reference_leaf)
    ref_cg = reference_leaf_cg
    while True:
        ref_cg = ref_cg.apply_muts(list(comment_parser(ref_node.comments)), reverse=True)
        ref_node = ref_node.parent_node
        if ref_node is None:
            break

    @lru_cache(maxsize=(2 * len(dp_trees[0].nodes())))
    def compute_cg(node):
        if node.parent_node is None:
            # base case: node is a root node
            parent_cg_mut = ref_cg
        else:
            parent_cg_mut = compute_cg(node.parent_node)
        return parent_cg_mut.apply_muts(comment_parser(node.comments))

    for node in tree.preorder_node_iter():
        node.cg = compute_cg(node)


for tree in dp_trees:
    compute_cgs(tree)

dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in dp_trees],
    [],
    label_functions={
        "compact_genome": lambda n: n.cg,
    },
    attr_func=lambda n: {
        "name": (n.taxon.label if n.is_leaf() else "internal"),
    },
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)
dag = hdag.mutation_annotated_dag.CGHistoryDag.from_history_dag(dag)
dag.summary()
```
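To illustrate the reverse-application step in `compute_cgs` with plain strings (a toy check with made-up sequences, not the CompactGenome API):

```python
# Walking from a known leaf toward the root and undoing each branch's recorded
# mutations recovers the ancestral sequence, which is what
# apply_muts(..., reverse=True) is doing above.
def revert_mut(seq, mut):
    from_base, idx, to_base = mut[0], int(mut[1:-1]), mut[-1]
    assert seq[idx - 1] == to_base  # the leaf currently carries the derived base
    return seq[: idx - 1] + from_base + seq[idx:]

leaf = "AGAA"                     # observed leaf sequence
parent = revert_mut(leaf, "A2G")  # the branch above the leaf recorded "A2G"; undo it
assert parent == "AAAA"
```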
---

^ My last comment has now been edited to load the alignment from the BEAST XML file, and use the appropriate one as the starting sequence for attempting to reconstruct the reference. More work needs to be done though, since all of the leaves are ambiguous, and are resolved differently in each tree. I haven't completely decided how to resolve this, but I'll come back to it tomorrow.
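For instance (a hypothetical illustration with a toy reference, reusing only the `CompactGenome.apply_muts` call pattern from the scripts above), differently resolved ambiguities give the same named leaf different labels:

```python
import historydag as hdag

# Toy reference genome, not the real data.
toy_cg = hdag.compact_genome.CompactGenome({}, "A" * 30)
leaf_cg_in_tree_1 = toy_cg.apply_muts(["A12G"])  # ambiguous site 12 resolved to G
leaf_cg_in_tree_2 = toy_cg.apply_muts(["A12T"])  # ...or resolved to T in another tree
# The labels differ, so the DAG cannot merge this leaf across the two trees.
```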
---

I've included a generalized version of this, with fixes thanks to Jiansi, in the

In addition to using the most recent state of that branch, you must have

Example use, equivalent to the scripts above (with some additional fixes):

```python
import historydag

dag = historydag.beast_loader.dag_from_beast_trees('clade_13.GTR.xml', 'clade_13.GTR.history.trees')
dag.summary()
```

Now the output looks like this:
So, we still have some different realizations of the same leaf, but not as bad as before. The key insight to correctly load these trees was that BEAST ignores sites that are all

In addition, I remove sites which are all
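As a sketch of that site-removal step (a hypothetical helper of my own, with the character left as a parameter since it's elided above):

```python
# Find alignment columns at which every sequence carries the same character,
# so those sites can be dropped before building compact genomes.
def uniform_sites(alignment, char):
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0])) if all(s[i] == char for s in seqs)]

# e.g. uniform_sites(fasta, some_character), with the `fasta` dict built from
# the BEAST XML alignment as in the scripts above.
```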
---

Thanks to @jsigao we now understand that BEAST can output compact history descriptions that read like
We'd like to be able to load these and make a history DAG of them.
Here is a zipped up file with the original `.trees` file from BEAST, and also a file that has those annotations converted to NHX like so:

I then went in by hand and deleted the first comment on each line, which ete3 didn't like.
However, I wasn't able to get these immediately loaded into ete3 using something like
So I'm afraid this isn't the tidiest issue, but hopefully it's a start.
Before diving into any particular approach, let's keep the big picture in mind. We want to be able to load these trees into the hDAG. We could