Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for leaf labels with ambiguities #38

Open
marybarker opened this issue Oct 25, 2022 · 2 comments
Open

Support for leaf labels with ambiguities #38

marybarker opened this issue Oct 25, 2022 · 2 comments
Assignees

Comments

@marybarker
Copy link
Contributor

It seems likely that we can accommodate ambiguous bases in leaf labels using the mutation annotated dag/compact genome DAG structure, and do so without losing the parsimony criterion, as long as the internal node labels are fully unambiguous.

If the internal node labels (in compact genome format) are fully disambiguated, then for any tree sampled from the DAG, its parsimony score can be computed explicitly from the internal nodes, and minimized over the resolutions of ambiguous leaf labels.

We need to explicitly prove the ansatz that resolutions of leaf labels can be performed independently of each other, and here is a place to start that conversation.

@willdumm
Copy link
Collaborator

Let $h'$ be the minimum possible hamming distance on an edge (between two nodes with ambiguous sequences). Consider a history $t = (V_t, E_t)$ whose leaves have ambiguous sequence labels, but whose internal node sequence labels are unambiguous. We need to show that
$$\sum_{e\in E_t} h'(e) = \min_{t' \text{ a resolution of t }} \sum_{e\in E_{t'}} h(e)$$

However, in such a tree $t$, all nodes with ambiguous sequences have degree 1, and their unique parent has an unambiguous sequence. To minimize the sum on the right is to choose leaf sequences which minimizes $h$ on each pendant edge. Since each such choice of sequence changes the value of $h$ on any other edge, we have the equality above.

This $h'$ then works as an edge-weight function in the hDAG, with all the properties of normal hamming distance on an unambiguous dag, as long as all internal nodes in the DAG ( and therefore all internal nodes in any tree in the DAG) have unambiguous sequences.

@willdumm
Copy link
Collaborator

willdumm commented Oct 27, 2022

In historydag this will be provided by a subclass of HistoryDag expecting each node label to contain a leaf_ambiguous_sequence field, or something similar, which will be expected to contain no ambiguities unless it is on a leaf node.

We'll need a version of hamming distance which returns the minimum distance possible between two nodes.

Not sure if we'd ever want it, but we could also have a subclass that includes ambiguous sequences at every node, which would allow e.g. min-weight-ambiguous labeled trees to be used to create a DAG... I guess we need to think about how these things are related

In Larch, we need a way of keeping track of leaf IDs, so we can correctly replace the resolved sequences chosen for leaves by matOptimize with the original ambiguous sequences before merging back to the DAG. We also want to provide ambiguous sequences (in the internal representation for vcf data of usher) to matOptimize when optimizing a tree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants