Support for leaf labels with ambiguities #38

marybarker · 2022-10-25T21:19:04Z

It seems likely that we can accommodate ambiguous bases in leaf labels using the mutation annotated dag/compact genome DAG structure, and do so without losing the parsimony criterion, as long as the internal node labels are fully unambiguous.

If the internal node labels (in compact genome format) are fully disambiguated, then for any tree sampled from the DAG, its parsimony score can be computed explicitly from the internal nodes, and minimized over the resolutions of ambiguous leaf labels.

We need to explicitly prove the ansatz that resolutions of leaf labels can be performed independently of each other, and here is a place to start that conversation.

willdumm · 2022-10-27T17:38:41Z

Let $h'$ be the minimum possible hamming distance on an edge (between two nodes with ambiguous sequences). Consider a history $t = (V_t, E_t)$ whose leaves have ambiguous sequence labels, but whose internal node sequence labels are unambiguous. We need to show that
$$\sum_{e\in E_t} h'(e) = \min_{t' \text{ a resolution of t }} \sum_{e\in E_{t'}} h(e)$$

However, in such a tree $t$, all nodes with ambiguous sequences have degree 1, and their unique parent has an unambiguous sequence. To minimize the sum on the right is to choose leaf sequences which minimizes $h$ on each pendant edge. Since each such choice of sequence changes the value of $h$ on any other edge, we have the equality above.

This $h'$ then works as an edge-weight function in the hDAG, with all the properties of normal hamming distance on an unambiguous dag, as long as all internal nodes in the DAG ( and therefore all internal nodes in any tree in the DAG) have unambiguous sequences.

willdumm · 2022-10-27T17:46:29Z

In historydag this will be provided by a subclass of HistoryDag expecting each node label to contain a leaf_ambiguous_sequence field, or something similar, which will be expected to contain no ambiguities unless it is on a leaf node.

We'll need a version of hamming distance which returns the minimum distance possible between two nodes.

Not sure if we'd ever want it, but we could also have a subclass that includes ambiguous sequences at every node, which would allow e.g. min-weight-ambiguous labeled trees to be used to create a DAG... I guess we need to think about how these things are related

In Larch, we need a way of keeping track of leaf IDs, so we can correctly replace the resolved sequences chosen for leaves by matOptimize with the original ambiguous sequences before merging back to the DAG. We also want to provide ambiguous sequences (in the internal representation for vcf data of usher) to matOptimize when optimizing a tree.

willdumm mentioned this issue Nov 10, 2022

Summary of goals matsengrp/larch#36

Open

willdumm self-assigned this Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for leaf labels with ambiguities #38

Support for leaf labels with ambiguities #38

marybarker commented Oct 25, 2022

willdumm commented Oct 27, 2022

willdumm commented Oct 27, 2022 •

edited

Loading

Support for leaf labels with ambiguities #38

Support for leaf labels with ambiguities #38

Comments

marybarker commented Oct 25, 2022

willdumm commented Oct 27, 2022

willdumm commented Oct 27, 2022 • edited Loading

willdumm commented Oct 27, 2022 •

edited

Loading