Break tsinfer polytomies using mutational density #363

hyanwong · 2024-01-11T22:40:46Z

If we had a single dated tree with a polytomy and mutations on the branches, we could probably use the relative rate of mutation along the branches to break the polytomy. E.g. in a polytomy like the following, it would probably be reasonable to split the polytomy by grouping together 2 and 3
. Or more formally. we could probably calculate the mutational likelihood of the 15 possible bifurcating topologies and pick the most likely.

Another more heuristic approach would be to calculate the average mutational rate between each pair of tips (i.e. the number of mutations between them divided by the branch-length distance), and then do something like UPGMA on the pairwise rates.

It's unclear to me how to extend this to a polytomy in a tree sequence, which could have different numbers of child edges coming and going. But generally, you could imagine the children of a polytomy "pulling" or "pushing" the parent node to differing strengths, measured by the difference between the parent & child posterior distributions and the edge mutational area separating them. It should be possible to work out the bets way to resolve this tension by creating new internal nodes that break the polytomy.

nspope · 2024-01-12T04:47:27Z

I can think of a simple model-based way to approach this, using EP and the existing variational approximation:

Consider a node $k$. If $k$ has two children, skip it. Otherwise, introduce a hypothetical node $j$ that is an "unobserved" child of $k$.
Assume that the only parent of $j$ is $k$. This makes the problem local.
Use a 2-state mixture model, where "observations" are the edges beneath $k$, and hidden states represent (A) edges where $k$ is the parent; (B) edges where $j$ is the parent.
This could be solved by EP: iterate over edges that are beneath $k$. For each edge, the EP update needs to sum over the two possibilities, while marginalizing out parent/child node ages. This is pretty similar to the current scheme, and requires an additional parameter per edge, call it $\rho_i$ for the $i$th edge, that is the probability that $j$ is the parent.
On convergence, use some threshold $p$ to sort edges. Edges where $\rho_i < p$ are kept with $k$ as the parent; edges where $\rho_i > p$ are assigned a new parent which is $j$. A new edge with 0 mutations is added between $k$ and $j$, with a span that is the union of the span of all $j$'s children.
Repeat steps 1-5 on $j$ rather than $k$ (e.g. with the subset of edges that were assigned to $j$ on the last go-round). And so on, until the polytomy is "resolved" and the remaining edges all have $\rho_i < p$.
Because the problem is local, we don't have to rebuild/resort the tree sequence after each node is visited: we can do a single pass through the tree sequence to find polytomies, then resolve each polytomy separately (which could be parallelized), then modify the tree sequence to introduce the new daughter nodes/edges.

This wouldn't resolve polytomies into binary trees -- instead, it'd resolve polytomies into a set of multiple, smaller polytomies. In the pic you included, we'd probably end up with something like ((2, 3), 0, 1) so that node 4 has three children instead of four. But this could potentially help with reducing inconsistencies that impact dating (and it'd be scalable).

Do you have an algorithmic method for introducing polytomies, @hyanwong? If we have a small test problem (a mini-tree-sequence with a single non-trivial polytomy that involves 10 nodes or so), then I can try to implement the above and see if it'd work

hyanwong · 2024-01-12T13:38:18Z

Really neat! I'm unclear how this would deal with edges that span a single parent node but where that node has different numbers of children in different regions of the genome. E.g. what happens if we have a node with 3 children at one end of the genome, 2 in the middle, and 3 again at the end?

Re algorithmic methods for generating polytomies: no, but I guess it would be quite easy to write one. I'll have a ply. We do have a lot of inferred ones with polytomies, of course, but you are right that knowing the ground truth would be much better.

nspope · 2024-01-12T18:56:39Z

I'm unclear how this would deal with edges that span a single parent node but where that node has different numbers of children in different regions of the genome. E.g. what happens if we have a node with 3 children at one end of the genome, 2 in the middle, and 3 again at the end?

I'm probably underestimating how complicated this is, but I was thinking of ignoring position entirely. For example, the node you describe could have sequence of children (A, B, C) -> (B, D) -> (A, C, D). The algorithm I describe would "cluster" A,B,C,D into two groups, and assign the latter group a new node as a parent. Call the new node E, and say B,D get assigned to it. Then the sequence of children for the original node would be (A, E, C) -> (E) -> (A, C, E). And the sequence of children for E (over the same spans) would be B -> (B, D) -> D. I think this always works out, because if we assume that E has a single parent (the original node) then the transmission paths are the same.

hyanwong · 2024-01-13T13:07:34Z

I've just put a naive method for collapsing bifurcations into polytomies at tskit-dev/tskit#2885

I guess this could be used to test if we can recover extra topology from mutation density.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Break tsinfer polytomies using mutational density #363

Break tsinfer polytomies using mutational density #363

hyanwong commented Jan 11, 2024 •

edited

Loading

nspope commented Jan 12, 2024 •

edited

Loading

hyanwong commented Jan 12, 2024

nspope commented Jan 12, 2024 •

edited

Loading

hyanwong commented Jan 13, 2024

Break tsinfer polytomies using mutational density #363

Break tsinfer polytomies using mutational density #363

Comments

hyanwong commented Jan 11, 2024 • edited Loading

nspope commented Jan 12, 2024 • edited Loading

hyanwong commented Jan 12, 2024

nspope commented Jan 12, 2024 • edited Loading

hyanwong commented Jan 13, 2024

hyanwong commented Jan 11, 2024 •

edited

Loading

nspope commented Jan 12, 2024 •

edited

Loading

nspope commented Jan 12, 2024 •

edited

Loading