First pass at BEAST import. #11

jeromekelleher · 2020-05-26T16:05:52Z

An initial pass at doing BEAST tree imports. Unfortunately, it looks like node times are not maintained between adjacent trees and so we cannot turn it the input into a meaningful tree sequence (well, not a succinct one, anyway). We will always get a JBOT out of BEAST output as far as I can tell.

For example, I tried this code on the PRS.trees file I found on figshare. This has 429 leaves, and is quite a bit tree file. Typically, there's 10,000 trees in a BEAST analysis, which in this case gives a 407MB nexus file.

Running the conversion code on a 46MB subset of this, we get a 41MB .trees file (not including the "rates" annotations for nodes, or any other metadata; we should include the sample names, but the rates are probably irrelevant). So there's no real compression here. However, the load time is a massive improvement - parsing the trees file with dendropy is extremely slow (I tried to run on the full 407MB file, but my computer ran out of memory (16G) after about 10 minutes).

I'm not sure why we can't identify nodes by their times. In the fairly large case of 429 leaves it seems implausible that an MCMC would change every branch in every tree after 5000 samples. We're definitely not failing to identify nodes because of the inherent numerical precision crappiness of newick and representation of time by branch lengths, as I've experimented a bit with precision and even nodes that are close to the leaves will change time. In this example, I looked at a node that was the parent of two leaves in many adjacent trees, and its time varies slightly from tree to tree:

This clearly isn't a precision issue --- I think BEAST must be recomputing the branch lengths for each tree.

If we put in a really low precision then we can't identify nodes by their times, as we get lots of nodes with the same time in the same tree.

JereKoskela · 2020-05-26T19:04:02Z

There is a move in BEAST called "Internal node heights", which I think resamples the heights of all internal nodes and hence changes every branch length in one go if accepted. I think it also has quite a high default probability of taking place (BEAST randomises exactly which update takes place at each time step). See here: https://beast.community/first_tutorial#mcmc-operators

Addendum: The corollary is that if we wanted to store all output (as opposed to just very thinned output), then we wouldn't see a JBOT except when a global update like this gets accepted. So what you're doing could well be worthwhile if it enables a feasible reduction in thinning.

jeromekelleher · 2020-05-26T19:08:07Z

Aha, thanks @JereKoskela! That makes perfect sense now.

First pass at BEAST import.

0f9651e

winni2k mentioned this pull request Jun 23, 2020

networkx.DiGraph import #12

Open

benjeffery changed the base branch from master to main September 28, 2020 12:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

First pass at BEAST import. #11

First pass at BEAST import. #11

jeromekelleher commented May 26, 2020 •

edited

Loading

JereKoskela commented May 26, 2020 •

edited

Loading

jeromekelleher commented May 26, 2020

First pass at BEAST import. #11

Are you sure you want to change the base?

First pass at BEAST import. #11

Conversation

jeromekelleher commented May 26, 2020 • edited Loading

JereKoskela commented May 26, 2020 • edited Loading

jeromekelleher commented May 26, 2020

jeromekelleher commented May 26, 2020 •

edited

Loading

JereKoskela commented May 26, 2020 •

edited

Loading