Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at BEAST import. #11

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jeromekelleher
Copy link
Member

@jeromekelleher jeromekelleher commented May 26, 2020

An initial pass at doing BEAST tree imports. Unfortunately, it looks like node times are not maintained between adjacent trees and so we cannot turn it the input into a meaningful tree sequence (well, not a succinct one, anyway). We will always get a JBOT out of BEAST output as far as I can tell.

For example, I tried this code on the PRS.trees file I found on figshare. This has 429 leaves, and is quite a bit tree file. Typically, there's 10,000 trees in a BEAST analysis, which in this case gives a 407MB nexus file.

Running the conversion code on a 46MB subset of this, we get a 41MB .trees file (not including the "rates" annotations for nodes, or any other metadata; we should include the sample names, but the rates are probably irrelevant). So there's no real compression here. However, the load time is a massive improvement - parsing the trees file with dendropy is extremely slow (I tried to run on the full 407MB file, but my computer ran out of memory (16G) after about 10 minutes).

I'm not sure why we can't identify nodes by their times. In the fairly large case of 429 leaves it seems implausible that an MCMC would change every branch in every tree after 5000 samples. We're definitely not failing to identify nodes because of the inherent numerical precision crappiness of newick and representation of time by branch lengths, as I've experimented a bit with precision and even nodes that are close to the leaves will change time. In this example, I looked at a node that was the parent of two leaves in many adjacent trees, and its time varies slightly from tree to tree:

0.03891914
0.03791369
0.03836194
0.03904479
0.0376319
0.03849151
0.03825644
0.03834921
0.03793981
0.03898491
0.03776027
0.03747061
0.03821779
0.03823619
0.03809714
0.03738423
0.03947856
0.03735052
0.03842629
0.03865787
0.03809055
0.04046528
0.04071309
0.0405558
0.04293515
0.04248632
0.04139696
0.04035231
0.04113374
0.04305416
0.04191715

This clearly isn't a precision issue --- I think BEAST must be recomputing the branch lengths for each tree.

If we put in a really low precision then we can't identify nodes by their times, as we get lots of nodes with the same time in the same tree.

@JereKoskela
Copy link
Member

JereKoskela commented May 26, 2020

There is a move in BEAST called "Internal node heights", which I think resamples the heights of all internal nodes and hence changes every branch length in one go if accepted. I think it also has quite a high default probability of taking place (BEAST randomises exactly which update takes place at each time step). See here: https://beast.community/first_tutorial#mcmc-operators

Addendum: The corollary is that if we wanted to store all output (as opposed to just very thinned output), then we wouldn't see a JBOT except when a global update like this gets accepted. So what you're doing could well be worthwhile if it enables a feasible reduction in thinning.

@jeromekelleher
Copy link
Member Author

Aha, thanks @JereKoskela! That makes perfect sense now.

@benjeffery benjeffery changed the base branch from master to main September 28, 2020 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants