An initial pass at doing BEAST tree imports. Unfortunately, it looks like node times are not maintained between adjacent trees, so we cannot turn the input into a meaningful tree sequence (well, not a succinct one, anyway). As far as I can tell, we will always get a JBOT out of BEAST output.
For example, I tried this code on the PRS.trees file I found on figshare. This has 429 leaves, and is quite a big tree file. Typically, there are 10,000 trees in a BEAST analysis, which in this case gives a 407MB nexus file.
Running the conversion code on a 46MB subset of this, we get a 41MB .trees file. That does not include the "rates" annotations for nodes, or any other metadata; we should include the sample names, but the rates are probably irrelevant. So there's no real compression here. However, the load time is a massive improvement: parsing the trees file with dendropy is extremely slow (I tried to run on the full 407MB file, but my computer, with 16G of RAM, ran out of memory after about 10 minutes).
I'm not sure why we can't identify nodes by their times. In the fairly large case of 429 leaves it seems implausible that an MCMC would change every branch in every tree after 5000 samples. We're definitely not failing to identify nodes because of the inherent numerical precision crappiness of newick and representation of time by branch lengths, as I've experimented a bit with precision and even nodes that are close to the leaves will change time. In this example, I looked at a node that was the parent of two leaves in many adjacent trees, and its time varies slightly from tree to tree:
This clearly isn't a precision issue; I think BEAST must be recomputing the branch lengths for each tree.
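To make the check concrete, here is a minimal sketch of how node times can be recovered from branch lengths and compared across two trees. It is not the conversion code in this PR: the nested-list tree representation, the `node_times` helper, and the branch lengths are all made up for illustration, and leaves are assumed to sit at time 0.

```python
def node_times(tree):
    """Return {frozenset of leaf names: time} for every internal node.

    A tree is either a leaf name (str) or a list of (child, branch_length)
    pairs. Time is measured backwards from the leaves, which sit at time 0.
    This representation is hypothetical, chosen to keep the sketch small.
    """
    times = {}

    def visit(node):
        # Returns (leaf set, time) for the subtree rooted at node.
        if isinstance(node, str):
            return frozenset([node]), 0.0
        leaves = frozenset()
        time = 0.0
        for child, length in node:
            child_leaves, child_time = visit(child)
            leaves |= child_leaves
            # In an ultrametric tree every child gives the same value;
            # taking the max tolerates rounding in the branch lengths.
            time = max(time, child_time + length)
        times[leaves] = time
        return leaves, time

    visit(tree)
    return times


# Two made-up adjacent samples with identical topology but slightly
# different branch lengths.
tree_a = [([("A", 0.1200), ("B", 0.1200)], 0.3800), ("C", 0.5000)]
tree_b = [([("A", 0.1231), ("B", 0.1231)], 0.3790), ("C", 0.5021)]

times_a = node_times(tree_a)
times_b = node_times(tree_b)

# Clades (leaf sets) present in both trees, and the subset of those
# whose node time is also unchanged.
shared_clades = times_a.keys() & times_b.keys()
same_time = {c for c in shared_clades if abs(times_a[c] - times_b[c]) < 1e-9}
print(len(shared_clades), len(same_time))
```

In this toy input both clades are shared but neither keeps its time, which is the pattern described above: the topology is stable while the times drift.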
If we put in a really low precision then we can't identify nodes by their times, as we get lots of nodes with the same time in the same tree.
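The trade-off with precision can be sketched as follows. This is only an illustration with made-up node times from a single tree; `colliding_nodes` is a hypothetical helper, not part of this PR.

```python
from collections import Counter


def colliding_nodes(times, digits):
    """Count nodes whose time, rounded to `digits` decimals, is not unique."""
    rounded = Counter(round(t, digits) for t in times)
    return sum(n for n in rounded.values() if n > 1)


# Made-up internal node times from one tree.
times = [0.1231, 0.1249, 0.1302, 0.4570, 0.4612, 0.9000]

for digits in (4, 2, 1):
    print(digits, colliding_nodes(times, digits))
```

As the number of retained digits drops, more nodes within the same tree share a rounded time, so time stops being usable as a node identifier.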