In-memory data structures and trees API #2
Comments
@molpopgen, do you think my statement about the struct-of-arrays quintuply linked tree model being more efficient than standard linked tree structures is true, at least for naive implementations? I'm imagining comparing against a few other implementations:
My intuition would be that for large trees we should see performance for (say) the parsimony algorithm, implemented in these different ways, increase - I may be entirely wrong though! It's probably a good idea to keep parsimony as a running real-world example. Any thoughts? |
So, if your hypothetical linked-list of trees has the form, The main way to improve the current tskit setup, I think, may be to collect the "right" parts of those lists into structs, and have some smaller number of arrays of structs. Any time you access more than one feature of node Almost as if by magic, an array-of-structs implementation of this stuff is out there because I was curious myself... |
I guess my goal here is to convince people that the way we're laying things out in memory is a good idea, and I want to challenge latent assumptions that linked structures in memory are the only/best way to do trees. I'm sure there are lots of people out there who automatically assume a simple C++ linked object representation of a tree is the best possible way of doing things (by virtue of being written in C++!). So, I'm not really bothered about improving tskit, or about having anything that could realistically work well in the context of having the tree change as we go along the genome. This is just a single tree - how well does tskit stack up against textbook implementations? |
Yeah, I'm on board here--my point was more that I think the current way must be quite close to optimal, and that improving is tough. |
Yep, agreed. Hopefully no one thinks that pointer-based linked lists would be the way to go anymore, but who knows. Edit: deleted my idea, as it is "changing along the genome"... |
Something like this is what I had in mind @molpopgen:

    from __future__ import annotations

    import dataclasses
    from typing import List

    import msprime


    @dataclasses.dataclass
    class Node:
        time: float
        flags: int
        parent: Node | None = None
        children: List[Node] = dataclasses.field(default_factory=list)


    ts = msprime.sim_ancestry(5, random_seed=234)
    tree = ts.first()

    object_map = {}
    for u in tree.nodes(order="postorder"):
        tsk_node = ts.node(u)
        node_obj = Node(tsk_node.time, tsk_node.flags)
        object_map[u] = node_obj
        for v in tree.children(u):
            child_obj = object_map[v]
            child_obj.parent = node_obj
            node_obj.children.append(child_obj)
    root = node_obj
    print(root)

I'd imagine the starting point would be an idiomatic modern C++ version of this, and then we'd sequentially make it less idiomatic to gain performance (presumably). |
So, if I'm following you here, we need to:
I don't have a good sense of how current code bases are doing 1 in practice. I certainly hope no one is using the built-in linked list at this point in time--its performance impacts should be well-known by now. |
I don't think the textbook pointer-based tree structure is a straw-man @molpopgen - have a look at the Usher code for example. In the first instance, what I'd like to do is this:
I'm happy to do most of the work on this, but it would be super helpful if you could set up the C++ infrastructure needed to do it. I guess we can set tskit as a git submodule here? The goal here is to analyse the effect of different ways of laying out tree structures in memory when we have BIG trees in a real-world application. Parsimony seems like a good example, since it requires pre- and postorder traversal, and has a fairly typical dynamic programming structure. |
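To make the parsimony example concrete, here is a rough sketch of the postorder (scoring) pass of Fitch parsimony run directly on integer arrays rather than on linked node objects. It is only an illustration and not code from tskit or Usher: the function names `postorder` and `fitch_score`, the five-node tree, and the leaf states are all made up for this sketch, and it assumes a strictly binary tree with a single root. The preorder pass that assigns states to internal nodes is left out.

```python
NULL = -1  # sentinel for "no node" (assumption: same convention as tskit.NULL)


def postorder(root, left_child, right_sib):
    """Return the nodes below root in postorder, using only integer arrays."""
    stack = [root]
    order = []
    while stack:
        u = stack.pop()
        order.append(u)
        v = left_child[u]
        while v != NULL:  # push the children of u, left to right
            stack.append(v)
            v = right_sib[v]
    # A right-to-left preorder, reversed, is a left-to-right postorder.
    order.reverse()
    return order


def fitch_score(root, left_child, right_sib, leaf_state):
    """Minimum number of state changes under Fitch parsimony (binary trees)."""
    state_sets = {}
    score = 0
    for u in postorder(root, left_child, right_sib):
        v = left_child[u]
        if v == NULL:  # leaf: its set is just the observed state
            state_sets[u] = {leaf_state[u]}
            continue
        common = set(state_sets[v])
        union = set(state_sets[v])
        v = right_sib[v]
        while v != NULL:
            common &= state_sets[v]
            union |= state_sets[v]
            v = right_sib[v]
        if common:
            state_sets[u] = common
        else:  # children disagree: one extra change is needed
            state_sets[u] = union
            score += 1
    return score


# Hypothetical tree: leaves 0, 1, 2; node 3 has children 0 and 1;
# root 4 has children 2 and 3.
left_child = [NULL, NULL, NULL, 0, 2]
right_sib = [1, NULL, 3, NULL, NULL]

print(fitch_score(4, left_child, right_sib, {0: "A", 1: "A", 2: "G"}))
print(fitch_score(4, left_child, right_sib, {0: "A", 1: "C", 2: "G"}))
```

The same two functions could be written against the linked `Node` objects, which is the comparison being proposed here; the array version touches only two flat integer arrays during traversal.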
I just read the usher header and I see what you mean. I'll put something together. |
See also #25 |
Following on from the section describing the "on file" data entities in the model, I think we should have a section describing the "in memory" data structures.
This is the quintuply linked tree, which we should explain with an example.
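As a possible starting point for that example, here is a minimal sketch of the five arrays for a tiny tree. The array names (`parent`, `left_child`, `right_child`, `left_sib`, `right_sib`) and the use of -1 as the null value follow tskit's convention; the helper names `build_quintuply_linked` and `children`, the five-node tree, and the choice to order siblings by node ID are assumptions made just for illustration.

```python
NULL = -1  # sentinel for "no node", mirroring tskit.NULL


def build_quintuply_linked(parent):
    """Derive the child and sibling arrays from a parent array.

    Assumption for this sketch: siblings are ordered by increasing node ID.
    """
    n = len(parent)
    left_child = [NULL] * n
    right_child = [NULL] * n
    left_sib = [NULL] * n
    right_sib = [NULL] * n
    for u in range(n):
        p = parent[u]
        if p == NULL:
            continue  # roots have no parent and no siblings in this sketch
        last = right_child[p]
        if last == NULL:
            left_child[p] = u  # first child seen for p
        else:
            right_sib[last] = u  # link u after the previous child
            left_sib[u] = last
        right_child[p] = u
    return left_child, right_child, left_sib, right_sib


def children(u, left_child, right_sib):
    """Iterate over the children of u, left to right."""
    v = left_child[u]
    while v != NULL:
        yield v
        v = right_sib[v]


# Hypothetical tree: leaves 0, 1, 2; node 3 has children 0 and 1;
# root 4 has children 2 and 3.
parent = [3, 3, 4, 4, NULL]
left_child, right_child, left_sib, right_sib = build_quintuply_linked(parent)

print("parent     ", parent)
print("left_child ", left_child)
print("right_child", right_child)
print("left_sib   ", left_sib)
print("right_sib  ", right_sib)
print("children of root:", list(children(4, left_child, right_sib)))
```

Every relationship is an integer index into a flat array, so moving to a parent, the first child, or the next sibling is a single array lookup, with no pointers to follow.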
Some things to discuss: