Scalability of large subsets #100

rvosa · 2024-05-02T10:36:13Z

In the current BOLD database, nearly all taxonomic families are small enough to be tractable for the tree inference engine, raxml. The working rule of thumb is that raxml should be able to handle up to roughly 10k input sequences. Because the tree searches are topologically constrained, the search space is much smaller than in an unconstrained search, so this limit may well be realistic. Nevertheless, a small number of families (<10) exceed this size, and none of them have been attempted yet. As a first step, we need to test whether raxml can process them. If there are problems, these families will need to be partitioned further, e.g. down to the subfamily or genus level. This will have implications for the parallelization strategy (see #99).
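
To make the fallback concrete, here is a minimal sketch of how oversized families could be split at progressively finer ranks. It is not taken from the pipeline: the record layout (dicts with `family`, `subfamily` and `genus` keys), the function name `partition` and the `"unassigned"` bucket are all assumptions for illustration. Only the 10k threshold comes from the rule of thumb above.

```python
from collections import defaultdict

MAX_SEQUENCES = 10_000  # rule-of-thumb limit mentioned in this issue
RANKS = ("family", "subfamily", "genus")  # assumed rank order, coarse to fine


def partition(records, max_size=MAX_SEQUENCES, ranks=RANKS):
    """Group records by taxon, descending one rank whenever a group is too big.

    `records` is an iterable of dicts keyed by rank name (hypothetical layout).
    Returns a dict mapping a tuple of taxon names (one per rank used) to the
    list of records in that subset. A subset that is still larger than
    `max_size` at the finest rank is returned as-is, so the caller can flag it.
    """

    def split(recs, depth, path):
        # Bucket the records by their name at the current rank.
        groups = defaultdict(list)
        for rec in recs:
            groups[rec.get(ranks[depth]) or "unassigned"].append(rec)

        out = {}
        for name, members in groups.items():
            key = path + (name,)
            if len(members) > max_size and depth + 1 < len(ranks):
                # Too large for one run: split again at the next finer rank.
                out.update(split(members, depth + 1, key))
            else:
                out[key] = members
        return out

    return split(list(records), 0, ())


if __name__ == "__main__":
    # Toy data with a tiny threshold, just to show the shape of the result.
    toy = [
        {"family": "A", "subfamily": "X", "genus": "g1"},
        {"family": "A", "subfamily": "X", "genus": "g2"},
        {"family": "A", "subfamily": "Y", "genus": "g3"},
        {"family": "B", "subfamily": "Z", "genus": "g4"},
    ]
    for key, members in sorted(partition(toy, max_size=2).items()):
        print("/".join(key), len(members))
```

Each resulting subset would then be one unit of work, which is why the choice of rank interacts with the parallelization strategy in #99. Records lacking a name at the finer rank end up in a single `"unassigned"` bucket here; whether that is acceptable for the constrained tree search is an open question.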

This issue is considered 'done' when the BOLD 10M data set has been processed successfully.

rvosa · 2024-05-29T22:50:05Z

@AnnemiekeSchonthaler, could you attach the plots you generated for the Lepidoptera, showing the running times?

AnnemiekeSchonthaler · 2024-05-30T08:12:47Z

@rvosa I attached the plots in the doc folder on the lepidoptera branch.

rvosa added this to the Roadmap NLeSC/Naturalis collaboration milestone May 2, 2024

rvosa added this to BACTRIA moon shot May 29, 2024

rvosa moved this to Todo in BACTRIA moon shot May 29, 2024

rvosa assigned AnnemiekeSchonthaler May 29, 2024
