Look into different representation of sets of identifiers in database #82

apetkau · 2021-07-29T16:12:49Z

For storing sets of samples, I chose to use https://pypi.org/project/pyroaring/ to encode sets of integers (representing sample identifiers) and store in a table like:

1. Sets as binary blobs

Feature ID	Sample Set (pyroaring)
feature1	{1, 5, 7}

This method will contain 1 row per feature (so rows grows as O(f) where f is the number of features). The number of rows is independent of the number of samples (other than the more complex relationship between number of features and number of samples).

2. Sets as rows in table

Another method to store this information would be to unroll the identifiers into rows:

Feature ID	Sample ID
feature1	1
feature1	5
feature1	7

Here, the number of rows is 1 per feature/sample pair. That is rows grows as O(s * f) where f is the number of features and s is the number of samples.

Initial decision

Now, I originally chose method (2) since I was concerned about the number of rows I would produce and was getting some slower speeds in a simple test dataset I made (though it was a basic test and slow down could have been for other reasons).

As a rough estimate, if I wanted to store 2.5 million samples (current number of SARS-Cov-2 genomes in GISAID) with 30,000 features (~1 per bp), I would need 30,000 rows for option (1) but 2.5 million * 30,000 = 7.5 * 10^10 rows for option (2).

So, this made me go with option (1), which also has the advantage of letting me easily load up sets of samples by looking up a single feature/few features instead of long SQL queries.

Issue

However, option (2) has advantages too. In option (2) I can lookup which samples belong to a feature, or which features belong to a sample from the same table. In option (1) I need a separate table for storing this information.

Also, option (2) is relying a lot more on the built-in relational tables structure and SQL, whereas in option (1) I need to write some of that myself (e.g., insertions of new samples means I need to read the sets of samples from the database, join to new sample identifiers, and rewrite to my database in my own code).

I like the performance that option (1) is giving me, but at some point I may need to rethink using option (2).

The text was updated successfully, but these errors were encountered:

apetkau added the data model The internal data model of the application. label Aug 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Look into different representation of sets of identifiers in database #82

Look into different representation of sets of identifiers in database #82

apetkau commented Jul 29, 2021

Look into different representation of sets of identifiers in database #82

Look into different representation of sets of identifiers in database #82

Comments

apetkau commented Jul 29, 2021

1. Sets as binary blobs

2. Sets as rows in table

Initial decision

Issue