Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Look into different representation of sets of identifiers in database #82

Open
apetkau opened this issue Jul 29, 2021 · 0 comments
Open
Labels
data model The internal data model of the application.

Comments

@apetkau
Copy link
Owner

apetkau commented Jul 29, 2021

For storing sets of samples, I chose to use https://pypi.org/project/pyroaring/ to encode sets of integers (representing sample identifiers) and store in a table like:

1. Sets as binary blobs

Feature ID Sample Set (pyroaring)
feature1 {1, 5, 7}

This method will contain 1 row per feature (so rows grows as O(f) where f is the number of features). The number of rows is independent of the number of samples (other than the more complex relationship between number of features and number of samples).

2. Sets as rows in table

Another method to store this information would be to unroll the identifiers into rows:

Feature ID Sample ID
feature1 1
feature1 5
feature1 7

Here, the number of rows is 1 per feature/sample pair. That is rows grows as O(s * f) where f is the number of features and s is the number of samples.

Initial decision

Now, I originally chose method (2) since I was concerned about the number of rows I would produce and was getting some slower speeds in a simple test dataset I made (though it was a basic test and slow down could have been for other reasons).

As a rough estimate, if I wanted to store 2.5 million samples (current number of SARS-Cov-2 genomes in GISAID) with 30,000 features (~1 per bp), I would need 30,000 rows for option (1) but 2.5 million * 30,000 = 7.5 * 10^10 rows for option (2).

So, this made me go with option (1), which also has the advantage of letting me easily load up sets of samples by looking up a single feature/few features instead of long SQL queries.

Issue

However, option (2) has advantages too. In option (2) I can lookup which samples belong to a feature, or which features belong to a sample from the same table. In option (1) I need a separate table for storing this information.

Also, option (2) is relying a lot more on the built-in relational tables structure and SQL, whereas in option (1) I need to write some of that myself (e.g., insertions of new samples means I need to read the sets of samples from the database, join to new sample identifiers, and rewrite to my database in my own code).

I like the performance that option (1) is giving me, but at some point I may need to rethink using option (2).

@apetkau apetkau added the data model The internal data model of the application. label Aug 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data model The internal data model of the application.
Projects
None yet
Development

No branches or pull requests

1 participant