For storing sets of samples, I chose to use https://pypi.org/project/pyroaring/ to encode sets of integers (representing sample identifiers) and store in a table like:
1. Sets as binary blobs
| Feature ID | Sample Set (pyroaring) |
| --- | --- |
| feature1 | {1, 5, 7} |
This method will contain 1 row per feature (so the number of rows grows as O(f), where f is the number of features). The number of rows is independent of the number of samples (other than the more complex relationship between the number of features and the number of samples).
2. Sets as rows in table
Another method to store this information would be to unroll the identifiers into rows:
| Feature ID | Sample ID |
| --- | --- |
| feature1 | 1 |
| feature1 | 5 |
| feature1 | 7 |
Here, there is 1 row per feature/sample pair. That is, the number of rows grows as O(s * f), where f is the number of features and s is the number of samples.
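A sketch of this unrolled layout (again assuming SQLite with hypothetical table and column names) shows the symmetry it buys: both lookup directions are plain SQL against the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature_samples (feature_id TEXT, sample_id INTEGER)")
conn.executemany(
    "INSERT INTO feature_samples VALUES (?, ?)",
    [("feature1", 1), ("feature1", 5), ("feature1", 7), ("feature2", 5)],
)

# Samples for a feature ...
samples = [
    row[0]
    for row in conn.execute(
        "SELECT sample_id FROM feature_samples "
        "WHERE feature_id = ? ORDER BY sample_id",
        ("feature1",),
    )
]
# ... and features for a sample, from the same table
features = [
    row[0]
    for row in conn.execute(
        "SELECT feature_id FROM feature_samples "
        "WHERE sample_id = ? ORDER BY feature_id",
        (5,),
    )
]
print(samples)   # [1, 5, 7]
print(features)  # ['feature1', 'feature2']
```

An index on each column would keep both queries fast, at the cost of the O(s * f) row count discussed above.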
Initial decision
Now, I originally considered method (2), but I was concerned about the number of rows it would produce, and I was seeing slower speeds with it in a simple test dataset I made (though it was a basic test, and the slowdown could have had other causes).
As a rough estimate, if I wanted to store 2.5 million samples (the current number of SARS-CoV-2 genomes in GISAID) with 30,000 features (~1 per bp), I would need 30,000 rows for option (1) but 2.5 million * 30,000 = 7.5 * 10^10 rows for option (2).
So, this made me go with option (1), which also has the advantage of letting me easily load up sets of samples by looking up a single feature/few features instead of long SQL queries.
Issue
However, option (2) has advantages too. With option (2) I can look up which samples belong to a feature, or which features belong to a sample, from the same table. With option (1) I need a separate table for storing this information.
Also, option (2) relies much more on the built-in relational table structure and SQL, whereas with option (1) I need to write some of that myself (e.g., inserting new samples means I need to read the sets of samples from the database, union in the new sample identifiers, and write them back to the database in my own code).
I like the performance that option (1) is giving me, but at some point I may need to rethink using option (2).