Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

establish whether duckdb is a viable replacement to sqlite #150

Open
GavinHuttley opened this issue Jul 30, 2024 · 0 comments
Open

establish whether duckdb is a viable replacement to sqlite #150

GavinHuttley opened this issue Jul 30, 2024 · 0 comments

Comments

@GavinHuttley
Copy link

GavinHuttley commented Jul 30, 2024

duckdb has many advantages, but my initial effort is at least 4x slower in just building the data base to represent genome annotation data.

The goal for this issue is to try and tune the database structure, SQL syntax and / or the python code for creating the database.

Select a single file set from http://ftp.ensembl.org/pub/release-112/gff3/homo_sapiens/Homo_sapiens.GRCh38.112.chromosome.<number or X/Y>.gff3.gz where "number or X/Y" is an integer from 1-22, X, Y. Pick a sample that exposes the performance difference but is not too big! As a guess, try the one for 16.

To address this issue you need to be working on the add-duckdb branch. The classes to be compared are ensembl_lite._genome.EnsemblGffDb (implemented using sqlite3) versus ensembl_lite._genome.EnsemblGffDuckDb.
The method for evaluating performance is present on both classes and is called add_records.

Make a benchmarking script based on the function ensembl_lite._genome.make_annotation_db(). This function takes both the input path of the gff3.gz file and the destination to write the output.

Some thoughts on why duckdb is slow

Here are a few thoughts on what I think is likely to contribute to the performance difference. The major one is that DuckDB's SQL syntax does not have an auto-increment keyword for primary keys. So to address this, my initial effort was based on using a method for grabbing the largest value in the ID column and incrementing that by one. SQL templates for these commands are in ensembl_lite._storage_mixin.

This strategy is clearly not going to scale well. I did make an initial attempt at having the EnsemblGffDuckDb class keep a record of the index and update it after adding every new record (see the usage of self._feature_id).

  • Perhaps this needs to be extended to the other key tables?
  • Perhaps it's not necessary to create these index columns directly and that DuckDB will do it automatically?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant