establish whether duckdb is a viable replacement to sqlite #150

GavinHuttley · 2024-07-30T08:47:42Z

duckdb has many advantages, but my initial effort is at least 4x slower in just building the data base to represent genome annotation data.

The goal for this issue is to try and tune the database structure, SQL syntax and / or the python code for creating the database.

Select a single file set from http://ftp.ensembl.org/pub/release-112/gff3/homo_sapiens/Homo_sapiens.GRCh38.112.chromosome.<number or X/Y>.gff3.gz where "number or X/Y" is an integer from 1-22, X, Y. Pick a sample that exposes the performance difference but is not too big! As a guess, try the one for 16.

To address this issue you need to be working on the add-duckdb branch. The classes to be compared are ensembl_lite._genome.EnsemblGffDb (implemented using sqlite3) versus ensembl_lite._genome.EnsemblGffDuckDb.
The method for evaluating performance is present on both classes and is called add_records.

Make a benchmarking script based on the function ensembl_lite._genome.make_annotation_db(). This function takes both the input path of the gff3.gz file and the destination to write the output.

Some thoughts on why duckdb is slow

Here are a few thoughts on what I think is likely to contribute to the performance difference. The major one is that DuckDB's SQL syntax does not have an auto-increment keyword for primary keys. So to address this, my initial effort was based on using a method for grabbing the largest value in the ID column and incrementing that by one. SQL templates for these commands are in ensembl_lite._storage_mixin.

This strategy is clearly not going to scale well. I did make an initial attempt at having the EnsemblGffDuckDb class keep a record of the index and update it after adding every new record (see the usage of self._feature_id).

Perhaps this needs to be extended to the other key tables?
Perhaps it's not necessary to create these index columns directly and that DuckDB will do it automatically?

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

establish whether duckdb is a viable replacement to sqlite #150

establish whether duckdb is a viable replacement to sqlite #150

GavinHuttley commented Jul 30, 2024 •

edited

Loading

establish whether duckdb is a viable replacement to sqlite #150

establish whether duckdb is a viable replacement to sqlite #150

Comments

GavinHuttley commented Jul 30, 2024 • edited Loading

Some thoughts on why duckdb is slow

GavinHuttley commented Jul 30, 2024 •

edited

Loading