To address this issue you need to be working on the add-duckdb branch. The classes to be compared are ensembl_lite._genome.EnsemblGffDb (implemented using sqlite3) versus ensembl_lite._genome.EnsemblGffDuckDb.
The method for evaluating performance is present on both classes and is called add_records.
Make a benchmarking script based on the function ensembl_lite._genome.make_annotation_db(). This function takes both the input path of the gff3.gz file and the destination to write the output.
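A minimal sketch of what the timing harness could look like. The helper name `time_build` and its `repeats` parameter are assumptions for illustration, not part of ensembl_lite. It times any callable that takes an input path and an output path, which matches the description of make_annotation_db() above. How the backend (sqlite3 or DuckDB) is selected is not shown here and would need a small wrapper per class.

```python
import time
from pathlib import Path
from typing import Callable


def time_build(
    build: Callable[[Path, Path], None],
    src: Path,
    dest: Path,
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time (seconds) over `repeats` builds.

    `build` is any callable taking (input path, output path); this helper
    is hypothetical and only measures elapsed time around that call.
    """
    best = float("inf")
    for _ in range(repeats):
        # Remove the previous output so every run builds from scratch.
        if dest.exists():
            dest.unlink()
        start = time.perf_counter()
        build(src, dest)
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the minimum over a few repeats reduces noise from disk caching and other processes. Running it once per backend on the same gff3.gz file gives the ratio to track while tuning.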
Some thoughts on why duckdb is slow
Here are a few thoughts on what I think is likely to contribute to the performance difference. The major one is that DuckDB's SQL syntax does not have an auto-increment keyword for primary keys. So to address this, my initial effort was based on using a method for grabbing the largest value in the ID column and incrementing that by one. SQL templates for these commands are in ensembl_lite._storage_mixin.
This strategy is clearly not going to scale well. I did make an initial attempt at having the EnsemblGffDuckDb class keep a record of the index and update it after adding every new record (see the usage of self._feature_id).
Perhaps this needs to be extended to the other key tables?
Perhaps it's not necessary to create these index columns directly and that DuckDB will do it automatically?
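To make the two id-assignment strategies discussed above concrete, here is a small sketch. It uses the standard library sqlite3 module only so that it runs without extra dependencies; the table name `feature`, its columns, and both function names are assumptions for illustration and are not the ensembl_lite schema or API. The first function mirrors the "largest id plus one" approach, the second keeps the counter in Python and inserts the rows as a single batch.

```python
import sqlite3


def insert_with_max_lookup(conn: sqlite3.Connection, names: list[str]) -> None:
    """Slow pattern: one extra SELECT MAX(id) query for every inserted row."""
    for name in names:
        (max_id,) = conn.execute(
            "SELECT COALESCE(MAX(id), 0) FROM feature"
        ).fetchone()
        conn.execute(
            "INSERT INTO feature (id, name) VALUES (?, ?)", (max_id + 1, name)
        )


def insert_with_counter(
    conn: sqlite3.Connection, names: list[str], start: int = 1
) -> int:
    """Faster pattern: ids are assigned in Python, rows go in as one batch.

    Returns the next unused id so the caller can keep counting.
    """
    rows = [(i, name) for i, name in enumerate(names, start)]
    conn.executemany("INSERT INTO feature (id, name) VALUES (?, ?)", rows)
    return start + len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (id INTEGER PRIMARY KEY, name TEXT)")
insert_with_max_lookup(conn, ["gene-a", "gene-b"])
next_id = insert_with_counter(conn, ["gene-c", "gene-d"], start=3)
```

The counter version issues no id queries at all, so the same idea could be applied to each key table. Another option that may be worth testing is DuckDB's documented `CREATE SEQUENCE` with `nextval()` as a column default, which lets the database assign the ids.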
DuckDB has many advantages, but my initial effort is at least 4x slower than the sqlite3 implementation at just building the database that represents genome annotation data.
The goal for this issue is to tune the database structure, the SQL syntax and / or the Python code for creating the database.
Select a single file set from http://ftp.ensembl.org/pub/release-112/gff3/homo_sapiens/Homo_sapiens.GRCh38.112.chromosome.<number or X/Y>.gff3.gz where "number or X/Y" is an integer from 1-22, X, Y. Pick a sample that exposes the performance difference but is not too big! As a guess, try the one for 16.