Investigate adding option for a duckdb backend or RSQLite+parquet #12
Does SQLite cover all of OHDSI SQL? If not, where does it fall short? I'm wondering whether duckdb would support all of OHDSI SQL if SQLite does not.
+1
Some simple join tests comparing duckdb and SQLite. According to the duckdb FAQ, it has not been optimized for performance yet.
# Test large joins -----------------
n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e6), df1)
df2 <- data.frame(id = sample(1:1e6, 10, replace = T))
### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#> user system elapsed
#> 4.92 0.05 5.01
system.time({
DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
#> user system elapsed
#> 0.40 0.01 0.45
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#> user system elapsed
#> 2.06 0.05 2.19
duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))
RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#> user system elapsed
#> 0.48 1.73 2.50
system.time({
DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
#> user system elapsed
#> 1.06 2.54 3.75
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#> user system elapsed
#> 0.45 2.37 2.89
DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
Created on 2021-10-13 by the reprex package (v2.0.1)
Thanks, this is very informative! There appears to be a large performance hit when using DuckDB, and performance is an important aspect of Andromeda.
So, after wrestling with SQLite's lack of full-fledged date types, its single-schema limitation, its lack of 'real' temp tables, and more, @schuemie pointed me to duckdb and I also found this issue. I'm trying to create tests for CohortIncidence, which is heavy on date functions, and SQLite is failing hard. The DatabaseConnector attempts to work around it are valiant, but probably not sustainable. I'll be looking to use duckdb in my tests for CohortIncidence, and I'd like to know if it would be OK for me to adopt it and possibly introduce SqlRender rules for translation (although it doesn't seem like we'd need much). Performance is not as critical for me, so even if you wanted to keep Andromeda on SQLite, I'd still appreciate it if we worked to support duckdb as a DatabaseConnector option.
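A minimal sketch of the date-type gap described above (illustrative only, not from the thread; the table and column names are made up): RSQLite round-trips Date columns as plain numbers by default, whereas duckdb keeps a proper DATE column and supports date arithmetic directly in SQL.
library(DBI)
dates <- data.frame(cohort_start_date = as.Date(c("2020-01-01", "2020-06-15")))
# SQLite: Date columns are stored and returned as numbers (days since 1970-01-01)
sqliteCon <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(sqliteCon, "cohort", dates)
dbGetQuery(sqliteCon, "select cohort_start_date from cohort")
dbDisconnect(sqliteCon)
# duckdb: the same column becomes a real DATE with native date arithmetic
duckCon <- dbConnect(duckdb::duckdb())
duckdb::dbWriteTable(duckCon, "cohort", dates)
dbGetQuery(duckCon, "select cohort_start_date + interval 30 day as followup_start from cohort")
duckdb::dbDisconnect(duckCon, shutdown = TRUE)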
The performance hit is unfortunate and surprising, since duckdb is supposed to be designed for OLAP and should outperform SQLite in the long run. My guess is that duckdb is still new, while SQLite has had a long time to mature. Indexes or use of multi-threading might help. I'm very much in favor of starting to use it in some capacity (maybe for Eunomia), just so we can have date support. I agree the date workarounds are a losing battle long term. Here is a related issue: OHDSI/DatabaseConnector#153
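On the multi-threading point: duckdb exposes a threads setting, so part of the gap might simply be that the benchmark ran single-threaded. A hedged sketch of turning it up before rerunning the join (the values 4 and '4GB' are arbitrary):
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
# Allow duckdb to use more worker threads and more memory for the join
DBI::dbExecute(con, "PRAGMA threads=4")
DBI::dbExecute(con, "PRAGMA memory_limit='4GB'")
# Verify the setting took effect
DBI::dbGetQuery(con, "select current_setting('threads') as threads")
duckdb::dbDisconnect(con, shutdown = TRUE)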
No, I was just hoping to use DatabaseConnector, but it depends on some attribute being present in order to use the query/execute commands. Also, I think there will be some light SqlRender dialect work... but I don't use Andromeda directly for anything at the moment.
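For a sense of the dialect work involved (illustrative only, not from the thread): SqlRender already has a "sqlite" target dialect whose date rules rewrite functions like DATEADD into arithmetic on numeric dates; a "duckdb" dialect would presumably need far fewer such rules because duckdb has native date types.
ohdsiSql <- "SELECT DATEADD(day, 30, cohort_start_date) AS followup_start FROM @cdm.cohort;"
sql <- SqlRender::render(ohdsiSql, cdm = "main")
# The sqlite translation turns DATEADD into numeric-date arithmetic;
# a duckdb dialect could keep it as a native date expression.
SqlRender::translate(sql, targetDialect = "sqlite")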
We're going with arrow!
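For reference, a rough sketch of the arrow route (not from the thread; the path is a placeholder and df1/df2 are the data frames from the benchmark above): the data lives in parquet files on disk and is queried lazily through dplyr.
library(arrow)
library(dplyr)
# Write the covariate data to a parquet dataset on disk
parquetPath <- file.path(getOption("andromedaTempFolder"), "covariates_parquet")
write_dataset(df1, parquetPath, format = "parquet")
# Query lazily; only the matching rows are read into memory
open_dataset(parquetPath) %>%
  filter(id %in% df2$id) %>%
  collect()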
FYI: I reran the tests to see if DuckDB has improved. It has, but SQLite is still the winner (which is odd, because I get amazing performance out of DuckDB in Python):
# Test large joins -----------------
n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e6), df1)
df2 <- data.frame(id = sample(1:1e6, 10, replace = T))
### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 1.73 0.03 0.58
system.time({
DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
# user system elapsed
# 0.19 0.01 0.17
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 1.62 0.05 0.56
duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))
RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 0.11 1.11 1.25
system.time({
DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
# user system elapsed
# 0.43 1.05 1.50
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 0.21 1.00 1.20
DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
Good to know. Thanks Martijn. Kind of strange, since I'd assume both R and Python are calling the same C++ code.
I reran the 'benchmark', but made the data a bit larger. If you look at the 'user' category of time, SQLite is still superior, but the 'elapsed' category tells a different story. I also timed the data insertion, and there duckdb is the undisputed winner. I'll look into switching the backend to duckdb.
# Test large joins -----------------
options("andromedaTempFolder" ="d:/andromedaTemp")
n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e7), df1)
df2 <- data.frame(id = sample(1:1e7, 10, replace = T))
### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
system.time({
duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)
})
# user system elapsed
# 246.33 17.24 261.17
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 14.11 0.22 1.70
system.time({
DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
# user system elapsed
# 5.05 13.64 2.49
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 14.48 0.06 1.64
duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))
system.time({
RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)
})
# user system elapsed
# 467.03 17.60 485.41
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 1.63 6.39 8.02
system.time({
DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
# user system elapsed
# 3.61 7.00 10.61
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 2.04 6.25 8.30
DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite")) |
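Since insertion dominates the duckdb timings above, one option worth noting (a sketch, not from the thread; it only helps while the data frame is still in R memory, so it does not address Andromeda's out-of-memory use case): duckdb can register an R data frame as a virtual table without copying it into the database file.
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
# Register the data frames as virtual tables (no copy into the database file)
duckdb::duckdb_register(con, "df1_view", df1)
duckdb::duckdb_register(con, "df2_view", df2)
system.time({
  DBI::dbExecute(con, "create table df3 as select * from df1_view inner join df2_view on df1_view.id = df2_view.id")
})
duckdb::duckdb_unregister(con, "df1_view")
duckdb::duckdb_unregister(con, "df2_view")
duckdb::dbDisconnect(con, shutdown = TRUE)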
DuckDB is a file-based database, similar to SQLite, and has an R interface. It looks like it might provide performance benefits over SQLite.
https://duckdb.org/docs/api/r
Investigate performance differences between DuckDB and SQLite. Look into implementing an Andromeda DuckDB backend.
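Since Andromeda exposes its on-disk tables through dplyr/dbplyr, swapping the backend is largely a matter of changing the DBI driver. A minimal sketch of the idea (not an actual Andromeda API; the table and column names are made up, and dbplyr must be installed):
library(dplyr)
con <- DBI::dbConnect(duckdb::duckdb(), tempfile(fileext = ".duckdb"))
DBI::dbWriteTable(con, "covariates",
                  data.frame(rowId = 1:5,
                             covariateId = rep(c(101, 102), length.out = 5),
                             covariateValue = 1))
# The same lazy dplyr code used against the SQLite backend works here
tbl(con, "covariates") %>%
  group_by(covariateId) %>%
  summarise(n = n()) %>%
  collect()
duckdb::dbDisconnect(con, shutdown = TRUE)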