Investigate adding option for a duckdb backend or RSQLite+parquet #12
Does SQLite cover all of OHDSI SQL? If not, where does it fall short? I'm wondering whether duckdb would support all of OHDSI SQL if SQLite does not.
+1
Some simple join tests comparing duckdb and SQLite. According to the duckdb FAQ, it has not been optimized for performance yet.
# Test large joins -----------------
n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e6), df1)
df2 <- data.frame(id = sample(1:1e6, 10, replace = T))
### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#> user system elapsed
#> 4.92 0.05 5.01
system.time({
DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
#> user system elapsed
#> 0.40 0.01 0.45
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#> user system elapsed
#> 2.06 0.05 2.19
duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))
RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#> user system elapsed
#> 0.48 1.73 2.50
system.time({
DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
#> user system elapsed
#> 1.06 2.54 3.75
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#> user system elapsed
#> 0.45 2.37 2.89
DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
Created on 2021-10-13 by the reprex package (v2.0.1)
Thanks, this is very informative! There appears to be a large performance hit when using DuckDB, and performance is an important aspect of Andromeda.
So, after wrestling with SQLite's lack of full-fledged date types, its single-schema limitation, its lack of 'real' temp tables, and more, @schuemie pointed me to duckdb and I also found this issue. I'm trying to create tests for CohortIncidence, which is heavy on date functions, and SQLite is failing hard. The DatabaseConnector attempts to work around it are valiant, but probably not sustainable. I'll be looking to use duckdb in my tests for CohortIncidence, and I'd like to know if it would be OK for me to adopt it and possibly introduce SqlRender rules for translation (although it doesn't seem like we'd need much). Performance is not as critical for me, so even if you wanted to keep Andromeda on SQLite, I'd still appreciate it if we worked to support duckdb as a DatabaseConnector option.
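A minimal sketch of the date-type gap described above (illustrative only, not from the thread; the table and column names are made up): RSQLite round-trips Date columns as plain numbers by default, whereas duckdb keeps a proper DATE column and supports date arithmetic directly in SQL.
library(DBI)
dates <- data.frame(cohort_start_date = as.Date(c("2020-01-01", "2020-06-15")))
# SQLite: Date columns are stored and returned as numbers (days since 1970-01-01)
sqliteCon <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(sqliteCon, "cohort", dates)
dbGetQuery(sqliteCon, "select cohort_start_date from cohort")
dbDisconnect(sqliteCon)
# duckdb: the same column becomes a real DATE with native date arithmetic
duckCon <- dbConnect(duckdb::duckdb())
duckdb::dbWriteTable(duckCon, "cohort", dates)
dbGetQuery(duckCon, "select cohort_start_date + interval 30 day as followup_start from cohort")
duckdb::dbDisconnect(duckCon, shutdown = TRUE)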
The performance hit is unfortunate and surprising, since duckdb is supposed to be designed for OLAP and should outperform SQLite in the long run. My guess is that duckdb is still new, while SQLite has had a long time to mature. Indexes or use of multi-threading might help. I'm very much in favor of starting to use it in some capacity (maybe for Eunomia), just so we can have date support. I agree the date workarounds are a losing battle long term. Here is a related issue: OHDSI/DatabaseConnector#153
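On the multi-threading point: duckdb exposes a threads setting, so part of the gap might simply be that the benchmark ran single-threaded. A hedged sketch of turning it up before rerunning the join (the values 4 and '4GB' are arbitrary):
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
# Allow duckdb to use more worker threads and more memory for the join
DBI::dbExecute(con, "PRAGMA threads=4")
DBI::dbExecute(con, "PRAGMA memory_limit='4GB'")
# Verify the setting took effect
DBI::dbGetQuery(con, "select current_setting('threads') as threads")
duckdb::dbDisconnect(con, shutdown = TRUE)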
No, I was just hoping to use DatabaseConnector, but it depends on some attribute being present in order to use the query/execute commands. Also, I think there will be some light SqlRender dialect work... but I don't use Andromeda directly for anything at the moment.
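For a sense of the dialect work involved (illustrative only, not from the thread): SqlRender already has a "sqlite" target dialect whose date rules rewrite functions like DATEADD into arithmetic on numeric dates; a "duckdb" dialect would presumably need far fewer such rules because duckdb has native date types.
ohdsiSql <- "SELECT DATEADD(day, 30, cohort_start_date) AS followup_start FROM @cdm.cohort;"
sql <- SqlRender::render(ohdsiSql, cdm = "main")
# The sqlite translation turns DATEADD into numeric-date arithmetic;
# a duckdb dialect could keep it as a native date expression.
SqlRender::translate(sql, targetDialect = "sqlite")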
We're going with arrow!
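For reference, a rough sketch of the arrow route (not from the thread; the path is a placeholder and df1/df2 are the data frames from the benchmark above): the data lives in parquet files on disk and is queried lazily through dplyr.
library(arrow)
library(dplyr)
# Write the covariate data to a parquet dataset on disk
parquetPath <- file.path(getOption("andromedaTempFolder"), "covariates_parquet")
write_dataset(df1, parquetPath, format = "parquet")
# Query lazily; only the matching rows are read into memory
open_dataset(parquetPath) %>%
  filter(id %in% df2$id) %>%
  collect()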
FYI: I reran the tests to see if DuckDB has improved. It has, but SQLite is still the winner (which is odd, because I get amazing performance out of DuckDB in Python):
# Test large joins -----------------
n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e6), df1)
df2 <- data.frame(id = sample(1:1e6, 10, replace = T))
### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 1.73 0.03 0.58
system.time({
DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
# user system elapsed
# 0.19 0.01 0.17
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 1.62 0.05 0.56
duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))
RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 0.11 1.11 1.25
system.time({
DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
# user system elapsed
# 0.43 1.05 1.50
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 0.21 1.00 1.20
DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
Good to know. Thanks Martijn. Kind of strange, since I'd assume both R and Python are calling the same C++ code.
I reran the 'benchmark', but made the data a bit larger. If you look at the 'user' category of time, SQLite is still superior, but the 'elapsed' category tells a different story. I also timed the data insertion, and there duckdb is the undisputed winner. I'll look into switching the backend to duckdb.
# Test large joins -----------------
options("andromedaTempFolder" ="d:/andromedaTemp")
n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e7), df1)
df2 <- data.frame(id = sample(1:1e7, 10, replace = T))
### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
system.time({
duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)
})
# user system elapsed
# 246.33 17.24 261.17
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 14.11 0.22 1.70
system.time({
DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
# user system elapsed
# 5.05 13.64 2.49
system.time({
DBI::dbExecute(con, "drop table if exists df3")
DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 14.48 0.06 1.64
duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))
system.time({
RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)
})
# user system elapsed
# 467.03 17.60 485.41
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 1.63 6.39 8.02
system.time({
DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
# user system elapsed
# 3.61 7.00 10.61
system.time({
RSQLite::dbExecute(conn, "drop table if exists df3")
RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user system elapsed
# 2.04 6.25 8.30
DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite")) |
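Since insertion dominates the duckdb timings above, one option worth noting (a sketch, not from the thread; it only helps while the data frame is still in R memory, so it does not address Andromeda's out-of-memory use case): duckdb can register an R data frame as a virtual table without copying it into the database file.
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))
# Register the data frames as virtual tables (no copy into the database file)
duckdb::duckdb_register(con, "df1_view", df1)
duckdb::duckdb_register(con, "df2_view", df2)
system.time({
  DBI::dbExecute(con, "create table df3 as select * from df1_view inner join df2_view on df1_view.id = df2_view.id")
})
duckdb::duckdb_unregister(con, "df1_view")
duckdb::duckdb_unregister(con, "df2_view")
duckdb::dbDisconnect(con, shutdown = TRUE)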
DuckDB is a file-based database, similar to SQLite, and has an R interface. It looks like it might provide performance benefits over SQLite.
https://duckdb.org/docs/api/r
Investigate performance differences between DuckDB and SQLite. Look into implementing an Andromeda DuckDB backend.
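Since Andromeda exposes its on-disk tables through dplyr/dbplyr, swapping the backend is largely a matter of changing the DBI driver. A minimal sketch of the idea (not an actual Andromeda API; the table and column names are made up, and dbplyr must be installed):
library(dplyr)
con <- DBI::dbConnect(duckdb::duckdb(), tempfile(fileext = ".duckdb"))
DBI::dbWriteTable(con, "covariates",
                  data.frame(rowId = 1:5,
                             covariateId = rep(c(101, 102), length.out = 5),
                             covariateValue = 1))
# The same lazy dplyr code used against the SQLite backend works here
tbl(con, "covariates") %>%
  group_by(covariateId) %>%
  summarise(n = n()) %>%
  collect()
duckdb::dbDisconnect(con, shutdown = TRUE)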