Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate adding option for a duckdb backend or RSQLite+parquet #12

Open
ablack3 opened this issue Apr 8, 2021 · 11 comments
Open

Investigate adding option for a duckdb backend or RSQLite+parquet #12

ablack3 opened this issue Apr 8, 2021 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

@ablack3
Copy link
Collaborator

ablack3 commented Apr 8, 2021

DuckDB is a file-based database similar to sqlite and has an R interface. It looks like it might provider performance benefits over SQLite.

https://duckdb.org/docs/api/r

Investigate performance differences between DuckDB and SQLite. Look into implementing an Andromeda DuckDB backend.

@ablack3 ablack3 self-assigned this Apr 8, 2021
@ablack3 ablack3 added the enhancement New feature or request label Apr 8, 2021
@ablack3 ablack3 changed the title Investigate adding option for a duckdb backend Investigate adding option for a duckdb backend or RSQLite+parquet Apr 13, 2021
@ablack3
Copy link
Collaborator Author

ablack3 commented Apr 13, 2021

Some links
Add parquet support to RSQLite: r-dbi/RSQLite#325
SQLite parquet virtual table: https://github.com/cldellow/sqlite-parquet-vtable
Why would this be great? https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html
Maybe duckdb would be easier and better route? https://duckdb.org/docs/api/r

Does SQLite cover all of OHDSI SQL? If not, where does it fall short? I'm wondering if duckdb would support all of OHDSI SQL if SQLite does not.

  • update - found this on the forums Sounds like the most recent version of RSQLite does support all of OHDSI SQL.

@vojtechhuser
Copy link

+1

@ablack3
Copy link
Collaborator Author

ablack3 commented Oct 14, 2021

Some simple join tests comparing duckdb and SQLite. According to the duckdb FAQ it has not been optimized for performance yet.

# Test large joins -----------------

n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e6), df1)
df2 <- data.frame(id = sample(1:1e6, 10, replace = T))

### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))

duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#>    user  system elapsed 
#>    4.92    0.05    5.01

system.time({
  DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
#>    user  system elapsed 
#>    0.40    0.01    0.45

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#>    user  system elapsed 
#>    2.06    0.05    2.19

duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))

### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))

RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#>    user  system elapsed 
#>    0.48    1.73    2.50

system.time({
  DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
#>    user  system elapsed 
#>    1.06    2.54    3.75

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#>    user  system elapsed 
#>    0.45    2.37    2.89

DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))

Created on 2021-10-13 by the reprex package (v2.0.1)

@schuemie
Copy link
Member

Thanks, this is very informative! There appears to be a large hit in performance when using DuckDB. And performance is an important aspect of Andromeda.

@chrisknoll
Copy link

So, after wrestling with SQLLite and their lack of full fledged date types, single-schema technology limitations, lack of 'real' temp tables, and more, @schuemie pointed me to duckdb and I also found this issue.

I'm trying to create tests for CohortIncidence, and it is heavy on the date functions, and SqLite is failing hard. The DatabaseConnector attempts to work around it are valiant, but probably not sustainable.

I'll be looking to use duckdb in my tests for CohortIncidence, and I'd like to know if it would be OK for me to adopt it and possibly introduce SqlRender rules for translation (although it doesn't seem like we'd need much). Performance is not as critical for me, so even if you wanted to keep Andromeda on SqLite, i'd still appreicate it if we worked to support duckdb as a DatabaseConnector option.

@ablack3
Copy link
Collaborator Author

ablack3 commented Apr 13, 2022

The performance hit is unfortunate and surprising since duckdb is supposed to be designed for OLAP and should outperform sqlite in the long run. My guess is that it is because duckdb is still new and sqlite has had a long time to mature. Indexes or use of multi-threading might help. I'm very much in favor of starting to use it in some capacity (maybe for Eunomia) just so we can have date support. I agree the date workarounds are a loosing battle long term.

Here is a related issue: OHDSI/DatabaseConnector#153
@chrisknoll - do you need any changes in Andromeda? I think I have a working Andromeda branch that uses duckdb.

@chrisknoll
Copy link

No, I just was hoping to use database connector but it depends on some attribute being present in order to use the query/execute commands. Also, I think there will be some light sqlrender dialect work...but I don't use andromeda directly for anything at the moment.

@ablack3
Copy link
Collaborator Author

ablack3 commented Nov 20, 2022

We're goin with arrow!

@ablack3 ablack3 closed this as completed Nov 20, 2022
@schuemie
Copy link
Member

schuemie commented Nov 6, 2023

FYI: I reran the tests again to see if DuckDB has improved. It has, but SQLite is still the winner. (which is odd, because I get amazing performance out of DuckDB in Python):

# Test large joins -----------------

n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e6), df1)
df2 <- data.frame(id = sample(1:1e6, 10, replace = T))

### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))

duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 1.73    0.03    0.58 

system.time({
  DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
# user  system elapsed 
# 0.19    0.01    0.17 

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 1.62    0.05    0.56 

duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))

### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))

RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 0.11    1.11    1.25 

system.time({
  DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
# user  system elapsed 
# 0.43    1.05    1.50 

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 0.21    1.00    1.20 

DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))

@ablack3
Copy link
Collaborator Author

ablack3 commented Nov 6, 2023

Good to know. Thanks Martijn. Kind of strange since I'd assume both R and python are calling the same C++ code.

@schuemie
Copy link
Member

I reran the 'benchmark', but made the data a bit larger. If you look at the 'user' category of time, SQLite is still superior, but the 'elapsed' category tells a different story. I also timed the data insertion, and there duckdb is the undisputed winner. I'll look into switching the backend to duckdb.

# Test large joins -----------------
options("andromedaTempFolder" ="d:/andromedaTemp")

n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e7), df1)
df2 <- data.frame(id = sample(1:1e7, 10, replace = T))

### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))

system.time({
  duckdb::dbWriteTable(con, "df1", df1)
  duckdb::dbWriteTable(con, "df2", df2)
})
# user  system elapsed 
# 246.33   17.24  261.17 

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 14.11    0.22    1.70 

system.time({
  DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
# user  system elapsed 
# 5.05   13.64    2.49 

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 14.48    0.06    1.64 

duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))

### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))

system.time({
RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)
})
# user  system elapsed 
# 467.03   17.60  485.41 

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 1.63    6.39    8.02 

system.time({
  DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
# user  system elapsed 
# 3.61    7.00   10.61 

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 2.04    6.25    8.30 

DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))

@schuemie schuemie reopened this May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants