-
Notifications
You must be signed in to change notification settings - Fork 159
Description
What would you like help with?
After connecting to Snowflake with R's v0.16.0 adbcsnowflake
and adbcdrivermanager
, I benchmarked query + download times using ADBC vs ODBC.
This could be a case of user error or bad setup. My benchmarking (see screenshot below) tested ADBC vs ODBC on the same observations and columns, iterating over different combinations of row counts and columns.
For the ODBC method, I established a connection to Snowflake with DBI
and then used dplyr
+ dbplyr
to interact with Snowflake, finally collecting the results as a tibble via dplyr::collect()
. For the ADBC method, I used the connection string below and collected my results to a tibble using tibble::as_tibble()
as shown on the adbcsnowflake
website.
# Not shown here is the ODBC connection string using {DBI} but
# it relies on the same inputs as the ADBC inputs, below, and the
# ODBC connection was assigned to the object "conn_odbc"
library(adbcdrivermanager)
db <- adbc_database_init(
adbcsnowflake::adbcsnowflake(),
username = Sys.getenv("ldap_username"),
adbc.snowflake.sql.account = Sys.getenv("sf_account"),
adbc.snowflake.sql.uri.protocol = "https",
adbc.snowflake.sql.uri.host = Sys.getenv("sf_host"),
adbc.snowflake.sql.uri.port = "443",
adbc.snowflake.sql.auth_type = "auth_oauth",
adbc.snowflake.sql.client_option.auth_token = Sys.getenv("sf_auth_token"),
adbc.snowflake.sql.role = Sys.getenv("sf_role"),
adbc.snowflake.sql.warehouse = Sys.getenv("sf_warehouse"),
adbc.snowflake.sql.database = Sys.getenv("sf_database")
)
conn_adbc <- adbc_connection_init(db)
# Benchmarking psuedo-code example
library(bench)
library(dplyr)
library(dbplyr)
benchmarking_odbc_vs_adbc <- bench::mark(
min iterations = 5,
ODBC_1000_5_col = conn_odbc |>
dplyr::tbl(I("CATALOG.SCHEMA.TABLE_NAME")) |>
dplyr::select(name, age, tenure, year_yyyy, month_mm) |>
dplyr::filter(
year_yyyy == 2024,
month_mm == 12
) |>
head(1000) |>
dplyr::collect(),
ADBC_1000_5_col = conn_adbc |>
adbcdrivermanager::read_adbc(
"SELECT TOP 1000 name, age, tenure, year_yyyy, month_mm
FROM CATALOG.SCHEMA.TABLE_NAME
WHERE year_yyyy = 2024
AND month_mm = 12"
) |>
tibble::as_tibble(),
check = FALSE
)
As indicated on my screenshot above, my results show that ADBC performs better than ODBC on the smaller data sets but on the largest of data pulls (1 million rows and 75 columns), ODBC performs better. I ran 5 iterations for each combination of rows and columns and forn the largest data set, ADBC takes ~57 minutes to pull the data while ODBC takes ~50 seconds. There could be a number of reasons for this (e.g., user error, or driver config setup, or something even Snowflake side) but if anything on the surface jumps out at what I should be doing differently, any pointers would be appreciated.
Thanks so much for your help and awesome work with Arrow and these drivers.