Skip to content

R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets #2508

@JavOrraca

Description

@JavOrraca

What would you like help with?

After connecting to Snowflake with R's v0.16.0 adbcsnowflake and adbcdrivermanager, I benchmarked query + download times using ADBC vs ODBC.

This could be a case of user error or bad setup. My benchmarking (see screenshot below) tested ADBC vs ODBC on the same observations and columns, iterating over different combinations of row counts and columns.

Image

For the ODBC method, I established a connection to Snowflake with DBI and then used dplyr + dbplyr to interact with Snowflake, finally collecting the results as a tibble via dplyr::collect(). For the ADBC method, I used the connection string below and collected my results to a tibble using tibble::as_tibble() as shown on the adbcsnowflake website.

# Not shown here is the ODBC connection string using {DBI} but
# it relies on the same inputs as the ADBC inputs, below, and the
# ODBC connection was assigned to the object "conn_odbc"

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcsnowflake::adbcsnowflake(),
  username = Sys.getenv("ldap_username"),
  adbc.snowflake.sql.account = Sys.getenv("sf_account"),
  adbc.snowflake.sql.uri.protocol = "https",
  adbc.snowflake.sql.uri.host = Sys.getenv("sf_host"),
  adbc.snowflake.sql.uri.port = "443",
  adbc.snowflake.sql.auth_type = "auth_oauth",
  adbc.snowflake.sql.client_option.auth_token = Sys.getenv("sf_auth_token"),
  adbc.snowflake.sql.role = Sys.getenv("sf_role"),
  adbc.snowflake.sql.warehouse = Sys.getenv("sf_warehouse"),
  adbc.snowflake.sql.database = Sys.getenv("sf_database")
)

conn_adbc <- adbc_connection_init(db)

# Benchmarking psuedo-code example

library(bench)
library(dplyr)
library(dbplyr)

benchmarking_odbc_vs_adbc <- bench::mark(
  min iterations = 5,
  ODBC_1000_5_col = conn_odbc |>
    dplyr::tbl(I("CATALOG.SCHEMA.TABLE_NAME")) |>
    dplyr::select(name, age, tenure, year_yyyy, month_mm) |>
    dplyr::filter(
      year_yyyy == 2024,
      month_mm == 12
    ) |>
    head(1000) |>
    dplyr::collect(),
  ADBC_1000_5_col = conn_adbc |>
    adbcdrivermanager::read_adbc(
      "SELECT TOP 1000 name, age, tenure, year_yyyy, month_mm
       FROM CATALOG.SCHEMA.TABLE_NAME
       WHERE year_yyyy = 2024
       AND month_mm = 12"
    ) |>
    tibble::as_tibble(),
  check = FALSE
)

As indicated on my screenshot above, my results show that ADBC performs better than ODBC on the smaller data sets but on the largest of data pulls (1 million rows and 75 columns), ODBC performs better. I ran 5 iterations for each combination of rows and columns and forn the largest data set, ADBC takes ~57 minutes to pull the data while ODBC takes ~50 seconds. There could be a number of reasons for this (e.g., user error, or driver config setup, or something even Snowflake side) but if anything on the surface jumps out at what I should be doing differently, any pointers would be appreciated.

Thanks so much for your help and awesome work with Arrow and these drivers.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions