R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets

### What would you like help with?

After connecting to Snowflake with R's v0.16.0 [`adbcsnowflake`](https://arrow.apache.org/adbc/current/r/adbcsnowflake/index.html) and [`adbcdrivermanager`](https://arrow.apache.org/adbc/current/r/adbcdrivermanager/), I benchmarked query + download times using ADBC vs ODBC.

This could be a case of user error or bad setup. My benchmarking (see screenshot below) tested ADBC vs ODBC on the same observations and columns, iterating over different combinations of row counts and columns.

![Image](https://github.com/user-attachments/assets/85fdb643-b950-404f-9d12-6acaf896fd83)

For the ODBC method, I established a connection to Snowflake with `DBI` and then used `dplyr` + `dbplyr` to interact with Snowflake, finally collecting the results as a tibble via `dplyr::collect()`. For the ADBC method, I used the connection string below and collected my results to a tibble using `tibble::as_tibble()` as shown on the [`adbcsnowflake`](https://arrow.apache.org/adbc/current/r/adbcsnowflake/index.html) website.

```{r}
# Not shown here is the ODBC connection string using {DBI} but
# it relies on the same inputs as the ADBC inputs, below, and the
# ODBC connection was assigned to the object "conn_odbc"

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcsnowflake::adbcsnowflake(),
  username = Sys.getenv("ldap_username"),
  adbc.snowflake.sql.account = Sys.getenv("sf_account"),
  adbc.snowflake.sql.uri.protocol = "https",
  adbc.snowflake.sql.uri.host = Sys.getenv("sf_host"),
  adbc.snowflake.sql.uri.port = "443",
  adbc.snowflake.sql.auth_type = "auth_oauth",
  adbc.snowflake.sql.client_option.auth_token = Sys.getenv("sf_auth_token"),
  adbc.snowflake.sql.role = Sys.getenv("sf_role"),
  adbc.snowflake.sql.warehouse = Sys.getenv("sf_warehouse"),
  adbc.snowflake.sql.database = Sys.getenv("sf_database")
)

conn_adbc <- adbc_connection_init(db)

# Benchmarking psuedo-code example

library(bench)
library(dplyr)
library(dbplyr)

benchmarking_odbc_vs_adbc <- bench::mark(
  min iterations = 5,
  ODBC_1000_5_col = conn_odbc |>
    dplyr::tbl(I("CATALOG.SCHEMA.TABLE_NAME")) |>
    dplyr::select(name, age, tenure, year_yyyy, month_mm) |>
    dplyr::filter(
      year_yyyy == 2024,
      month_mm == 12
    ) |>
    head(1000) |>
    dplyr::collect(),
  ADBC_1000_5_col = conn_adbc |>
    adbcdrivermanager::read_adbc(
      "SELECT TOP 1000 name, age, tenure, year_yyyy, month_mm
       FROM CATALOG.SCHEMA.TABLE_NAME
       WHERE year_yyyy = 2024
       AND month_mm = 12"
    ) |>
    tibble::as_tibble(),
  check = FALSE
)
```

As indicated on my screenshot above, my results show that ADBC performs better than ODBC on the smaller data sets but on the largest of data pulls (1 million rows and 75 columns), ODBC performs better. I ran 5 iterations for each combination of rows and columns and forn the largest data set, ADBC takes ~57 minutes to pull the data while ODBC takes ~50 seconds. There could be a number of reasons for this (e.g., user error, or driver config setup, or something even Snowflake side) but if anything on the surface jumps out at what I should be doing differently, any pointers would be appreciated.

Thanks so much for your help and awesome work with Arrow and these drivers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets #2508

What would you like help with?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets #2508

Description

What would you like help with?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions