
R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets #2508

Open · JavOrraca opened this issue Feb 10, 2025 · 24 comments
Labels: Type: question (Usage question)

Comments

@JavOrraca commented Feb 10, 2025

What would you like help with?

After connecting to Snowflake with R's adbcsnowflake and adbcdrivermanager (v0.16.0), I benchmarked query + download times using ADBC vs ODBC.

This could be a case of user error or a bad setup. My benchmarking (see the screenshot below) tested ADBC vs ODBC on the same observations and columns, iterating over different combinations of row and column counts.

[Image: benchmark results comparing ADBC vs ODBC query + download times across row/column combinations]

For the ODBC method, I established a connection to Snowflake with DBI and then used dplyr + dbplyr to interact with Snowflake, finally collecting the results as a tibble via dplyr::collect(). For the ADBC method, I used the connection string below and collected my results to a tibble using tibble::as_tibble() as shown on the adbcsnowflake website.

# Not shown here is the ODBC connection string using {DBI} but
# it relies on the same inputs as the ADBC inputs, below, and the
# ODBC connection was assigned to the object "conn_odbc"

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcsnowflake::adbcsnowflake(),
  username = Sys.getenv("ldap_username"),
  adbc.snowflake.sql.account = Sys.getenv("sf_account"),
  adbc.snowflake.sql.uri.protocol = "https",
  adbc.snowflake.sql.uri.host = Sys.getenv("sf_host"),
  adbc.snowflake.sql.uri.port = "443",
  adbc.snowflake.sql.auth_type = "auth_oauth",
  adbc.snowflake.sql.client_option.auth_token = Sys.getenv("sf_auth_token"),
  adbc.snowflake.sql.role = Sys.getenv("sf_role"),
  adbc.snowflake.sql.warehouse = Sys.getenv("sf_warehouse"),
  adbc.snowflake.sql.database = Sys.getenv("sf_database")
)

conn_adbc <- adbc_connection_init(db)
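For reference, the conn_odbc object mentioned in the comment above was set up roughly along these lines (an illustrative sketch only; the driver name and authentication fields are placeholders, not my exact settings):

library(DBI)
library(odbc)

# Illustrative ODBC connection sketch (placeholder driver name / auth settings)
conn_odbc <- DBI::dbConnect(
  odbc::odbc(),
  driver        = "SnowflakeDSIIDriver",   # placeholder driver name
  server        = Sys.getenv("sf_host"),
  warehouse     = Sys.getenv("sf_warehouse"),
  database      = Sys.getenv("sf_database"),
  role          = Sys.getenv("sf_role"),
  uid           = Sys.getenv("ldap_username"),
  authenticator = "oauth",                 # placeholder auth settings
  token         = Sys.getenv("sf_auth_token")
)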

# Benchmarking pseudo-code example

library(bench)
library(dplyr)
library(dbplyr)

benchmarking_odbc_vs_adbc <- bench::mark(
  min_iterations = 5,
  ODBC_1000_5_col = conn_odbc |>
    dplyr::tbl(I("CATALOG.SCHEMA.TABLE_NAME")) |>
    dplyr::select(name, age, tenure, year_yyyy, month_mm) |>
    dplyr::filter(
      year_yyyy == 2024,
      month_mm == 12
    ) |>
    head(1000) |>
    dplyr::collect(),
  ADBC_1000_5_col = conn_adbc |>
    adbcdrivermanager::read_adbc(
      "SELECT TOP 1000 name, age, tenure, year_yyyy, month_mm
       FROM CATALOG.SCHEMA.TABLE_NAME
       WHERE year_yyyy = 2024
       AND month_mm = 12"
    ) |>
    tibble::as_tibble(),
  check = FALSE
)

As shown in the screenshot above, my results show that ADBC performs better than ODBC on the smaller data sets, but on the largest data pull (1 million rows and 75 columns), ODBC performs better. I ran 5 iterations for each combination of rows and columns, and for the largest data set, ADBC takes ~57 minutes to pull the data while ODBC takes ~50 seconds. There could be a number of reasons for this (e.g., user error, driver configuration, or something on the Snowflake side), but if anything jumps out on the surface about what I should be doing differently, any pointers would be appreciated.

Thanks so much for your help and awesome work with Arrow and these drivers.

JavOrraca added the Type: question (Usage question) label on Feb 10, 2025
@lidavidm (Member)

Hey, thanks for filing this.

Just to make sure: is the SQL generated by d(b)plyr the same as the hardcoded query you have? (And so presumably it became SELECT TOP 1000000 for the slow case?)

What are the column types? We do have to do some casting and I wonder if that's slowing things down. (I don't know why ODBC wouldn't suffer from the same issue, though.)

@JavOrraca (Author)

Good catch on the ADBC SELECT TOP 1000000... dbplyr actually translates the base R head(1000000) to LIMIT 1000000, but I re-ran the benchmarking after realizing this last week and got very similar results to the above.
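For what it's worth, the SQL that dbplyr generates can be checked with dplyr::show_query(), e.g. (a sketch mirroring the benchmark pipeline above):

conn_odbc |>
  dplyr::tbl(I("CATALOG.SCHEMA.TABLE_NAME")) |>
  dplyr::select(name, age, tenure, year_yyyy, month_mm) |>
  dplyr::filter(year_yyyy == 2024, month_mm == 12) |>
  head(1000000) |>
  dplyr::show_query()
# Prints the translated SQL, which ends in LIMIT 1000000 rather than using SELECT TOP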

The widest data set I tried (75 columns) included a mix of floats, integers, strings / varchars, dates, and logicals / booleans. To be safe, I'll re-run my benchmarking this week and share the outputs here with the ADBC code revised to end with LIMIT.

@lidavidm (Member)

@zeroshade how expensive is integerToDecimal128 in https://github.com/apache/arrow-adbc/blob/main/go/adbc/driver/snowflake/record_reader.go?

I guess the next step for us is to profile something similar and see if there's anything obviously and unexpectedly expensive going on.

@zeroshade (Member)

It's potentially expensive for a large number of rows, depending on various factors. The root of it is that we can't be sure of the underlying type because of some inconsistencies in the way Snowflake handles NUMERIC columns. If you set the high precision option and don't mind receiving decimal128 data, that would eliminate the call to integerToDecimal128 and let us see whether that's the bottleneck.

@paleolimbot (Member)

Thanks for testing!

Before looking into Go it's worth separating read_adbc() (which is a thin wrapper around Go) from as_tibble() (which does the Arrow -> R conversion using nanoarrow's R package). It's possible that the Arrow -> R conversion is the issue and not the driver 🙂
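Something like this would isolate the two steps (an untested sketch, assuming the conn_adbc object from the original post and a placeholder query):

library(adbcdrivermanager)

stream <- conn_adbc |> read_adbc("SELECT ...")

# Driver/Go side only: pull all the batches through the stream
system.time(batches <- nanoarrow::collect_array_stream(stream))

# Arrow -> R conversion only: turn the collected batches into a tibble
system.time(df <- tibble::as_tibble(nanoarrow::basic_array_stream(batches)))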

@zeroshade (Member)

That would definitely be useful and helpful. While the conversion from integer to decimal128 is (relatively) expensive, it still shouldn't cause the level of slowdown being seen. So it would definitely make sense if the cause is the Arrow -> R conversion at high volumes of data.

@JavOrraca (Author)

Still digging into my issues at work and I appreciate your feedback. Below are some new updates for you:

  • Ending my read_adbc() query with LIMIT xxx didn't impact performance vs using SELECT TOP xxx
  • Using ADBC, I'm now unable to complete my profiling with R's profvis, or benchmarking with bench, on the biggest data set in my tests (1 million rows x 75 columns)
    • My RStudio session crashes with no detailed logs/notes, and this is obviously odd since last week I ran the tests in the picture above 🤷‍♂️
    • I tried the same process on several different servers / containers and experienced the same issue even after re-installing Go, adbcdrivermanager, and adbcsnowflake
  • More background about my working environment: I'm using a cloud platform developed in-house that launches a user-namespaced container running Ubuntu 22, and it's from this container that I launch R and RStudio
    • This is a mature platform at my company that we've had for 7+ years and despite some minor inconveniences, it works really well for our data analysis and modeling needs including Python, R, git, etc.
  • I checked Dynatrace and nothing sticks out in terms of exceeding allowed CPU or RAM usage, and network analysis (traffic inbound and NIC packets received) looks "normal" comparing my ODBC vs ADBC approaches

And one final question: Is there an option in adbc_database_init() or adbc_connection_init() that I can set for high precision, potentially something like the pseudo-code below? Thanks in advance... I tried Googling this and it's not clear which driver option I should tweak. 🙏

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcsnowflake::adbcsnowflake(),
  # connection string details
  high_precision = TRUE
)

conn_adbc <- adbc_connection_init(db)

@JavOrraca (Author) commented Feb 11, 2025

Okay, quick update... I had a process crash, but RStudio's built-in profiler picked up partial results. This is for the same 1M-row / 75-column data set I've been testing. The profile output showed read_adbc() being ultra fast, but 99% of the profile was spent on the as_tibble() interval (traced through as_tibble -> as_tibble.default -> as_tibble -> as.data.frame -> as.data.frame.nanoarrow_array_stream -> convert_array_stream -> collect_array_stream).
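For reference, the profiling was essentially just wrapping the ADBC pull (a rough sketch; the query is a placeholder):

library(profvis)

profvis({
  df <- conn_adbc |>
    adbcdrivermanager::read_adbc("SELECT ... LIMIT 1000000") |>  # placeholder query
    tibble::as_tibble()
})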

Hope this helps some and happy to test other approaches for you.

@zeroshade (Member)

99% of the profile was spent on the as_tibble() interval (traced through as_tibble -> as_tibble.default -> as_tibble -> as.data.frame -> as.data.frame.nanoarrow_array_stream -> convert_array_stream -> collect_array_stream)

@paleolimbot Looks like your theory was correct! Any suggestions for @JavOrraca here?

@paleolimbot (Member)

Thanks for the updates!

read_adbc() will always be fast (it returns a lazy stream)... I should have remembered that you'd need collect_array_stream() to actually pull all the batches. All of that is a very minimal wrapper around Go, so I would expect any issues there to be Go-related.

I would be interested to know if there are decimal types in the output (or anything outside integer/double/string), with apologies if you already listed them. In particular, decimal-to-double and nested type conversions are the most likely to have a bug in them.
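One quick way to check the driver-reported types without converting anything to R (a sketch, assuming the conn_adbc object from earlier and a placeholder query):

# Peek at the Arrow schema of the result; decimal columns should show up as decimal128(...)
stream <- conn_adbc |> adbcdrivermanager::read_adbc("SELECT ... LIMIT 1")  # placeholder query
stream$get_schema()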

@zeroshade (Member)

@JavOrraca the connection option for using high precision is adbc.snowflake.sql.client_option.use_high_precision. By default it is set to true, which goes through the integerToDecimal128 conversion function mentioned above and performs copies of the data, etc.

You could try setting that option to false instead and see if that has any effect.
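In R that should look roughly like this (a sketch; I believe the option can be passed the same way as the other adbc.snowflake.* options in the original post, with the value as a string):

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcsnowflake::adbcsnowflake(),
  # ...same connection options as before...
  adbc.snowflake.sql.client_option.use_high_precision = "false"
)

conn_adbc <- adbc_connection_init(db)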

In addition, if you could potentially share the dataset you're testing with (or something that can generate it), I can try to test it on my end and pin down where the performance bottleneck is.

@JavOrraca (Author)

Thanks folks for the feedback on next steps I can attempt. I'll try to continue testing this week!

@zeroshade - This is all highly regulated and HIPAA-protected member health data, so unfortunately I won't be able to share any outputs.

@zeroshade (Member)

Any possibility you could create synthetic data of the same types? In the meantime, I'll see if I can generate a 1M-row x 75-column data set that I can test with.

@zeroshade (Member)

I created a table in Snowflake consisting of 75 columns and 1M rows, where 25 columns were NUMBER(38,0), 25 were FLOAT and 25 were VARCHAR. That way it should be guaranteed to go through the integerToDecimal128 function (I verified this in the debugger).

With a pure Go main program, I was able to download the entire million rows in roughly 5 to 7 seconds (about 7x faster than ODBC for the 1M rows in your original screenshot).

Just to confirm things for myself, I also tested with Python (which goes through a C data stream just like R would). Rather than timing only the streaming of the results, I also included creating a pyarrow Table from the streamed records (i.e., materializing the entire result set in memory at once rather than grabbing one batch at a time). The corresponding code looks like:

import adbc_driver_snowflake.dbapi

with adbc_driver_snowflake.dbapi.connect("<snowflake URI>") as conn, conn.cursor() as cur:
    cur.execute('SELECT * FROM "my_table"')
    tbl = cur.fetch_arrow_table()
    print(tbl.num_rows)

And even with the added cost of materializing the entire result set, it still only takes around 20s for the Python version to run. So whatever is causing the slowdown seems to be specific to R, not the Go side. @paleolimbot any ideas?

@paleolimbot (Member)

I no longer have a Snowflake account, but the equivalent code should be something like:

library(adbcdrivermanager)

con <- adbcsnowflake::adbcsnowflake() |> adbc_database_init(<connection stuff>) |> adbc_connection_init()
stream <- con |> read_adbc('SELECT * FROM "my_table"')
batches <- nanoarrow::collect_array_stream(stream)
tibble::as_tibble(nanoarrow::basic_array_stream(batches))

@JavOrraca (Author) commented Feb 12, 2025

I created a table in Snowflake consisting of 75 columns and 1M rows, where 25 columns were NUMBER(38,0), 25 were FLOAT and 25 were VARCHAR.

@zeroshade This image is so janky 😅 but below are the exact field types:

[Image: exact Snowflake field types in the test table]

And here are the counts by field type:

[Image: counts of columns by field type]

@paleolimbot (Member)

Something else worth trying is to use read_adbc() |> arrow::as_arrow_table() |> as_tibble() (which would use a different conversion system).

Given some IPC data I'm happy to give any of this a try!

@zeroshade (Member)

Okay, so I'm not quite sure what the difference is here (I'm not an R developer at all), but there's a significant performance difference between this (essentially what you're doing in your benchmark):

df <- conn |> read_adbc('SELECT * FROM "my_table"') |> tibble::as_tibble()

and the following:

df <- conn |> read_adbc('SELECT * FROM "my_table"') |> arrow::as_arrow_table() |> tibble::as_tibble()

I did a little bit of testing: the first scenario (i.e., what you're doing) takes anywhere from 28 seconds to over a minute, while the second version (taking a trip through arrow::as_arrow_table() before going to tibble) takes just under 7 seconds.

@JavOrraca can you try adding the |> arrow::as_arrow_table() before |> tibble::as_tibble() and see what that does for your performance?
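In terms of your original bench::mark() setup, the comparison boils down to something like this (a rough sketch, not my exact code):

library(bench)
library(adbcdrivermanager)

conversion_comparison <- bench::mark(
  # nanoarrow stream converted straight to a tibble (the slow path)
  nanoarrow_to_tibble = conn |>
    read_adbc('SELECT * FROM "my_table"') |>
    tibble::as_tibble(),
  # same stream, but routed through an arrow Table first (the fast path)
  arrow_table_to_tibble = conn |>
    read_adbc('SELECT * FROM "my_table"') |>
    arrow::as_arrow_table() |>
    tibble::as_tibble(),
  check = FALSE,
  min_iterations = 3
)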

@paleolimbot I have no idea what would cause this difference, but it's completely reproducible. As far as I can tell, the driver and Go side produce all the batches quickly regardless; it's the work happening between nanoarrow and tibble that causes the slowdown, and something in the Arrow R package's implementation makes the conversion to tibble much faster.

@paleolimbot (Member)

@zeroshade Can you dump your result to IPC (maintaining the batch sizes you're getting from the driver) and send it over? (base64 as a gist or email or pigeon or whatever you have!).

@zeroshade (Member)

@paleolimbot Sent you an email with a link to download the IPC stream result. It was easiest to just stick it on Google Drive given it was 825 MB lol

@paleolimbot (Member)

Thanks! I can reproduce. I have a few other things I have to do first but should be able to get this sorted this week 🙂

@JavOrraca (Author)

@JavOrraca can you try adding the |> arrow::as_arrow_table() before |> tibble::as_tibble() and see what that does for your performance?

@zeroshade This worked! At this very moment: 29s (ADBC) vs 71s (ODBC). Huge thank you. Re-running the full set of benchmarks now.

@lidavidm (Member) commented Feb 13, 2025

Thanks @paleolimbot @zeroshade for helping out! Going from 60x slower to 2x faster is great 🎉

@JavOrraca (Author)

Seriously thank you all!
