R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets #2508
Comments
Hey, thanks for filing this. Just to make sure, the SQL generated by d(b)plyr is the same as the hardcoded query you have? (And so presumably it returns the same result?) What are the column types? We do have to do some casting and I wonder if that's slowing things down. (I don't know why ODBC wouldn't suffer from the same issue, though.)
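If it helps, an untested sketch like the one below (with `con` and `my_table` as placeholders) will show the SQL that dbplyr actually sends, so it can be compared against the hardcoded query used on the ADBC side:

```r
library(dplyr)

# `con` is a placeholder for the DBI/odbc connection used in the ODBC
# benchmark, and "my_table" stands in for the real table. show_query()
# prints the SQL that dbplyr generates for the lazy table.
tbl(con, "my_table") |> show_query()

# Or capture the generated SQL as a string for a direct comparison:
generated_sql <- dbplyr::remote_query(tbl(con, "my_table"))
```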
Good catch on the ADBC side. The data types in the widest data set that I tried (75 columns) included a mix of floats, integers, strings / varchars, dates, and logical / booleans. To be safe, I'll re-run my benchmarking this week and share the outputs here with the revised ADBC code.
@zeroshade how expensive is that casting? I guess the thing for us is to profile something similar and see if there's anything obviously and unexpectedly expensive going on.
It's potentially expensive for a large number of rows, depending on various factors. The root of it is that we can't be sure what type to use because of some inconsistencies in the way Snowflake handles numeric types.
Thanks for testing! Before looking into Go, it's worth separating the time spent in the driver from the time spent converting the Arrow data to R.
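One way to time the two stages separately, as an untested sketch (assuming `con` is the adbcdrivermanager connection from the benchmark and `my_table` is the table being pulled):

```r
library(adbcdrivermanager)

# Stage 1: pull all batches from the driver without converting to R vectors.
# collect_array_stream() materializes the batches as nanoarrow arrays.
stream <- con |> read_adbc('SELECT * FROM "my_table"')
t_driver <- system.time(
  batches <- nanoarrow::collect_array_stream(stream)
)

# Stage 2: convert the already-downloaded batches to a tibble.
t_convert <- system.time(
  df <- tibble::as_tibble(nanoarrow::basic_array_stream(batches))
)

t_driver
t_convert
```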
That would definitely be useful and helpful. While the conversion from integer to decimal128 is (relatively) expensive, it still shouldn't cause the level of slowdown being seen. So it would definitely make sense if the cause is the Arrow -> R conversion at high volumes of data.
Still digging into my issues at work, and I appreciate your feedback. Below are some new updates for you. And one final question: is there an option in the driver to control how high-precision numeric types are handled?
Okay, quick update... I had a process crash, but RStudio's built-in profiler picked up the partial results (shared above as a screenshot). This is for the same 1M rows / 75 col data set I've been testing. Hope this helps some, and happy to test other approaches for you.
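For reference, a profile like this can be reproduced with something along the lines of the sketch below (profvis drives the same sampling profiler that RStudIO's "Profile" button uses; `con` and `my_table` are placeholders):

```r
library(adbcdrivermanager)

# Profile the full ADBC collect: the flame graph shows how much time is
# spent streaming batches versus converting them to a tibble.
profvis::profvis({
  df <- con |>
    read_adbc('SELECT * FROM "my_table"') |>
    tibble::as_tibble()
})
```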
@paleolimbot Looks like your theory was correct! Any suggestions for @JavOrraca here?
Thanks for the updates!
I would be interested to know if there are decimal types in the output (or anything outside integer/double/string), with apologies if you already listed them. In particular, decimal-to-double and nested type conversion are the most likely to have a bug in them.
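One way to check the Arrow types the driver actually returns, without converting anything, is a sketch like this (again with `con` and `my_table` as placeholders):

```r
library(adbcdrivermanager)

# Inspect the Arrow schema of the result stream (e.g. to spot decimal128
# columns that would need decimal-to-double conversion in R).
stream <- con |> read_adbc('SELECT * FROM "my_table"')
print(stream$get_schema())

# Release the stream without consuming it.
stream$release()
```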
@JavOrraca there is a connection option for controlling whether high precision is used. You could try setting that option to FALSE instead and seeing if that has any effect? In addition, if you could potentially share the dataset you're testing with (or something that can generate it), I can attempt to test it on my end and determine where the performance bottleneck is.
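As a sketch of how a driver option can be passed from R — the option key and the level (database vs. connection) shown here are assumptions based on the ADBC Snowflake driver docs, so double-check them against the version in use:

```r
library(adbcdrivermanager)

# Options are passed as named arguments; values are strings. The key below
# is the assumed name of the Snowflake driver's high-precision toggle.
db <- adbcsnowflake::adbcsnowflake() |>
  adbc_database_init(
    username = "<username>",
    password = "<password>",
    adbc.snowflake.sql.account = "<account-identifier>"
  )

con <- adbc_connection_init(
  db,
  adbc.snowflake.sql.client_option.use_high_precision = "false"
)
```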
Thanks folks for the feedback on next steps I can attempt. I'll try to continue testing this week! @zeroshade - This is all highly regulated and HIPAA-protected member health data so unfortunately I won't be able to share any outputs.
Any possibility you could create synthetic data of the same types? In the meantime I'll see if I can generate something that is 1M rows x 75 columns that I can test with.
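A rough sketch of building synthetic data with the same mix of types described above (the column counts and names are invented for illustration, and write_adbc() from adbcdrivermanager is assumed for the upload):

```r
set.seed(1)
n <- 1e6

# Mixed-type synthetic columns roughly matching the reported schema:
# floats, integers, strings, dates, and booleans (75 columns total).
cols <- c(
  setNames(replicate(25, runif(n), simplify = FALSE), paste0("dbl_", 1:25)),
  setNames(replicate(20, sample.int(1e6, n, replace = TRUE), simplify = FALSE), paste0("int_", 1:20)),
  setNames(replicate(20, sample(LETTERS, n, replace = TRUE), simplify = FALSE), paste0("chr_", 1:20)),
  setNames(replicate(5, as.Date("2020-01-01") + sample.int(1000, n, replace = TRUE), simplify = FALSE), paste0("date_", 1:5)),
  setNames(replicate(5, sample(c(TRUE, FALSE), n, replace = TRUE), simplify = FALSE), paste0("lgl_", 1:5))
)
synthetic <- tibble::as_tibble(cols)

# Upload it through the ADBC driver so the download path can be benchmarked
# against a table whose contents are shareable.
adbcdrivermanager::write_adbc(synthetic, con, "SYNTHETIC_BENCH")
```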
I created a table in Snowflake consisting of 75 columns and 1M rows, with a mix of column types. With a pure Go mainprog, I was able to download the entire million rows in roughly 5 to 7 seconds (about 7x faster than ODBC in your original screenshot for the 1M rows).

Just to confirm things for myself, I also tested using Python (which goes through a C stream of data just like R would). Rather than timing just the streaming of the results, I also included creating a pyarrow table from the streamed records (i.e., materializing the entire result set in memory at once rather than just grabbing one batch at a time). The corresponding code looks like:

```python
import adbc_driver_snowflake.dbapi

with adbc_driver_snowflake.dbapi.connect("<snowflake URI>") as conn, conn.cursor() as cur:
    cur.execute('SELECT * FROM "my_table"')
    tbl = cur.fetch_arrow_table()
    print(tbl.num_rows)
```

And even with the added cost of materializing the entire result set, it still only takes around 20s for the Python to run. So whatever is causing it to take so long seems to be specific to R, and not to the Go side. @paleolimbot any ideas?
I no longer have a Snowflake account, but the equivalent code should be something like:

```r
library(adbcdrivermanager)

con <- adbcsnowflake::adbcsnowflake() |>
  adbc_database_init(<connection stuff>) |>
  adbc_connection_init()

stream <- con |> read_adbc('SELECT * FROM "my_table"')
batches <- nanoarrow::collect_array_stream(stream)
tibble::as_tibble(nanoarrow::basic_array_stream(batches))
```
@zeroshade This image is so janky 😅 but below are the exact field types:

[screenshot: field types]

And here are the counts by field type:

[screenshot: counts by field type]
Something else worth trying is to route the conversion through the arrow package before going to a tibble. Given some IPC data I'm happy to give any of this a try!
Okay, so I'm not quite sure what the difference here is since I'm not an R developer at all, but there's a significant performance difference between this (essentially what you are doing in your benchmark):

```r
df <- conn |>
  read_adbc('SELECT * FROM "my_table"') |>
  tibble::as_tibble()
```

and the following:

```r
df <- conn |>
  read_adbc('SELECT * FROM "my_table"') |>
  arrow::as_arrow_table() |>
  tibble::as_tibble()
```

I did a little bit of testing: the first scenario (i.e. what you're doing) takes between 28 seconds and over a minute, while the second version (taking a trip through arrow::as_arrow_table before going to tibble) takes just under 7 seconds.

@JavOrraca can you try adding the arrow::as_arrow_table() step to your pipeline and seeing if it helps?

@paleolimbot I have no idea what would cause this difference, but it's completely reproducible. As far as I can tell, the driver and Golang side is producing all the batches super quickly regardless; it's only whatever work is happening between nanoarrow and tibble that's causing the slowdown, and something in the Arrow R package's implementation makes the conversion to tibble much faster.
@zeroshade Can you dump your result to IPC (maintaining the batch sizes you're getting from the driver) and send it over? (base64 as a gist or email or pigeon or whatever you have!)
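For reference, one way to dump such a result to an IPC stream file from R, as an untested sketch (this goes through an Arrow Table first, so the original batch sizes may be rechunked; `con` and `my_table` are placeholders):

```r
library(adbcdrivermanager)

# Materialize the ADBC result as an Arrow Table, then serialize it to an
# Arrow IPC stream file that can be shared and re-read elsewhere.
tbl <- con |>
  read_adbc('SELECT * FROM "my_table"') |>
  arrow::as_arrow_table()

arrow::write_ipc_stream(tbl, "result.arrows")

# To read it back later without converting to a data frame:
# tbl2 <- arrow::read_ipc_stream("result.arrows", as_data_frame = FALSE)
```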
Sent you an email with a link @paleolimbot to download the IPC stream result. Was easiest to just stick it on google drive given it was 825MB lol
Thanks! I can reproduce. I have a few other things I have to do first but should be able to get this sorted this week 🙂
@zeroshade This worked! At this very moment, 29s (ADBC) vs 71s (ODBC). Huge thank you. Rerunning my entire benchmarking suite now.
Thanks @paleolimbot @zeroshade for helping out! Going from 60x slower to 2x faster is great 🎉
Seriously, thank you all!
What would you like help with?
After connecting to Snowflake with R's v0.16.0 adbcsnowflake and adbcdrivermanager, I benchmarked query + download times using ADBC vs ODBC. This could be a case of user error or bad setup. My benchmarking (see screenshot below) tested ADBC vs ODBC on the same observations and columns, iterating over different combinations of row counts and columns.

For the ODBC method, I established a connection to Snowflake with DBI and then used dplyr + dbplyr to interact with Snowflake, finally collecting the results as a tibble via dplyr::collect(). For the ADBC method, I used the connection string below and collected my results to a tibble using tibble::as_tibble(), as shown on the adbcsnowflake website.

As indicated on my screenshot above, my results show that ADBC performs better than ODBC on the smaller data sets, but on the largest data pull (1 million rows and 75 columns), ODBC performs better. I ran 5 iterations for each combination of rows and columns, and for the largest data set, ADBC takes ~57 minutes to pull the data while ODBC takes ~50 seconds. There could be a number of reasons for this (e.g., user error, driver config setup, or something on the Snowflake side), but if anything on the surface jumps out about what I should be doing differently, any pointers would be appreciated.
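For context, a rough sketch of the two pipelines being compared (connection details and table names are placeholders; the commented-out line reflects the arrow::as_arrow_table() workaround discussed in the comments above):

```r
# ODBC path: DBI + dplyr/dbplyr, collected with dplyr::collect().
library(DBI)
con_odbc <- dbConnect(
  odbc::odbc(),
  driver    = "<Snowflake ODBC driver>",
  server    = "<account>.snowflakecomputing.com",
  uid       = "<username>",
  pwd       = "<password>",
  database  = "<db>",
  warehouse = "<wh>"
)
odbc_result <- dplyr::tbl(con_odbc, "MY_TABLE") |> dplyr::collect()

# ADBC path: adbcdrivermanager + adbcsnowflake, collected as a tibble.
library(adbcdrivermanager)
con_adbc <- adbcsnowflake::adbcsnowflake() |>
  adbc_database_init(
    username = "<username>",
    password = "<password>",
    adbc.snowflake.sql.account = "<account-identifier>"
  ) |>
  adbc_connection_init()

adbc_result <- con_adbc |>
  read_adbc('SELECT * FROM "MY_TABLE"') |>
  # arrow::as_arrow_table() |>   # workaround that restored ADBC performance
  tibble::as_tibble()
```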
Thanks so much for your help and awesome work with Arrow and these drivers.