[R] Error on Table Merging in arrow for R #39038

Open
TPDeramus opened this issue Dec 1, 2023 · 12 comments

Comments

@TPDeramus

Describe the bug, including details regarding any error messages, version, and platform.

Hi developers.

I'm trying to use full_join() on two tables (subsets of the same data that are filtered, transformed, and appended piece by piece to save memory), but it keeps throwing the following error:

Error: NotImplemented: Function 'coalesce' has no kernel matching input types (numeric(0)
attr(,"class")
[1] NA, numeric(0)
attr(,"class")
[1] NA)

Specifically, it looks something like the following:

library(arrow)
library(tidyverse)
library(fastDummies)

  temp <- open_csv_dataset(sources = cohort_csvs) %>% compute()
  
  Subs <- data.frame(temp %>% distinct(key) %>% collect())
  
  for (Subnum in 1:dim(Subs)[1]) {
    out <-
      data.frame(temp %>% filter(key == Subs[Subnum, ]) %>% collect())
      out[is.na(out)] <- 'NA'
      out$tags <- 'NA'
      out <-
        dummy_cols(
          out,
          select_columns = "terms",
          remove_selected_columns = FALSE,
          omit_colname_prefix = TRUE
        )
      out <-
        dummy_cols(
          out,
          select_columns = "tags",
          remove_selected_columns = FALSE,
          omit_colname_prefix = TRUE
        )
      if (Subnum == 1){
        Out_table <- arrow_table(out)
      } else {
        Out_table <- Out_table %>% full_join(out)
      }
  }

However, once the loop gets past the first iteration and reaches the full join, it throws the error regardless of how full_join() is called:

Out_table %>% full_join(out)
Error: NotImplemented: Function 'coalesce' has no kernel matching input types (numeric(0)
attr(,"class")
[1] NA, numeric(0)
attr(,"class")
[1] NA)

full_join(Out_table,arrow_table(out))
Error: NotImplemented: Function 'coalesce' has no kernel matching input types (numeric(0)
attr(,"class")
[1] NA, numeric(0)
attr(,"class")
[1] NA)

full_join(Out_table,out)
Error: NotImplemented: Function 'coalesce' has no kernel matching input types (numeric(0)
attr(,"class")
[1] NA, numeric(0)
attr(,"class")
[1] NA)

It does not, however, throw any error with left, right, inner, semi, or anti joins.

I need all columns to be retained in the join, even if they are filled with NAs.

Any idea what might be causing the issue?

Version info:
OS:
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"

R Version:
R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"

RStudio Version:
RStudio Server 2022.07.0 Build 548

Session Info:

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] lubridate_1.9.0   timechange_0.1.1  rmarkdown_2.18    here_1.0.1        fastDummies_1.7.3 arrow_12.0.1.1   
 [7] data.table_1.14.6 toolbox_0.1.0     janitor_2.2.0     forcats_0.5.2     stringr_1.4.1     dplyr_1.0.10     
[13] purrr_0.3.5       readr_2.1.3       tidyr_1.2.1       tibble_3.1.8      ggplot2_3.4.0     tidyverse_1.3.2  

loaded via a namespace (and not attached):
 [1] httr_1.4.4          vroom_1.6.0         bit64_4.0.5         jsonlite_1.8.3      viridisLite_0.4.1   splines_4.2.1      
 [7] modelr_0.1.10       assertthat_0.2.1    pander_0.6.5        renv_0.16.0         googlesheets4_1.0.1 cellranger_1.1.0   
[13] yaml_2.3.6          pillar_1.8.1        backports_1.4.1     lattice_0.20-45     glue_1.6.2          digest_0.6.30      
[19] rvest_1.0.3         snakecase_0.11.1    colorspace_2.0-3    htmltools_0.5.5     Matrix_1.5-3        survey_4.1-1       
[25] pkgconfig_2.0.3     broom_1.0.1         haven_2.5.1         scales_1.2.1        webshot_0.5.4       svglite_2.1.0      
[31] tzdb_0.3.0          googledrive_2.0.0   generics_0.1.3      tictoc_1.1          ellipsis_0.3.2      DT_0.26            
[37] withr_2.5.0         cli_3.4.1           survival_3.3-1      magrittr_2.0.3      crayon_1.5.2        readxl_1.4.1       
[43] evaluate_0.18       fs_1.5.2            fansi_1.0.3         xml2_1.3.3          tableone_0.13.2     tools_4.2.1        
[49] mitools_2.4         hms_1.1.2           gargle_1.2.1        lifecycle_1.0.3     munsell_0.5.0       reprex_2.0.2       
[55] kableExtra_1.3.4    compiler_4.2.1      systemfonts_1.0.4   rlang_1.1.2         grid_4.2.1          rstudioapi_0.14    
[61] htmlwidgets_1.6.0   gtable_0.3.1        DBI_1.1.3           R6_2.5.1            knitr_1.41          bit_4.0.5          
[67] fastmap_1.1.0       utf8_1.2.2          rprojroot_2.0.3     stringi_1.7.8       parallel_4.2.1      Rcpp_1.0.9         
[73] vctrs_0.6.4         dbplyr_2.2.1        tidyselect_1.2.0    xfun_0.35 

Component(s)

R

@amoeba
Member

amoeba commented Dec 1, 2023

Hi @TPDeramus as was asked in your StackOverflow post, a way for us to reproduce this would be best. The error message you're getting is very strange. Normally it looks like this:

NotImplemented: Function 'coalesce' has no kernel matching input types (int32, string)

If you aren't able to provide a reproducible example, adding a browser() statement before the line of code with the full_join and taking a close look at the arguments might give us some clues as to what's going on here.
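
For anyone unfamiliar with browser(), a minimal sketch of the suggestion above might look like the following. The toy data frames and the inspect_inputs() helper are hypothetical stand-ins, not part of arrow or of the original code; the idea is only to pause right before the join and look at what is about to be joined.

```r
# Toy stand-ins for the real per-key data frames (hypothetical data).
pieces <- list(
  data.frame(key = "a", terms = "x"),
  data.frame(key = "b", terms = "y")
)

# Hypothetical helper: summarise the columns both inputs share, since those
# are the columns a join with no `by` argument would use as keys.
inspect_inputs <- function(left, right) {
  shared <- intersect(names(left), names(right))
  list(
    shared = shared,
    left_classes = lapply(left[shared], class),
    right_classes = lapply(right[shared], class)
  )
}

for (i in seq_along(pieces)) {
  out <- pieces[[i]]
  if (i > 1) {
    # Pause here when running interactively: `n` steps to the next line,
    # `c` continues, `Q` quits the browser.
    if (interactive()) browser()
    info <- inspect_inputs(pieces[[1]], out)
  }
}
```

In an interactive session the prompt changes to Browse[1]>, and objects such as out, names(out) and lapply(out, class) can be printed there directly.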

@TPDeramus
Author

Apologies but I am not well versed in the implementations of browser().

And it's doubly problematic because this is not always thrown as a typical error.

Occasionally (but not always), if the result is assigned to a variable, it's saved as a list item containing the error:

> Out_table
Error: NotImplemented: Function 'coalesce' has no kernel matching input types (numeric(0)
attr(,"class")
[1] NA, numeric(0)
attr(,"class")
[1] NA)
> typeof(Out_table)
[1] "list"

But the example I have that works is also a list:

> Dummytable
Table (query)
ID: string (coalesce(ID.x, ID.y))
String: string (coalesce(String.x, String.y))
Value_A: int32
Value_F: int32
Value_G: int32
Value_K: int32
Value_B: int32
Value_C: int32
Value_L: int32
Value_H: int32
Value_M: int32
Value_D: int32
Value_I: int32
Value_N: int32 (coalesce(Value_N.x, Value_N.y))
Value_E: int32
Value_J: int32

See $.data for the source Arrow object
> typeof(Dummytable)
[1] "list"

As such, it's hard to debug via rlang::last_trace() and the like, because the result is not treated as a traceable error in the terminal. In browser(), it frequently doesn't report an error at all and simply continues or exits the session as if nothing went wrong.

However, from what I was able to gather within at least one session of browser() from a call to Out_table %>% full_join(out), this was the order of the commands:

Error: NotImplemented: Function 'coalesce' has no kernel matching input types (numeric(0)
attr(,"class")
[1] NA, numeric(0)
attr(,"class")
[1] NA)
10. compute___expr__type(self, schema)
9. .$type(old_schm)
8. .f(.x[[i]], ...)
7. map(.data$selected_columns, ~.$type(old_schm))
6. implicit_schema(x)
5. collapse.arrow_dplyr_query(x)
4. do_join(x, y, by, copy, suffix, ..., keep = keep, join_type = "FULL_OUTER")
3. full_join.arrow_dplyr_query(., out)
2. full_join(., out)
1. Out_table %>% full_join(out)

Interestingly enough, when I made the following changes to the code:

library(arrow)
library(tidyverse)
library(fastDummies)

  temp <- open_csv_dataset(sources = cohort_csvs) %>% compute()
  
  Subs <- data.frame(temp %>% distinct(key) %>% collect())
  
  for (Subnum in 1:dim(Subs)[1]) {
    out <-
      data.frame(temp %>% filter(key == Subs[Subnum, ]) %>% collect())
      out[is.na(out)] <- 'NA'
      out$tags <- 'NA'
      out <-
        dummy_cols(
          out,
          select_columns = "terms",
          remove_selected_columns = FALSE,
          omit_colname_prefix = TRUE
        )
      out <-
        dummy_cols(
          out,
          select_columns = "tags",
          remove_selected_columns = FALSE,
          omit_colname_prefix = TRUE
        )
      if (Subnum == 1){
        Out_table <- arrow_table(out)
      } else {
        #Out_table <-Out_table %>% full_join(out)
        Out_table %>% full_join(out)
      }
  }

And just didn't assign it to a variable at all, it ran just fine.

This seems to happen when Subnum hits a value of 3, giving me the impression it's not quite sure what to do with the NA values once it hits the third table to be joined.

Do you think this can be addressed with some call to fill.null or similar?
https://arrow.apache.org/docs/python/generated/pyarrow.compute.fill_null.html

@TPDeramus
Author

Okay scratch that.

The error will happen as soon as the second iteration. Probably just a typo from troubleshooting on my part.

@TPDeramus
Author

Further, the use of concat_tables()

library(arrow)
library(tidyverse)
library(fastDummies)

temp <- open_csv_dataset(sources = cohort_csvs) %>% compute()

Subs <- data.frame(temp %>% distinct(key) %>% collect())

for (Subnum in 1:dim(Subs)[1]) {
  out <-
    data.frame(temp %>% filter(key == Subs[Subnum, ]) %>% collect())
  out[is.na(out)] <- 'NA'
  out$tags <- 'NA'
  out <-
    dummy_cols(
      out,
      select_columns = "terms",
      remove_selected_columns = FALSE,
      omit_colname_prefix = TRUE
    )
  out <-
    dummy_cols(
      out,
      select_columns = "tags",
      remove_selected_columns = FALSE,
      omit_colname_prefix = TRUE
    )
  if (Subnum == 1) {
    Out_table <- arrow_table(out)
  } else {
    Out_table <- concat_tables(Out_table, arrow_table(out))
  }
}

Seems to proceed without any errors.

Though I am uncertain this will provide what I need in the long run if there are still NA values in the table (which should be swapped to 0 without pulling into memory if possible).
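
If concat_tables() gives the right shape, the remaining nulls could probably be replaced on the Arrow side rather than in R. The sketch below assumes the arrow and dplyr packages are installed, uses two made-up single-row tables with hypothetical column names (ID, Value_A, Value_B), and relies on concat_tables(unify_schemas = TRUE) filling columns that are missing from one input with nulls. It has not been run against the data in this issue.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Two hypothetical pieces with different dummy columns.
a <- arrow_table(data.frame(ID = "s1", Value_A = 1L))
b <- arrow_table(data.frame(ID = "s2", Value_B = 1L))

# Columns absent from one piece come back as nulls after unification.
combined <- concat_tables(a, b, unify_schemas = TRUE)

# Replace the nulls with 0 inside Arrow; compute() returns an Arrow Table,
# so nothing is converted to an R data frame at this step.
filled <- combined %>%
  mutate(
    Value_A = coalesce(Value_A, 0L),
    Value_B = coalesce(Value_B, 0L)
  ) %>%
  compute()
```

With many dummy columns, listing them by hand is impractical; building the mutate() call programmatically or using across() may work, depending on the arrow version.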

@amoeba
Member

amoeba commented Dec 2, 2023

I think a reprex is needed here. Even if you can't share your input files, finding a minimal sample of your data.frames that reproduces the issue and sharing them some way (attachments, dput, etc) would be good. And maybe others will chime in here with ideas for figuring out what's going on here.

@TPDeramus
Author

Yes unfortunately the data is very much PHI that can't be shared.
I'm working on a way to scrub/de-identify the original data using character substitutions in bash, but that will take a moment on my end.

I appreciate your patience on this.

@TPDeramus
Author

Had a discussion with the parties involved and the general consensus is that sharing the data is not an option and that due to the size, making sure everything is adequately removed is likely not feasible.

However, I was given the okay to set up an interactive meeting for troubleshooting if that's something anyone on the team would be open to.

@paleolimbot
Member

Hmm...it looks like our full_join() has some custom bits on top of Arrow's join that involve coalesce:

https://github.com/apache/arrow/blob/main/r/R/dplyr-join.R#L95-L112

https://github.com/apache/arrow/blob/main/r/R/dplyr-join.R#L181-L239

It seems like https://github.com/apache/arrow/blob/main/r/R/dplyr-join.R#L223 is evaluating to a vctrs_extension_type() with a very strange ptype. My guess is that either coalesce_targets has 0 rows or right_names has size != 1, and the fact that we get a vctrs_extension_type heading into coalesce is a symptom of that.

@TPDeramus
Author

Interesting.

Any way I could explore this via debugging?

@amoeba
Member

amoeba commented Dec 12, 2023

Hi @TPDeramus, to try out @paleolimbot's idea, you should be able to use debug() which will let you give us some more diagnostic information. Since you're working with sensitive data, I'll leave it to you to anonymize or censor what you need.

If you could do these steps and report back that'd be helpful:

  1. After loading packages but prior to running your for loop or whatever code calls full_join, run debug(arrow:::post_join_projection). The prompt should return with no output.
  2. Run your code that executes the troublesome full_join
  3. Instead of the normal output, your prompt should change from > to Browse[2]> and, at least in RStudio, you'll see a new editor tab open with the body of post_join_projection.
  4. In the Browse[2] prompt, execute these statements and share their output here:
    • left_names
    • right_names
    • by
    • suffix
    • data.frame(left_index = match(by, left_names), right_index = match(by, right_names))
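
As a rough illustration of the mechanics in the steps above, here is a sketch using a hypothetical stand-in function, project_demo(), instead of arrow's internal post_join_projection. Its arguments and body are assumptions made up for the example; only debug(), isdebugged(), undebug() and match() are real base R functions.

```r
# Hypothetical stand-in for an internal function that receives the column
# names of both join inputs plus the join keys.
project_demo <- function(left_names, right_names, by) {
  data.frame(
    left_index = match(by, left_names),
    right_index = match(by, right_names)
  )
}

# Flag the function: the next call to it opens the Browse[n]> prompt.
debug(project_demo)
flagged <- isdebugged(project_demo)

# Remove the flag once the diagnostics are collected, otherwise every later
# call keeps dropping into the browser.
undebug(project_demo)

# With the flag removed the call runs normally; this is the same index table
# requested in the last bullet above, built from made-up column names.
res <- project_demo(
  left_names = c("ID", "Value_A"),
  right_names = c("ID", "Value_B"),
  by = "ID"
)
```

debugonce() is an alternative to the debug()/undebug() pair when only a single call needs to be inspected.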

@paleolimbot
Member

I dug into this a little, and don't think this is a problem with arrow. One of the inputs has a somewhat strange column type of structure(numeric(), class = NA_character_), and joining on that column is not going to work:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

strange_object <- numeric()
class(strange_object) <- NA_character_

df <- data.frame(strange_object)
df |> 
  as_arrow_table() |> 
  full_join(df)
#> Error in `map_chr()` at r/R/dplyr.R:122:2:
#> ℹ In index: 1.
#> ℹ With name: strange_object.
#> Caused by error:
#> ! NotImplemented: Function 'coalesce' has no kernel matching input types (numeric(0)
#> attr(,"class")
#> [1] NA, numeric(0)
#> attr(,"class")
#> [1] NA)
#> Backtrace:
#>      ▆
#>   1. ├─base::tryCatch(...)
#>   2. │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   3. │   ├─base (local) tryCatchOne(...)
#>   4. │   │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   5. │   └─base (local) tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
#>   6. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   8. ├─base::withCallingHandlers(...)
#>   9. ├─base::saveRDS(...)
#>  10. ├─base::do.call(...)
#>  11. ├─base (local) `<fn>`(...)
#>  12. ├─global `<fn>`(input = base::quote("next-esok_reprex.R"))
#>  13. │ └─rmarkdown::render(input, quiet = TRUE, envir = globalenv(), encoding = "UTF-8")
#>  14. │   └─knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
#>  15. │     └─knitr:::process_file(text, output)
#>  16. │       ├─knitr:::handle_error(...)
#>  17. │       │ └─base::withCallingHandlers(...)
#>  18. │       ├─base::withCallingHandlers(...)
#>  19. │       ├─knitr:::process_group(group)
#>  20. │       └─knitr:::process_group.block(group)
#>  21. │         └─knitr:::call_block(x)
#>  22. │           └─knitr:::block_exec(params)
#>  23. │             └─knitr:::eng_r(options)
#>  24. │               ├─knitr:::in_input_dir(...)
#>  25. │               │ └─knitr:::in_dir(input_dir(), expr)
#>  26. │               └─knitr (local) evaluate(...)
#>  27. │                 └─evaluate::evaluate(...)
#>  28. │                   └─evaluate:::evaluate_call(...)
#>  29. │                     ├─evaluate (local) handle(...)
#>  30. │                     │ └─base::try(f, silent = TRUE)
#>  31. │                     │   └─base::tryCatch(...)
#>  32. │                     │     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  33. │                     │       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  34. │                     │         └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  35. │                     ├─base::withCallingHandlers(...)
#>  36. │                     ├─base::withVisible(value_fun(ev$value, ev$visible))
#>  37. │                     └─knitr (local) value_fun(ev$value, ev$visible)
#>  38. │                       └─knitr (local) fun(x, options = options)
#>  39. │                         ├─base::withVisible(knit_print(x, ...))
#>  40. │                         ├─knitr::knit_print(x, ...)
#>  41. │                         └─knitr:::knit_print.default(x, ...)
#>  42. │                           └─evaluate (local) normal_print(x)
#>  43. │                             ├─base::print(x)
#>  44. │                             └─arrow:::print.arrow_dplyr_query(x)
#>  45. │                               └─purrr::map_chr(...) at r/R/dplyr.R:122:2
#>  46. │                                 └─purrr:::map_("character", .x, .f, ..., .progress = .progress)
#>  47. │                                   ├─purrr:::with_indexed_errors(...)
#>  48. │                                   │ └─base::withCallingHandlers(...)
#>  49. │                                   ├─purrr:::call_with_cleanup(...)
#>  50. │                                   └─arrow (local) .f(.x[[i]], ...)
#>  51. │                                     ├─base::paste0(...) at r/R/dplyr.R:129:6
#>  52. │                                     └─expr$type(schm) at r/R/dplyr.R:129:6
#>  53. │                                       └─arrow:::compute___expr__type(self, schema) at r/R/expression.R:54:6
#>  54. └─base::.handleSimpleError(...) at r/R/arrowExports.R:1152:2
#>  55.   └─purrr (local) h(simpleError(msg, call))
#>  56.     └─cli::cli_abort(...)
#>  57.       └─rlang::abort(...)

Created on 2024-01-03 with reprex v2.0.2

You might be able to detect that column by doing something like:

strange_object <- numeric()
class(strange_object) <- NA_character_

df <- data.frame(strange_object)
vapply(lapply(df, class), function(x) any(is.na(x)), logical(1))
#> strange_object 
#>           TRUE

Created on 2024-01-03 with reprex v2.0.2

@TPDeramus
Author

I'll look into that and provide an update when I have a moment.

That said, if the object is something preventing joining, do you think there's a relatively straightforward way to edit the schema to fix this?

Thanks so much @paleolimbot!
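
One possible way to repair such a column before it reaches Arrow, sketched here in base R only. It assumes the underlying values are fine and only the class attribute is wrong, which has not been confirmed for the data in this issue; the column is recreated exactly as in the reprex above.

```r
# Recreate the odd column from the reprex: a numeric vector whose class
# attribute is NA_character_.
strange_object <- numeric()
class(strange_object) <- NA_character_
df <- data.frame(strange_object)

# Flag every column whose class vector contains NA.
bad <- vapply(lapply(df, class), function(x) any(is.na(x)), logical(1))

# Assumption: dropping the broken class attribute is enough. unclass()
# removes it and leaves a plain vector of the same values.
for (nm in names(df)[bad]) {
  df[[nm]] <- unclass(df[[nm]])
}

# The column now reports an ordinary class again.
class(df$strange_object)
```

Running the same check inside the loop, before arrow_table(out) or full_join(), would also show which step introduces the NA class in the first place.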
