[R] processing nanoarrow streams with geoarrow column #78

Closed
JosiahParry opened this issue Nov 26, 2023 · 2 comments · Fixed by #79

Comments

@JosiahParry
JosiahParry commented Nov 26, 2023

I can return a nanoarrow stream with a geoarrow geometry array. I'd like to take that stream and turn it into a tabular data structure (data.frame-esque of any variety). Is it possible, with geoarrow as it is today, to turn it into a data.frame with a geoarrow array column?

Below is a small reprex using an R package I'm trying to develop with arrow-rs and geoarrow-rs:
https://github.com/JosiahParry/serde_esri

``` r
library(serdesri)

url <- "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/ACS_Population_by_Race_and_Hispanic_Origin_Boundaries/FeatureServer/2/query?where=1=1&outFields=objectid&resultRecordCount=10&f=json"

req <- httr2::request(url)
resp <- httr2::req_perform(req)
json <- httr2::resp_body_string(resp)

stream <- parse_esri_json_str(json, 2)
stream
#> <nanoarrow_array_stream struct<OBJECTID: int64, geometry: geoarrow.polygon{large_list<rings: large_list<vertices: fixed_size_list(2)<xy: double>>>}>>
#>  $ get_schema:function ()  
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)  
#>  $ release   :function ()

df <- as.data.frame(stream)
#> Warning in warn_unregistered_extension_type(x): geometry: Converting unknown
#> extension geoarrow.polygon{large_list<rings: large_list<vertices:
#> fixed_size_list(2)<xy: double>>>} as storage type
#> Warning in warn_unregistered_extension_type(storage): geometry: Converting
#> unknown extension geoarrow.polygon{large_list<rings: large_list<vertices:
#> fixed_size_list(2)<xy: double>>>} as storage type

str(df, 1)
#> 'data.frame':    10 obs. of  2 variables:
#>  $ OBJECTID: num  1 2 3 4 5 6 7 8 9 10
#>  $ geometry: list<list<list<dbl>>> [1:10]
```
@paleolimbot
Contributor

I'm not quite there yet, but the PR that implements most of it is #75, and I'll try to smooth out the rough edges in the next few days.

Basically, it will let a list(nanoarrow_array) masquerade as a data.frame column/vctr. It will survive slicing (e.g., head(1:X)) but not rearranging (e.g., arbitrary filter/take) quite yet. That is still enough to call as.data.frame() and then choose the destination you want (e.g., sf::st_as_sfc(), or whatever).

Also, geoarrow for R doesn't quite do 64-bit offsets (i.e., the large_list bit of your output above). If there's an option to use 32-bit offsets in geoarrow-rs, that is pretty much required right now for anything geoarrow-c based. You'll also get better performance if you use separated/struct coordinates in R, because that layout more closely matches the memory layout of wk::xy() and sf::st_sfc().

paleolimbot added a commit that referenced this issue Dec 2, 2023
Closes #78.

Works, but because of a limitation in the nanoarrow R package, it can
only convert one chunk at a time:

``` r
library(arrow, warn.conflicts = FALSE)
library(geoarrow)

tmp <- tempfile()
curl::curl_download(
  "https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_junc.arrow",
  tmp
)

table <- read_feather(tmp, col_select = c("geometry"), as_data_frame = FALSE)
(stream <- nanoarrow::as_nanoarrow_array_stream(table))
#> <nanoarrow_array_stream struct<geometry: geoarrow.multipoint{list<points: struct<x: double, y: double>>}>>
#>  $ get_schema:function ()  
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)  
#>  $ release   :function ()
tibble::as_tibble(stream$get_next())
#> # A tibble: 65,536 × 1
#>    geometry                                     
#>    <grrw_vct>                                   
#>  1 <MULTIPOINT (301431.9676173 4818251.5775598)>
#>  2 <MULTIPOINT (283766.8647915 4818265.9772479)>
#>  3 <MULTIPOINT (305785.7673356 4818288.1786451)>
#>  4 <MULTIPOINT (301149.2665608 4818332.6785636)>
#>  5 <MULTIPOINT (305721.3693198 4818378.3766595)>
#>  6 <MULTIPOINT (301778.4666771 4818385.3765993)>
#>  7 <MULTIPOINT (284873.7639941 4818408.0773191)>
#>  8 <MULTIPOINT (305661.9683072 4818412.3776635)>
#>  9 <MULTIPOINT (300857.8665021 4818425.9776693)>
#> 10 <MULTIPOINT (291681.7660744 4818435.4774984)>
#> # ℹ 65,526 more rows
```
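
Given the one-chunk-at-a-time limitation described above, one possible workaround is to pull chunks from the stream in a loop and combine the converted pieces afterwards. The sketch below is only an illustration under stated assumptions: it uses a small stand-in stream of plain integer data built with `nanoarrow::basic_array_stream()` rather than a real geoarrow column, and the names `chunks`, `pieces`, and `result` are hypothetical. For a geometry column, each chunk would likely need to be converted to its final destination (e.g., with `sf::st_as_sfc()`) before combining, since the geoarrow vctr described in this thread does not yet support arbitrary rearranging.

``` r
library(nanoarrow)

# Stand-in data: two chunks of a plain (non-geometry) struct array.
# A real stream would come from a reader such as the one in the reprex.
chunks <- list(
  as_nanoarrow_array(data.frame(id = 1:3)),
  as_nanoarrow_array(data.frame(id = 4:5))
)
stream <- basic_array_stream(chunks)

# Pull one chunk at a time; get_next() returns NULL once the stream is done.
pieces <- list()
while (!is.null(chunk <- stream$get_next())) {
  pieces[[length(pieces) + 1L]] <- as.data.frame(chunk)
}
stream$release()

# Combine the per-chunk data frames into one.
result <- do.call(rbind, pieces)
```

Converting per chunk keeps memory use bounded by the chunk size during conversion, at the cost of a final combine step.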

---------

Co-authored-by: Anthony North <anthony.jl.north@gmail.com>
@JosiahParry
Author

amazing!
