feat(r): Add infrastructure methods for dealing with chunked values #75
Conversation
r/geoarrow/R/vctr.R (Outdated)
new_geoarrow_vctr <- function(chunks, schema) {
  offsets <- .Call(geoarrow_c_vctr_chunk_offsets, chunks)
  structure(
    seq_len(offsets[length(offsets)]),
    schema = schema,
    chunks = chunks,
    chunk_offsets = offsets,
    class = c("geoarrow_vctr", "wk_vctr")
  )
}
Could we store just the array stream and wk attributes (crs, geodesic)? I think we can retrieve schema and offsets from the array stream when they're required?
I'm pretty sure we need a list() of nanoarrow_array here, plus offsets, so that we can efficiently subset (see the bit I added). Once you have a stream, you don't know how many chunks or elements there are and you lose random access. I definitely may have missed something though!
For crs and geodesic I'd envisioned keeping those with the schema (i.e., modifying them via wk_set_crs() or wk_set_geodesic() would modify the schema instead of setting an attribute on the vctr). Also not sure whether I thought that all the way through!
> For crs and geodesic I'd envisioned keeping those with the schema (i.e., modifying them via wk_set_crs() or wk_set_geodesic() would modify the schema instead of setting an attribute on the vctr). Also not sure whether I thought that all the way through!
That makes sense. I'd forgotten these are available in the schema.
> I'm pretty sure we need a list() of nanoarrow_array here, plus offsets, so that we can efficiently subset (see the bit I added). Once you have a stream, you don't know how many chunks or elements there are and you lose random access. I definitely may have missed something though!
Ok.
r/geoarrow/R/vctr.R (Outdated)
  stop("Can't resolve non-slice geoarrow_vctr to nanoarrow_array_stream")
}

chunk1
Easy fix
Ok, it was a bit of a rabbit hole to finish my thought here, but this was basically the place where the slicing and subsetting of the chunks had to happen.

Ok, I finished the thought that I started. Definitely let me know if this doesn't make sense...it's only ever been in my head!
# Calculate first and last slice information
first_index <- slice[1] - 1L
end_index <- first_index + slice[2]
Is end_index always max(offsets)?
It would be max(offsets) if the slice were 1:length(x), but most of what I'm trying to support here is something like head(x, 10) (i.e., what happens when you print).
)
}

# Calculate first and last slice information
Can I suggest 1-indexing for these variables? We need 1-indexed values for slicing offsets and chunks; we need 0-indexed offsets for nanoarrow::nanoarrow_array_modify() calls.
I think 1-indexing simplifies this slightly.
I think they both have their advantages...you are right that mixing them introduces a lot of confusion. I think the main motivation was to avoid too many +1s and -1s in C code, where these things are being done in a tight loop. At the R level, the +1s and -1s are not (or shouldn't be!) applied to anything other than a scalar.
# Utilities for vctr methods

vctr_resolve_chunk <- function(x, offsets) {
Do we need to support a vector x, or is a scalar sufficient?

Reason I ask: I think we can resolve chunks with a trivial R function for scalar x:

resolve_chunk <- function(x, offsets) {
  if (x < offsets[1L] || x >= offsets[length(offsets)])
    return(NA_integer_)
  sum(x >= offsets[-length(offsets)]) - 1L
}
A scalar is probably fine, although I think the binary search is important because of the log(n) complexity. The vectorization was useful for testing but is probably not important for performance.
r/geoarrow/src/r-vctr.c (Outdated)
}

SEXP geoarrow_c_vctr_as_slice(SEXP indices_sexp) {
  if (TYPEOF(indices_sexp) == INTSXP) {
Inverting this condition and returning early reduces the nesting here:

- if (TYPEOF(indices_sexp) == INTSXP) {
+ if (TYPEOF(indices_sexp) != INTSXP) return R_NilValue;
  return chunk_indices_sexp;
}

SEXP geoarrow_c_vctr_as_slice(SEXP indices_sexp) {
Is a slice the starting 1-index and length (ignoring the 0-length special case)? Could we simplify this logic into two parts?

- check the indices vector is an integer sequence, or 0-length
- return the 2-vector c(start_index, length)

Assuming I've not missed anything, both of these have simple and fast R implementations if the C-level implementation isn't needed for anything else:

is_sequence <- all(indices == seq.int(indices[1], indices[1] + length(indices) - 1L))
slice <- if (length(indices) == 0) c(NA_integer_, 0L) else c(indices[1], length(indices))
This is again me being a little paranoid about huge input. The main thing that can happen here is that indices is 1:1e9 or something: R represents that as ALTREP. The C version here avoids ALTREP expansion:
indices <- 1:1e6
.Internal(inspect(indices))
#> @12aa00770 13 INTSXP g0c0 [REF(65535)] 1 : 1000000 (compact)
is_sequence <- all(indices == seq.int(indices[1], indices[1] + length(indices) - 1L))
.Internal(inspect(indices))
#> @12aa00770 13 INTSXP g0c0 [REF(65535)] 1 : 1000000 (expanded)
indices <- 1:1e6
.Internal(inspect(indices))
#> @12af9b5d0 13 INTSXP g0c0 [REF(65535)] 1 : 1000000 (compact)
geoarrow:::vctr_as_slice(indices)
#> Error in eval(expr, envir, enclos): object 'vctr_as_slice' not found
.Internal(inspect(indices))
#> @12af9b5d0 13 INTSXP g1c0 [MARK,REF(65535)] 1 : 1000000 (compact)
indices <- 1:1e9
bench::mark(
r = {
is_sequence <- all(indices == seq.int(indices[1], indices[1] + length(indices) - 1L))
slice <- if (length(indices) == 0) c(NA_integer_, 0L) else c(indices[1], length(indices))
},
c = geoarrow:::vctr_as_slice(indices)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 r 5.75s 5.75s 0.174 11.18GB 0.521
#> 2 c 1.28s 1.28s 0.778 1.59MB 0
#include "geoarrow.h"

SEXP geoarrow_c_vctr_chunk_offsets(SEXP array_list) {
Do we expect array_list to be large? If not, should this logic be moved to R? The overhead should be tiny if array_list is small.

chunk_lengths <- vapply(chunks, function(x) x$length, integer(1L))
c(0L, cumsum(chunk_lengths))
You're right that the list is usually small. There can be situations where the chunk size is 1024 and there are 100 million results, though. In the Arrow R package we do some occasional looping over columns and chunks and it's caused some performance problems that have been difficult to fix. I know it's a bit overkill for most situations, but I do want to minimize looping over chunks.
I'm definitely not thinking of 100 million length geometry vectors!
I finally got around to reading. The code makes sense. I've only found one possible bug (or my misunderstanding). I haven't read the unit tests.
94f05c9 to 80cb3bd
Thank you for taking a look, and sorry for taking so long to circle back! I got this to a place where my next PR is unblocked (Arrow R package integration), although I do think a few of the internals could be simplified/documented/refactored to either address your comments above or ensure that the next reader spends less time figuring out what's going on!
Not polished yet! I had intended to start on the actual vctr bit a few weeks ago but didn't get far past some very boring details.