Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(r): Add infrastructure methods for dealing with chunked values #75

Merged
merged 10 commits into from
Nov 26, 2023

Conversation

paleolimbot
Copy link
Contributor

Not polished yet! I had intended to start on the actual vctr bit a few weeks ago but didn't get far past some very boring details.

Comment on lines 12 to 24
new_geoarrow_vctr <- function(chunks, schema) {
offsets <- .Call(geoarrow_c_vctr_chunk_offsets, chunks)
structure(
seq_len(offsets[length(offsets)]),
schema = schema,
chunks = chunks,
chunk_offsets = offsets,
class = c("geoarrow_vctr", "wk_vctr")
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we store just the array stream and wk attributes (crs, geodesic)? I think we can retrieve schema and offsets from the array stream when they're required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we need list() of nanoarrow_array here and offsets so that we can efficiently subset (see the bit I added). Once you have a stream, you don't know how many chunks or elements there are and you loose random access. I definitely may have missed something though!

For crs and geodesic I'd envisioned keeping those with the schema (i.e., modifying them via wk_set_crs() or wk_set_geodesic() would modify the schema instead of set an attribute on the vctr). Also not sure I thought that all the way through or not!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For crs and geodesic I'd envisioned keeping those with the schema (i.e., modifying them via wk_set_crs() or wk_set_geodesic() would modify the schema instead of set an attribute on the vctr). Also not sure I thought that all the way through or not!

That makes sense. I'd forgotten these are available in the schema.

I'm pretty sure we need list() of nanoarrow_array here and offsets so that we can efficiently subset (see the bit I added). Once you have a stream, you don't know how many chunks or elements there are and you loose random access. I definitely may have missed something though!

Ok.

stop("Can't resolve non-slice geoarrow_vctr to nanoarrow_array_stream")
}

chunk1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easy fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it was a bit of a rabbit hole to finish my thought here, but this was basically the place where the slicing and subsetting of the chunks had to happen.

@paleolimbot paleolimbot marked this pull request as ready for review November 18, 2023 02:58
@paleolimbot
Copy link
Contributor Author

Ok, I finished the thought that I started. Definitely let me know if this doesn't make sense...it's only ever been in my head!

r/geoarrow/R/vctr.R Outdated Show resolved Hide resolved

# Calculate first and last slice information
first_index <- slice[1] - 1L
end_index <- first_index + slice[2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is end_index always max(offsets)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be max(offsets) if the slice were 1:length(n), but most of what I'm trying to support here is something like head(x, 10) (i.e., what happens when you print).

)
}

# Calculate first and last slice information
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I suggest 1-indexing for these variables? We need 1-indexed values for slicing offsets and chunks. We need 0-indexed offset nanoarrow::nanoarrow_array_modify() calls.

I think 1-indexing simplifies this slightly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they both have their advantages...you are right that mixing them introduces a lot of confusion. I think the main motivation was to avoid too many +1 and -1s in C code, where these things are being done in a tight loop. At the R level the +1s and -1s are not (or shouldn't be!) on anything other than a scalar.


# Utilities for vctr methods

vctr_resolve_chunk <- function(x, offsets) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to support a vector x, or is a scalar sufficient?

Reason I ask: I think we can resolve chunks with a trivial R function for this for scalar x

resolve_chunk <- function(x, offsets) {
  if (x < offsets[1L] || x >= offsets[length(offsets)])
    return(NA_integer_)

  sum(x >= offsets[-length(offsets)]) - 1L
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A scalar is probably fine, although the binary search is I think important because of the log() complexity. The vectorizing was useful for testing but is probably not important for performance.

}

SEXP geoarrow_c_vctr_as_slice(SEXP indices_sexp) {
if (TYPEOF(indices_sexp) == INTSXP) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inverting this condition & early return reduces the nesting here

Suggested change
if (TYPEOF(indices_sexp) == INTSXP) {
if (TYPEOF(indices_sexp) != INTSXP) return R_NilValue;

return chunk_indices_sexp;
}

SEXP geoarrow_c_vctr_as_slice(SEXP indices_sexp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a slice the starting 1-index, and length (ignoring the 0-length special case)?

Could we simplify this logic in 2 parts?

  • check the indices vector is an integer sequence, or 0-length
  • return the 2-vector c(start_index, length)

Assuming I've not missed anything, both of these have simple and fast R implementations if the C-level implementation isn't needed for anything else.

is_sequence <- all(indices == seq.int(indices[1], indices[1] + length(indices) - 1L))
slice <- if (length(indices) == 0) c(NA_integer_, 0L) else c(indices[1], length(indices))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is again me being a little paranoid about huge input. The main thing that can happen here is that indices is 1:1e9 or something: R represents that as ALTREP. The C versions here avoids altrep expansion:

indices <- 1:1e6
.Internal(inspect(indices))
#> @12aa00770 13 INTSXP g0c0 [REF(65535)]  1 : 1000000 (compact)
is_sequence <- all(indices == seq.int(indices[1], indices[1] + length(indices) - 1L))
.Internal(inspect(indices))
#> @12aa00770 13 INTSXP g0c0 [REF(65535)]  1 : 1000000 (expanded)

indices <- 1:1e6
.Internal(inspect(indices))
#> @12af9b5d0 13 INTSXP g0c0 [REF(65535)]  1 : 1000000 (compact)
geoarrow:::vctr_as_slice(indices)
#> Error in eval(expr, envir, enclos): object 'vctr_as_slice' not found
.Internal(inspect(indices))
#> @12af9b5d0 13 INTSXP g1c0 [MARK,REF(65535)]  1 : 1000000 (compact)
indices <- 1:1e9
bench::mark(
  r = {
    is_sequence <- all(indices == seq.int(indices[1], indices[1] + length(indices) - 1L))
    slice <- if (length(indices) == 0) c(NA_integer_, 0L) else c(indices[1], length(indices))
  },
  c = geoarrow:::vctr_as_slice(indices)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 r             5.75s    5.75s     0.174   11.18GB    0.521
#> 2 c             1.28s    1.28s     0.778    1.59MB    0


#include "geoarrow.h"

SEXP geoarrow_c_vctr_chunk_offsets(SEXP array_list) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect array_list to be large? If not, should this logic be moved to R? The overhead should be tiny if array_list is small.

chunk_lengths <- vapply(chunks, function(x) x$length, integer(1L))
c(0L, cumsum(chunk_lengths))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that the list is usually small. There can be situations where the chunk size is 1024 and there are 100 million results, though. In the Arrow R package we do some occasional looping over columns and chunks and it's caused some performance problems that have been difficult to fix. I know it's a bit overkill for most situations, but I do want to minimize looping over chunks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely not thinking of 100 million length geometry vectors!

@anthonynorth
Copy link
Contributor

anthonynorth commented Nov 22, 2023

Ok, I finished the thought that I started. Definitely let me know if this doesn't make sense...it's only ever been in my head!

I finally got around to reading. Code makes sense. The indexing logic in as_nanoarrow_array_stream.geoarrow_vctr() took me a moment or 5 to understand though.

I've only found 1 possible bug (or my misunderstanding), in as_nanoarrow_array_stream.geoarrow_vctr(). I've left a suggestion for that.

I haven't read the unit tests.

@paleolimbot
Copy link
Contributor Author

Thank you for taking a look, and sorry for taking so long to circle back! I got this to a place where my next PR is unblocked (Arrow R package integration), although I do think a few of the internal could be simplified/documented/refactored to either address your comments above or ensure that the next reader has to spend less time to figure out what's going on!

@paleolimbot paleolimbot merged commit 7ae1089 into geoarrow:main Nov 26, 2023
@paleolimbot paleolimbot deleted the r-vctr-rough branch November 26, 2023 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants