
Align results with those of labs/benchmarks and the Conbench API #116

Merged — 12 commits, Dec 6, 2022
6 changes: 5 additions & 1 deletion .github/workflows/R-CMD-check.yaml
@@ -1,4 +1,8 @@
-on: [push, pull_request]
+on:
+  push:
+    branches:
+      - main
+  pull_request:
Contributor

Any reason for only running on push to main? I definitely from time to time appreciate having CI on branches even before they get to the PR stage

Contributor Author

We were running duplicate builds of everything on PRs because they're both branches and pull requests, so I copied this from one of our other repos. We could drop lines 4-5 (I think?) if we want builds on all pushes to branches

Contributor

> We were running duplicate builds of everything on PRs because they're both branches and pull requests

Could you point me to an example of this?

Contributor Author

On your PR, if you expand the builds, there are two versions of each, one appended with (pull_request), the other with (push):

[screenshot: each check appears twice, once suffixed (pull_request) and once (push)]

Contributor
@jonkeane (Dec 2, 2022)

IIUC, those are testing two different things, though (check out the details of the checkouts). The push is the HEAD of the branch; the pull_request is the merge of the HEAD of the branch + main. For that specific PR they end up being identical code-wise (since I rebased the branch on main a few moments ago), but they can be, and frequently are, quite different

push

Checking out the ref
  /usr/local/bin/git checkout --progress --force -B duckdb-ext refs/remotes/origin/duckdb-ext
  Switched to a new branch 'duckdb-ext'
  branch 'duckdb-ext' set up to track 'origin/duckdb-ext'.
/usr/local/bin/git log -1 --format='%H'
'14f9a4c5797ef2529328c050e512e18010737531'

https://github.com/voltrondata-labs/arrowbench/actions/runs/3605143011/jobs/6075232144#step:2:455

pull_request

Checking out the ref
  /usr/local/bin/git checkout --progress --force refs/remotes/pull/119/merge
  HEAD is now at 1dfb8c9 Merge 14f9a4c5797ef2529328c050e512e18010737531 into ada600dea8d6383b802be893b9e26b6b6b38792a
/usr/local/bin/git log -1 --format='%H'
'1dfb8c955aeb476fdc7e6dd055db57cacaa3f2b7'

https://github.com/voltrondata-labs/arrowbench/actions/runs/3605143166/jobs/6075232423#step:2:459

Contributor Author

Hm, interesting. Though it seems like pull_request is what I'd expect everywhere except on the default branch. But I can revert.


name: R-CMD-check

12 changes: 7 additions & 5 deletions DESCRIPTION
@@ -1,10 +1,11 @@
 Package: arrowbench
 Type: Package
 Title: Tools for Continuous and Interactive Benchmarking
-Version: 0.1.0
+Version: 0.2.0
 Authors@R: c(
   person("Neal", "Richardson", email = "neal.p.richardson@gmail.com", role = "aut"),
-  person("Jonathan", "Keane", email = "jkeane@gmail.com", role = c("aut", "cre"))
+  person("Jonathan", "Keane", email = "jkeane@gmail.com", role = c("aut", "cre")),
+  person("Edward", "Visel", email = "edward.visel@gmail.com", role = "aut", comment = c(ORCID = "0000-0002-2811-6254"))
Contributor

🎉

)
Description: Tools for defining benchmarks, running them across a
range of parameters, and reporting their results in a standardized form.
@@ -14,6 +15,7 @@ Depends: R (>= 3.5.0)
 Imports:
   arrow,
   bench,
+  callr,
   dbplyr,
   dplyr,
   distro,
@@ -26,9 +28,9 @@ Imports:
   R.utils,
   sessioninfo,
   utils,
+  uuid,
   waldo,
-  withr,
-  callr
+  withr
Suggests:
testthat (>= 3.0.0),
archive,
@@ -43,7 +45,7 @@ Suggests:
   RcppSimdJson,
   readr,
   vroom
-RoxygenNote: 7.1.2
+RoxygenNote: 7.2.2
Roxygen: list(markdown = TRUE, load = "source")
Collate:
'benchmark.R'
16 changes: 13 additions & 3 deletions R/benchmark.R
@@ -86,6 +86,12 @@
 #' single case and returning an integer version for the case, or `NULL` to not
 #' append a version; `NA` will raise an error. Changes to version will break
 #' conbench history for a case.
+#' @param batch_id_fun A unary function which takes a dataframe of parameters and
+#' returns a character vector to use as `batch_id` of length 1 or `nrow(params)`
+#' @param tags_fun A unary function which takes a named list of setup parameters
+#' for a single case and returns a named list of tags to send to Conbench for that
+#' case. Can be overwritten to add static tags or postprocess parameters into more
+#' readable forms.
 #' @param packages_used function taking a `data.frame` of setup parameters and
 #' returning a vector of R package names required
 #' @param ... additional attributes or functions, possibly called in `setup()`.
@@ -100,6 +106,8 @@ Benchmark <- function(name,
   teardown = TRUE,
   valid_params = function(params) params,
   case_version = function(params) NULL,
+  batch_id_fun = function(params) uuid(),
+  tags_fun = function(params) params,
   packages_used = function(params) "arrow",
   ...) {
stopifnot(is.character(name))
@@ -113,6 +121,8 @@
   teardown = substitute(teardown),
   valid_params = valid_params,
   case_version = case_version,
+  batch_id_fun = batch_id_fun,
+  tags_fun = tags_fun,
   packages_used = packages_used,
   ...),
   class = "Benchmark"
@@ -164,12 +174,12 @@ default_params <- function(bm, ...) {
 #' error status
 #'
 #' @param run An instance of `BenchmarkResults` as returned by `run_benchmark`
-#' or `BenchmarkResult` or `BenchmarkFailure` as returned by `run_one` and `run_bm`
+#' or `BenchmarkResult` as returned by `run_one` and `run_bm`
 #' @return a tibble
 #' @export
 get_params_summary <- function(run) {
-  if (!inherits(run, c("BenchmarkResults", "BenchmarkResult", "BenchmarkFailure"))) {
-    stop("run objects need to be of class BenchmarkResults, BenchmarkResult, or BenchmarkFailure")
+  if (!inherits(run, c("BenchmarkResults", "BenchmarkResult"))) {
+    stop("run objects need to be of class BenchmarkResults or BenchmarkResult")
   }
   run$params_summary
 }
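To make the two new arguments concrete, here is a standalone sketch of how their defaults behave (not from the PR itself; the parameter values are hypothetical, and `uuid` is stubbed with a fixed string where arrowbench's real helper returns a random UUID):

```r
# Stub of arrowbench's uuid helper, fixed so the sketch is deterministic
uuid <- function() "0b8f2e6a"

# The new defaults: one shared batch_id per run, params passed through as tags
batch_id_fun <- function(params) uuid()
tags_fun <- function(params) params

params <- list(source = "fanniemae_2016Q4", compression = "snappy")
batch_id_fun(params)  # "0b8f2e6a"
tags_fun(params)      # identical to params
```

Benchmarks that need grouped batches or prettier Conbench tags override these per benchmark, as the tpch changes below do.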
2 changes: 1 addition & 1 deletion R/bm-dataset-taxi-parquet.R
@@ -4,7 +4,7 @@
 #' * `query` Name of a known query to run; see `dataset_taxi_parquet$cases`
 #'
 #' @export
-dataset_taxi_parquet <- Benchmark("dataset_taxi_parquet",
+dataset_taxi_parquet <- Benchmark("partitioned-dataset-filter",
   setup = function(query = names(dataset_taxi_parquet$cases)) {
     library("dplyr", warn.conflicts = FALSE)
     dataset <- ensure_dataset("taxi_parquet")
9 changes: 7 additions & 2 deletions R/bm-df-to-table.R
@@ -6,8 +6,13 @@
 #' * `source` A known-file id to use (it will be read in to a data.frame first)
 #'
 #' @export
-df_to_table <- Benchmark("df_to_table",
-  setup = function(source = names(known_sources)) {
+df_to_table <- Benchmark("dataframe-to-table",
+  setup = function(source = c("chi_traffic_2020_Q1",
+                              "type_strings",
+                              "type_dict",
+                              "type_integers",
+                              "type_floats",
+                              "type_nested")) {
source <- ensure_source(source)
result_dim <- get_source_attr(source, "dim")
# Make sure that we're not (accidentally) creating altrep vectors which will
58 changes: 29 additions & 29 deletions R/bm-read-file.R
@@ -2,35 +2,35 @@
 #'
 #' @section Parameters:
 #' * `source` A known-file id, or a CSV(?) file path to read in
-#' * `format` One of `c("parquet", "feather", "fst")`
+#' * `file_type` One of `c("parquet", "feather", "fst")`
 #' * `compression` One of the values: `r paste(known_compressions, collapse = ", ")`
-#' * `output` One of `c("arrow_table", "data_frame")`
+#' * `output_type` One of `c("arrow_table", "data_frame")`
Comment on lines -5 to +7

Contributor

I'm fine changing these here if it's easiest to go this direction, though I kinda wish we went the other way (but understand migrating out of that would be a pain in the ass...)

Contributor Author

Yeah, I definitely don't love these changes, but they're the easiest way to keep consistency with history. There's nothing stopping us from changing back and migrating the database in the future if we want, though.

Contributor

Yeah, and if we do decide to do it in the future, doing it separately from these specific changes will make it easier both to do and to diagnose if something goes wrong

#'
#' @export
-read_file <- Benchmark("read_file",
-  setup = function(source = names(known_sources),
+read_file <- Benchmark("file-read",
+  setup = function(source = c("fanniemae_2016Q4", "nyctaxi_2010-01"),
     # TODO: break out feather_v1 and feather_v2, feather_v2 only in >= 0.17
-    format = c("parquet", "feather"),
+    file_type = c("parquet", "feather"),
     compression = c("uncompressed", "snappy", "lz4"),
-    output = c("arrow_table", "data_frame")) {
-    # format defaults to parquet or feather, but can accept fst as well
-    format <- match.arg(format, c("parquet", "feather", "fst"))
-    # the output defaults are retrieved from the function definition (arrow_table and data_frame)
-    output <- match.arg(output)
+    output_type = c("arrow_table", "data_frame")) {
+    # file_type defaults to parquet or feather, but can accept fst as well
+    file_type <- match.arg(file_type, c("parquet", "feather", "fst"))
Contributor

Do we need {fst} in here anymore?

Contributor Author

We can still run it for comparison if we want, though AFAIK we don't have a policy on when we actually do run external packages and compare against them. It might be nice to run these when there's an Arrow release and include them for context in that report? Not priority, but maybe useful context.

Or was {fst} deprecated or something? I thought I remembered something like that, but now I don't see it.

Contributor

I think maybe you are thinking of disk.frame. And I think the subtext here is that it really doesn't matter too much to include it, so why remove it. Convinced.

Contributor Author

Oh, yep, that was it! Thanks

Contributor

Yeah, we might need to drop support for fst when we move to datalogistik (or do something slightly different if we were to use it), but for now let's keep it here

+    # the output_type defaults are retrieved from the function definition (arrow_table and data_frame)
+    output_type <- match.arg(output_type)

     # ensure that we have the right kind of file available
-    input_file <- ensure_format(source, format, compression)
-    # retrieve the dimnesions for run-checking after the benchmark
+    input_file <- ensure_format(source, file_type, compression)
+    # retrieve the dimensions for run-checking after the benchmark
     result_dim <- get_source_attr(source, "dim")

-    # put the necesary variables into a BenchmarkEnvironment to be used when the
+    # put the necessary variables into a BenchmarkEnvironment to be used when the
     # benchmark is running.
     BenchEnvironment(
-      # get the correct read function for the input format
-      read_func = get_read_function(format),
+      # get the correct read function for the input file_type
+      read_func = get_read_function(file_type),
       input_file = input_file,
       result_dim = result_dim,
-      as_data_frame = output == "data_frame"
+      as_data_frame = output_type == "data_frame"
     )
},
# delete the results before each iteration
@@ -48,44 +48,44 @@ read_file <- Benchmark("read_file",
   },
   # validate that the parameters given are compatible
   valid_params = function(params) {
-    # make sure that the format and the compression is compatible
-    # and fst doesn't have arrow_table output
-    drop <- !validate_format(params$format, params$compression) |
-      params$output == "arrow_table" & params$format == "fst"
+    # make sure that the file_type and the compression is compatible
+    # and fst doesn't have arrow_table output_type
+    drop <- !validate_format(params$file_type, params$compression) |
+      params$output_type == "arrow_table" & params$file_type == "fst"
     params[!drop,]
   },
-  # packages used when specific formats are used
+  # packages used when specific file_types are used
   packages_used = function(params) {
     pkg_map <- c(
       "feather" = "arrow",
       "parquet" = "arrow",
       "fst" = "fst"
     )
-    pkg_map[params$format]
+    pkg_map[params$file_type]
   }
 )

 #' Get a reader
 #'
-#' @param format what format to read
+#' @param file_type what file_type to read
 #'
 #' @return the read function to use
 #' @export
-get_read_function <- function(format) {
+get_read_function <- function(file_type) {
   pkg_map <- c(
     "feather" = "arrow",
     "parquet" = "arrow",
     "fst" = "fst"
   )
-  library(pkg_map[[format]], character.only = TRUE, warn.conflicts = FALSE)
+  library(pkg_map[[file_type]], character.only = TRUE, warn.conflicts = FALSE)

-  if (format == "feather") {
+  if (file_type == "feather") {
     return(function(...) arrow::read_feather(...))
-  } else if (format == "parquet") {
+  } else if (file_type == "parquet") {
     return(function(...) arrow::read_parquet(...))
-  } else if (format == "fst") {
+  } else if (file_type == "fst") {
     return(function(..., as_data_frame) fst::read_fst(...))
   } else {
-    stop("Unsupported format: ", format, call. = FALSE)
+    stop("Unsupported file_type: ", file_type, call. = FALSE)
   }
 }
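The renamed `valid_params` logic above prunes parameter combinations that cannot run before the benchmark starts. A runnable sketch of just that pruning step (`validate_format` is stubbed to always pass here; the real function also rejects incompatible `file_type`/`compression` pairs):

```r
# Stub: the real validate_format also checks file_type/compression pairs
validate_format <- function(file_type, compression) rep(TRUE, length(file_type))

params <- expand.grid(
  file_type   = c("parquet", "fst"),
  compression = "uncompressed",
  output_type = c("arrow_table", "data_frame"),
  stringsAsFactors = FALSE
)
# fst cannot produce an arrow_table, so that combination is dropped
drop <- !validate_format(params$file_type, params$compression) |
  params$output_type == "arrow_table" & params$file_type == "fst"
params[!drop, ]  # 3 of the 4 combinations survive
```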
14 changes: 13 additions & 1 deletion R/bm-tpc-h.R
@@ -14,7 +14,7 @@
 #'
 #' @importFrom waldo compare
 #' @export
-tpc_h <- Benchmark("tpc_h",
+tpc_h <- Benchmark("tpch",
Contributor

[MINOR]: Since you are doing this, I wonder if we should also change tpc_h everywhere to tpch. This would include instances in this file and here:

tpc_h_queries <- list()
tpc_h_queries[[1]] <- function(input_func, collect_func = dplyr::collect, con = NULL) {

Contributor Author

I'm open if we want to, though it may (?) conflict in some places with the work to implement datalogistik in #113. These name changes are a pain, but necessary because we've been sending the Python names to Conbench in tags.name. Some of the Python ones have - instead of _ (e.g. file-read), though, so we can't make object names exactly match name elements (and eventual tags) unless we go the other way and modify everything in the database back to the R names, which is possible, but harder. Or we could quote object names in R with `` everywhere and make an object called file-read, but that just seems like one of those R things that's

[Jurassic Park "could/should" gif]

Contributor

> These name changes are a pain

100%. Thank you for doing them!
I'll leave it up to you. It is not a material difference and really just reflects a slight nit.

Contributor

This is another example where I wish we had called it tpc-h or tpc_h in what was stored in conbench in the first place. Let's go with tpch for now to reduce the number of migration bits we need, but another good example of how the spirit of this project of getting arrowbench -> conbench more directly will help make sure we put up what we want and not what another shim layer between had

setup = function(engine = "arrow",
query_id = 1:22,
format = c("native", "parquet"),
@@ -135,6 +135,18 @@ tpc_h <- Benchmark("tpc_h",
     ( params$engine == "dplyr" & params$format == "native" )
     params[!drop,]
   },
+  batch_id_fun = function(params) {
+    batch_id <- uuid()
+    paste0(batch_id, "-", params$scale_factor, substr(params$format, 1, 1))
Contributor

I know this is effectively copy|paste from voltrondata-labs/benchmarks but ouch that's kinda gross. Would you mind adding a todo here to make this be a bit nicer (probably need to properly stringify the scale factor and also have the format be full(er)?)

Contributor Author

Sure. We could also change it now, if you like; the only thing I can think of that might (? not sure) care about consistency here is @boshek's release report.

Contributor Author

Put in a TODO for now
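Not part of this PR, but one hypothetical shape the TODO could take — stringify the scale factor and keep the full format name instead of only its first letter (`uuid` is stubbed with a fixed value here for determinism):

```r
uuid <- function() "0b8f2e6a"  # stub of arrowbench's uuid helper

# Hypothetical cleaner batch_id_fun sketched for the TODO above
batch_id_fun <- function(params) {
  paste(uuid(), params$scale_factor, params$format, sep = "-")
}
batch_id_fun(list(scale_factor = 0.1, format = "parquet"))
# "0b8f2e6a-0.1-parquet"
```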

+  },
+  tags_fun = function(params) {
+    # for consistency with runs through voltrondata-labs/benchmarks
+    params$query_id <- sprintf("TPCH-%02d", params$query_id)
+    if (params$output == "data_frame") {
+      params$output <- NULL
+    }
+    params
+  },
# packages used when specific formats are used
packages_used = function(params) {
c(params$engine, "dplyr", "lubridate")
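For reference, here is what the new tpch `tags_fun` produces for one case — the logic from the diff rerun standalone with hypothetical parameter values:

```r
# Standalone rerun of the tpch tags_fun: query_id is zero-padded for
# Conbench, and the default data_frame output tag is dropped
tags_fun <- function(params) {
  params$query_id <- sprintf("TPCH-%02d", params$query_id)
  if (params$output == "data_frame") {
    params$output <- NULL
  }
  params
}
tags_fun(list(query_id = 6, output = "data_frame", scale_factor = 1))
# list(query_id = "TPCH-06", scale_factor = 1)
```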
70 changes: 31 additions & 39 deletions R/bm-write-file.R
@@ -2,31 +2,45 @@
 #'
 #' @section Parameters:
 #' * `source` A known-file id, or a CSV(?) file path to read in
-#' * `format` One of `c("parquet", "feather", "fst")`
+#' * `file_type` One of `c("parquet", "feather", "fst")`
 #' * `compression` One of the values: `r paste(known_compressions, collapse = ", ")`
-#' * `input` One of `c("arrow_table", "data_frame")`
+#' * `input_type` One of `c("arrow_table", "data_frame")`
#'
#' @export
-write_file <- Benchmark("write_file",
-  setup = function(source = names(known_sources),
-    format = c("parquet", "feather"),
+write_file <- Benchmark("file-write",
+  setup = function(source = c("fanniemae_2016Q4", "nyctaxi_2010-01"),
+    file_type = c("parquet", "feather"),
     compression = c("uncompressed", "snappy", "lz4"),
-    input = c("arrow_table", "data_frame")) {
+    input_type = c("arrow_table", "data_frame")) {
     # source defaults are retrieved from the function definition (all available
     # known_sources) and then read the source in as a data.frame
     source <- ensure_source(source)
-    df <- read_source(source, as_data_frame = match.arg(input) == "data_frame")
-    # format defaults to parquet or feather, but can accept fst as well
-    format <- match.arg(format, c("parquet", "feather", "fst"))
+    df <- read_source(source, as_data_frame = match.arg(input_type) == "data_frame")
+    # file_type defaults to parquet or feather, but can accept fst as well
+    file_type <- match.arg(file_type, c("parquet", "feather", "fst"))

# Map string param name to functions
-    write_func <- get_write_function(format, compression)
+    get_write_func <- function(file_type, compression) {
+      force(compression)
+      if (file_type == "feather") {
+        return(function(...) arrow::write_feather(..., compression = compression))
+      } else if (file_type == "parquet") {
+        return(function(...) arrow::write_parquet(..., compression = compression))
+      } else if (file_type == "fst") {
+        # fst is always zstd, just a question of what level of compression
+        level <- ifelse(compression == "uncompressed", 0, 50)
+        return(function(...) fst::write_fst(..., compress = level))
+      } else {
+        stop("Unsupported file_type: ", file_type, call. = FALSE)
+      }
+    }
+    write_func <- get_write_func(file_type, compression)

# put the necessary variables into a BenchmarkEnvironment to be used when
# the benchmark is running.
BenchEnvironment(
write_func = write_func,
-      format = format,
+      file_type = file_type,
source = source,
df = df
)
@@ -46,41 +46,19 @@ write_file <- Benchmark("write_file",
   },
   # validate that the parameters given are compatible
   valid_params = function(params) {
-    # make sure that the format and the compression is compatible
-    # and fst doesn't have arrow_table input
-    drop <- !validate_format(params$format, params$compression) |
-      params$format == "fst" & params$input == "arrow_table"
+    # make sure that the file_type and the compression is compatible
+    # and fst doesn't have arrow_table input_type
+    drop <- !validate_format(params$file_type, params$compression) |
+      params$file_type == "fst" & params$input_type == "arrow_table"
     params[!drop,]
   },
-  # packages used when specific formats are used
+  # packages used when specific file_types are used
packages_used = function(params) {
pkg_map <- c(
"feather" = "arrow",
"parquet" = "arrow",
"fst" = "fst"
)
-    pkg_map[params$format]
+    pkg_map[params$file_type]
}
)

-#' Get a writer
-#'
-#' @param format format to write
-#' @param compression compression to use
-#'
-#' @return the write function to use
-#' @export
-get_write_function <- function(format, compression) {
-  force(compression)
-  if (format == "feather") {
-    return(function(...) arrow::write_feather(..., compression = compression))
-  } else if (format == "parquet") {
-    return(function(...) arrow::write_parquet(..., compression = compression))
-  } else if (format == "fst") {
-    # fst is always zstd, just a question of what level of compression
-    level <- ifelse(compression == "uncompressed", 0, 50)
-    return(function(...) fst::write_fst(..., compress = level))
-  } else {
-    stop("Unsupported format: ", format, call. = FALSE)
-  }
-}
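A side note on the `get_write_func` closure now inlined into `setup`: `force(compression)` evaluates the argument before the writer closure is returned, so the closure captures a value rather than an unevaluated promise. A minimal standalone sketch of that pattern:

```r
# Minimal sketch of the force() closure pattern used by get_write_func
make_writer <- function(compression) {
  force(compression)  # evaluate now, so the closure captures the value
  function(...) paste("write with", compression)
}
w <- make_writer("snappy")
w()  # "write with snappy"
```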