-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Align results with those of labs/benchmarks and the Conbench API #116
Changes from 7 commits
83bf24f
8c4f5d8
fce36fd
57771fe
d79a587
a0de1de
1019e29
06e8442
3717347
c6c5f2f
f09da47
dbaea1e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,8 @@ | ||
on: [push, pull_request] | ||
on: | ||
push: | ||
branches: | ||
- main | ||
pull_request: | ||
|
||
name: R-CMD-check | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,10 +1,11 @@ | ||
Package: arrowbench | ||
Type: Package | ||
Title: Tools for Continuous and Interactive Benchmarking | ||
Version: 0.1.0 | ||
Version: 0.2.0 | ||
Authors@R: c( | ||
person("Neal", "Richardson", email = "neal.p.richardson@gmail.com", role = "aut"), | ||
person("Jonathan", "Keane", email = "jkeane@gmail.com", role = c("aut", "cre")) | ||
person("Jonathan", "Keane", email = "jkeane@gmail.com", role = c("aut", "cre")), | ||
person("Edward", "Visel", email = "edward.visel@gmail.com", role = "aut", comment = c(ORCID = "0000-0002-2811-6254")) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🎉 |
||
) | ||
Description: Tools for defining benchmarks, running them across a | ||
range of parameters, and reporting their results in a standardized form. | ||
|
@@ -14,6 +15,7 @@ Depends: R (>= 3.5.0) | |
Imports: | ||
arrow, | ||
bench, | ||
callr, | ||
dbplyr, | ||
dplyr, | ||
distro, | ||
|
@@ -26,9 +28,9 @@ Imports: | |
R.utils, | ||
sessioninfo, | ||
utils, | ||
uuid, | ||
waldo, | ||
withr, | ||
callr | ||
withr | ||
Suggests: | ||
testthat (>= 3.0.0), | ||
archive, | ||
|
@@ -43,7 +45,7 @@ Suggests: | |
RcppSimdJson, | ||
readr, | ||
vroom | ||
RoxygenNote: 7.1.2 | ||
RoxygenNote: 7.2.2 | ||
Roxygen: list(markdown = TRUE, load = "source") | ||
Collate: | ||
'benchmark.R' | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,35 +2,35 @@ | |
#' | ||
#' @section Parameters: | ||
#' * `source` A known-file id, or a CSV(?) file path to read in | ||
#' * `format` One of `c("parquet", "feather", "fst")` | ||
#' * `file_type` One of `c("parquet", "feather", "fst")` | ||
#' * `compression` One of the values: `r paste(known_compressions, collapse = ", ")` | ||
#' * `output` One of `c("arrow_table", "data_frame")` | ||
#' * `output_type` One of `c("arrow_table", "data_frame")` | ||
Comment on lines
-5
to
+7
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm fine changing these here if it's easiest to go this direction, though I kinda wish we went the other way (but understand migrating out of that would be a pain in the ass...) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I definitely don't love these changes, but they're the easiest way to keep consistency with history. There's nothing stopping us from changing back and migrating the database in the future if we want, though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, and if we do decided to do it in the future, doing it separate from these specific changes will make it easier to both do and to diagnose if something goes wrong |
||
#' | ||
#' @export | ||
read_file <- Benchmark("read_file", | ||
setup = function(source = names(known_sources), | ||
read_file <- Benchmark("file-read", | ||
setup = function(source = c("fanniemae_2016Q4", "nyctaxi_2010-01"), | ||
# TODO: break out feather_v1 and feather_v2, feather_v2 only in >= 0.17 | ||
format = c("parquet", "feather"), | ||
file_type = c("parquet", "feather"), | ||
compression = c("uncompressed", "snappy", "lz4"), | ||
output = c("arrow_table", "data_frame")) { | ||
# format defaults to parquet or feather, but can accept fst as well | ||
format <- match.arg(format, c("parquet", "feather", "fst")) | ||
# the output defaults are retrieved from the function definition (arrow_table and data_frame) | ||
output <- match.arg(output) | ||
output_type = c("arrow_table", "data_frame")) { | ||
# file_type defaults to parquet or feather, but can accept fst as well | ||
file_type <- match.arg(file_type, c("parquet", "feather", "fst")) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need {fst} in here anymore? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can still run it for comparison if we want, though AFAIK we don't have a policy on when we actually do run external packages and compare against them. It might be nice to run these when there's an Arrow release and include them for context in that report? Not priority, but maybe useful context. Or was {fst} deprecated or something? I thought I remembered something like that, but now I don't see it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think maybe you are thinking disk.frame. And I think the subtext here is that it really doesn't matter too much to include it so why remove. Convinced. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, yep, that was it! Thanks There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, we might need to drop support for fst when we move to datalogistik (or do something slightly different if we were to use it), but for now let's keep it here |
||
# the output_type defaults are retrieved from the function definition (arrow_table and data_frame) | ||
output_type <- match.arg(output_type) | ||
|
||
# ensure that we have the right kind of file available | ||
input_file <- ensure_format(source, format, compression) | ||
# retrieve the dimnesions for run-checking after the benchmark | ||
input_file <- ensure_format(source, file_type, compression) | ||
# retrieve the dimensions for run-checking after the benchmark | ||
result_dim <- get_source_attr(source, "dim") | ||
|
||
# put the necesary variables into a BenchmarkEnvironment to be used when the | ||
# put the necessary variables into a BenchmarkEnvironment to be used when the | ||
# benchmark is running. | ||
BenchEnvironment( | ||
# get the correct read function for the input format | ||
read_func = get_read_function(format), | ||
# get the correct read function for the input file_type | ||
read_func = get_read_function(file_type), | ||
input_file = input_file, | ||
result_dim = result_dim, | ||
as_data_frame = output == "data_frame" | ||
as_data_frame = output_type == "data_frame" | ||
) | ||
}, | ||
# delete the results before each iteration | ||
|
@@ -48,44 +48,44 @@ read_file <- Benchmark("read_file", | |
}, | ||
# validate that the parameters given are compatible | ||
valid_params = function(params) { | ||
# make sure that the format and the compression is compatible | ||
# and fst doesn't have arrow_table output | ||
drop <- !validate_format(params$format, params$compression) | | ||
params$output == "arrow_table" & params$format == "fst" | ||
# make sure that the file_type and the compression is compatible | ||
# and fst doesn't have arrow_table output_type | ||
drop <- !validate_format(params$file_type, params$compression) | | ||
params$output_type == "arrow_table" & params$file_type == "fst" | ||
params[!drop,] | ||
}, | ||
# packages used when specific formats are used | ||
# packages used when specific file_types are used | ||
packages_used = function(params) { | ||
pkg_map <- c( | ||
"feather" = "arrow", | ||
"parquet" = "arrow", | ||
"fst" = "fst" | ||
) | ||
pkg_map[params$format] | ||
pkg_map[params$file_type] | ||
} | ||
) | ||
|
||
#' Get a reader | ||
#' | ||
#' @param format what format to read | ||
#' @param file_type what file_type to read | ||
#' | ||
#' @return the read function to use | ||
#' @export | ||
get_read_function <- function(format) { | ||
get_read_function <- function(file_type) { | ||
pkg_map <- c( | ||
"feather" = "arrow", | ||
"parquet" = "arrow", | ||
"fst" = "fst" | ||
) | ||
library(pkg_map[[format]], character.only = TRUE, warn.conflicts = FALSE) | ||
library(pkg_map[[file_type]], character.only = TRUE, warn.conflicts = FALSE) | ||
|
||
if (format == "feather") { | ||
if (file_type == "feather") { | ||
return(function(...) arrow::read_feather(...)) | ||
} else if (format == "parquet") { | ||
} else if (file_type == "parquet") { | ||
return(function(...) arrow::read_parquet(...)) | ||
} else if (format == "fst") { | ||
} else if (file_type == "fst") { | ||
return(function(..., as_data_frame) fst::read_fst(...)) | ||
} else { | ||
stop("Unsupported format: ", format, call. = FALSE) | ||
stop("Unsupported file_type: ", file_type, call. = FALSE) | ||
} | ||
} |
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
|
@@ -14,7 +14,7 @@ | |||||||
#' | ||||||||
#' @importFrom waldo compare | ||||||||
#' @export | ||||||||
tpc_h <- Benchmark("tpc_h", | ||||||||
tpc_h <- Benchmark("tpch", | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [MINOR]: Since you are doing this, I wonder if we should also change tpc_h everywhere to tpch. This would include instances in this file and here: Lines 6 to 8 in 5abf163
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm open if we want to, though it may (?) conflict in some places with the work to implement datalogistik in #113. These name changes are a pain, but necessary because we've been sending the Python names to Conbench in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
100%. Thank you for doing them! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is another example where I wish we had called it |
||||||||
setup = function(engine = "arrow", | ||||||||
query_id = 1:22, | ||||||||
format = c("native", "parquet"), | ||||||||
|
@@ -135,6 +135,18 @@ tpc_h <- Benchmark("tpc_h", | |||||||
( params$engine == "dplyr" & params$format == "native" ) | ||||||||
params[!drop,] | ||||||||
}, | ||||||||
batch_id_fun = function(params) { | ||||||||
batch_id <- uuid() | ||||||||
paste0(batch_id, "-", params$scale_factor, substr(params$format, 1, 1)) | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know this is effectively copy|paste from voltrondata-labs/benchmarks but ouch that's kinda gross. Would you mind adding a todo here to make this be a bit nicer (probably need to properly stringify the scale factor and also have the format be full(er)?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure. We could also change it now, if you like; the only thing I can think of that might (? not sure) care about consistency here is @boshek's release report. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Put in a |
||||||||
}, | ||||||||
tags_fun = function(params) { | ||||||||
# for consistency with runs through voltrondata-labs/benchmarks | ||||||||
params$query_id <- sprintf("TPCH-%02d", params$query_id) | ||||||||
if (params$output == "data_frame") { | ||||||||
params$output <- NULL | ||||||||
} | ||||||||
params | ||||||||
}, | ||||||||
# packages used when specific formats are used | ||||||||
packages_used = function(params) { | ||||||||
c(params$engine, "dplyr", "lubridate") | ||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for only running on push to main? I definitely from time to time appreciate having CI on branches even before they get to the PR stage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were running duplicate builds of everything on PRs because they're both branches and pull requests, so I copied this from one of our other repos. We could drop lines 4-5 (I think?) if we want builds on all pushes to branches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you point me to an example of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On your PR, if you expand the builds, there are two versions of each, one appended with
(pull_request)
, the other with(push)
:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, those are testing two different things though (check out the details of the checkouts). The
push
is theHEAD
of the branch, thepull_request
is the merge ofHEAD
of branch + main. For that specific PR they end up being identical code wise (since I rebased the branch on main a few moments ago), but they can be and frequently are quite differentpush
https://github.com/voltrondata-labs/arrowbench/actions/runs/3605143011/jobs/6075232144#step:2:455
pull_request
https://github.com/voltrondata-labs/arrowbench/actions/runs/3605143166/jobs/6075232423#step:2:459
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting. Though it seems like
pull_request
is what I'd expect aside from on the default branch. But I can revert.