Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use benchconnect to publish results to a Conbench server #127

Merged
merged 8 commits into from
Jan 23, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ Imports:
distro,
glue,
jsonlite,
processx,
progress,
purrr,
R6,
Expand Down Expand Up @@ -72,8 +73,10 @@ Collate:
'known-sources.R'
'ensure-source.R'
'ensure-tpch-source.R'
'external-dependencies.R'
'measure.R'
'params.R'
'publish.R'
'util.R'
'result.R'
'run.R'
Expand Down
3 changes: 3 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,9 @@ export(get_read_function)
export(get_source_attr)
export(get_sql_query_func)
export(get_write_function)
export(install_benchconnect)
export(install_datalogistik)
export(install_pipx)
export(known_compressions)
export(known_formats)
export(known_sources)
Expand Down
111 changes: 111 additions & 0 deletions R/external-dependencies.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
external_cli_available <- function(cli) {
exit_code <- system(paste("which", cli), ignore.stdout = TRUE, ignore.stderr = TRUE)

if (exit_code != 0L) {
msg <- paste(cli, 'not installed or on $PATH.\n\n')
if (cli == "pipx") {
msg <- paste0(
msg,
glue::glue('It can be installed with `install_pipx()`\n\n'),
'If already installed, ensure it is on $PATH, e.g. by running',
'`pipx ensurepath` or adding `PATH="${PATH}:${HOME}/.local/bin"` to ~/.Renviron'
)
} else {
msg <- paste0(
msg,
glue::glue('It can be installed with `install_pipx(); install_{cli}()`\n\n'),
'If already installed with pipx, ensure it is on $PATH, e.g. by running',
'`pipx ensurepath` or adding `PATH="${PATH}:${HOME}/.local/bin"` to ~/.Renviron'
)
}

warning(warningCondition(msg, class = "notInstalledWarning"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

}

exit_code == 0L
}

pipx_available <- function() {
external_cli_available(cli = "pipx")
}

benchconnect_available <- function() {
external_cli_available(cli = "benchconnect")
}

datalogistik_available <- function() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried messing around with how vscode versus RStudio picks up these paths. Though it is likely something that I've done locally I just wanted to highlight that vscode was only able to pick the paths from the Renviron file. I only raise this because if this got wider usage this issue would likely crop up. Likely fine for now though given limited use of this package for the specific task of arrow R benchmarks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, to get some of our conversation yesterday where it should be: Mostly this shouldn't be too much of an issue for CI, since everything is run from the command line, but for users running from apps like RStudio, they may need to mess with PATH via .Renviron or otherwise.

Again, since benchconnect only really needs to be installed where we publish, this probably isn't too much of an issue because that's a lot of machines handled by CI, but for datalogistik, this matters more since benchmark authors will also need it to work locally.

external_cli_available(cli = "datalogistik")
}


#' Install pipx
#'
#' Install [pipx](https://pypa.github.io/pipx/), a version of pip that installs
#' Python packages in isolated environments where they will always be available
#' regardless of which version of Python is presently on `$PATH`. Especially
#' useful for installing packages designed to be used via CLIs.
#'
#' @export
install_pipx <- function() {
system('pip install pipx && pipx ensurepath', intern = TRUE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially naive question, but is there any reason we aren't using processx::run here if only for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, this is mostly because I wrote this before I ran into the issues which drove me to processx::run() in the first place 😄

We do still have a number of system() calls; not just these, but also the very important one here that runs the script. These are easy to change over, but that one might be trickier—I'm pretty sure there are some subtleties in how that's set up, and I really don't want to bork that one. It's very likely worth changing, but I'd want to make sure Jon has time to do a very thorough review of that particular change.

}


#' Install benchconnect
#'
#' Install [benchconnect](https://github.com/conbench/conbench/tree/main/benchconnect),
#' a utility for sending benchmark results to a Conbench server
#'
#' @export
install_benchconnect <- function() {
stopifnot(pipx_available())

url <- "benchconnect@git+https://github.com/conbench/conbench.git@main#subdirectory=benchconnect"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if it might make sense to wrap this in a function so that a) we could add other bits from conbench more easily if ever needed and b) at least bake in some ability to change which branch/ref benchconnect is using (much like below).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've already run into some issues in development where I needed a branch version of benchadapt (a dep of benchconnect) installed, which led to me learning more about pipx.

Long-term, though, the solution here is to put benchconnect and datalogistik on PyPI and/or conda-forge so we can use more normal Python dependency management instead; installing from GitHub has a number of unpleasant complications. I think Austin has some thoughts on that and our general monorepo strategy for Conbench.

pipx_call <- "pipx install --include-deps"

if (suppressWarnings(benchconnect_available(), classes = "notInstalledWarning")) {
if (interactive()) {
ans <- readline("benchconnect already installed. Update? [Y/n]: ")
} else {
ans <- "y"
}
if (tolower(ans) %in% c("y", "")) {
system(paste(pipx_call, "--force", url), intern = TRUE)
} else {
invisible()
}
} else {
system(paste(pipx_call, url), intern = TRUE)
}
}


#' Install datalogistik
#'
#' Install [datalogistik](https://github.com/conbench/datalogistik), a utility
#' for generating, downloading, and converting datasets for benchmarking.
#'
#' Only for interactive use.
#'
#' @export
install_datalogistik <- function() {
# TODO: install pipx?
stopifnot(pipx_available())

ref <- Sys.getenv("DATALOGISTIK_BRANCH", unset = "main")
url <- glue("git+https://github.com/conbench/datalogistik.git@{ref}")

pipx_call <- "pipx install --pip-args '--extra-index-url https://pypi.fury.io/arrow-nightlies --prefer-binary'"
if (datalogistik_available()) {
# default to yes (and also this will make it work in non-interactive sessions)
ans <- readline("datalogistik already installed. Update? [Y/n]: ")
if (tolower(ans) %in% c("y", "")) {
# we need the extra args to depend on the development version of arrow
return(system(glue("{pipx_call} --force {url}"), intern = TRUE))
} else {
return(invisible())
}
}

system(glue("{pipx_call} {url}"), intern = TRUE)
}
38 changes: 38 additions & 0 deletions R/publish.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Call benchconnect
#
# @param args A character vector of arguments to pass to the benchconnect binary
#
# @returns A string of stdout returned by the call
call_benchconnect <- function(args) {
stopifnot(benchconnect_available())
res <- processx::run(command = "benchconnect", args = args, echo_cmd = TRUE, echo = TRUE)
message(res$stderr)
res$stdout
}


augment_run <- function(run) {
stdout <- call_benchconnect(c("augment", "run", "--json", run$json))
BenchmarkRun$from_json(stdout)
}

augment_result <- function(result) {
stdout <- call_benchconnect(c("augment", "result", "--json", result$json))
BenchmarkResult$from_json(stdout)
}


start_run <- function(run) {
call_benchconnect(c("start", "run", "--json", run$json))
}

submit_result <- function(result) {
call_benchconnect(c("submit", "result", "--json", result$json))
}

finish_run <- function(run) {
# Ed note: `run` is not used right now, but there are some things we can pass
# here in the future, so I put it here for parallelism for now. Since it is
# not evaluated, it doesn't need to be specified for now.
call_benchconnect(c("finish", "run", "--json", "{}"))
}
54 changes: 53 additions & 1 deletion R/result.R
Original file line number Diff line number Diff line change
Expand Up @@ -186,7 +186,7 @@ BenchmarkResult <- R6Point1Class(
machine_info = NULL,
cluster_info = NULL,
context = NULL,
github = NULL) {
github = github_info()) {
self$run_name <- run_name
self$run_id <- run_id
self$batch_id <- batch_id
Expand Down Expand Up @@ -329,3 +329,55 @@ as.data.frame.BenchmarkResults <- function(x, row.names = NULL, optional = FALSE
as.data.frame.BenchmarkResult <- function(x, row.names = NULL, optional = FALSE, packages = "arrow", ...) {
x$to_dataframe(row.names = row.names, optional = optional, packages = packages, ...)
}


# A class for holding metadata on a benchmark run
#
# Because this class inherits from `Serializable`, it can be written to and
# instantiated from JSON forms.
#
# All attributes are active bindings so that validation can be run when they are
# set, whether during or after instantiation.
BenchmarkRun <- R6Point1Class(
classname = "BenchmarkRun",
inherit = Serializable,

public = list(
initialize = function(
name = NULL,
id = NULL,
reason = NULL,
info = NULL,
machine_info = NULL,
cluster_info = NULL,
github = github_info(),
finished_timestamp = NULL,
error_type = NULL,
error_info = NULL
) {
self$name <- name
self$id <- id
self$reason <- reason
self$info <- info
self$machine_info <- machine_info
self$cluster_info <- cluster_info
self$github <- github
self$finished_timestamp <- finished_timestamp
self$error_type <- error_type
self$error_info <- error_info
}
),

active = list(
name = function(name) private$get_or_set_serializable(variable = "name", value = name),
id = function(id) private$get_or_set_serializable(variable = "id", value = id),
reason = function(reason) private$get_or_set_serializable(variable = "reason", value = reason),
info = function(info) private$get_or_set_serializable(variable = "info", value = info),
machine_info = function(machine_info) private$get_or_set_serializable(variable = "machine_info", value = machine_info),
cluster_info = function(cluster_info) private$get_or_set_serializable(variable = "cluster_info", value = cluster_info),
github = function(github) private$get_or_set_serializable(variable = "github", value = github),
finished_timestamp = function(finished_timestamp) private$get_or_set_serializable(variable = "finished_timestamp", value = finished_timestamp),
error_type = function(error_type) private$get_or_set_serializable(variable = "error_type", value = error_type),
error_info = function(error_info) private$get_or_set_serializable(variable = "error_info", value = error_info)
)
)
Loading