Commit

export generate_tpch (#112)
* export generate_tpch

* expand description

* Update R/ensure-tpch-source.R

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

* update README

* render docs

Co-authored-by: Jonathan Keane <jkeane@gmail.com>
boshek and jonkeane authored Sep 14, 2022
1 parent 838216e commit 5abf163
Showing 4 changed files with 40 additions and 1 deletion.
1 change: 1 addition & 0 deletions NAMESPACE
@@ -17,6 +17,7 @@ export(ensure_dataset)
export(ensure_format)
export(ensure_source)
export(file_with_ext)
+export(generate_tpch)
export(get_csv_reader)
export(get_csv_writer)
export(get_dataset_attr)
15 changes: 15 additions & 0 deletions R/ensure-tpch-source.R
@@ -4,6 +4,21 @@
#' @export
tpch_tables <- c("customer", "lineitem", "nation", "orders", "part", "partsupp", "region", "supplier")

+#' Generate tpch data
+#'
+#' Generate tpch data at a given scale factor. By default,
+#' data is output relative to the current working directory. However,
+#' you can set the environment variable `ARROWBENCH_DATA_DIR` to
+#' point to another directory. Setting this environment variable has
+#' the advantage of being a central location for general usage. Running
+#' this function will install a custom version of duckdb in an `r_libs`
+#' directory, relative to the directory specified by the environment
+#' variable `ARROWBENCH_LOCAL_DIR`. When running this function for the first time you will
+#' see significant output from that installation process. This is normal.
+#'
+#' @param scale_factor a relative measure of the size of data in gigabytes.
+#'
+#' @export
generate_tpch <- function(scale_factor = 1) {
  # Ensure that we have our custom duckdb that has the TPC-H extension built.
  ensure_custom_duckdb()
3 changes: 2 additions & 1 deletion README.md
@@ -256,7 +256,8 @@ Source files are cached in a `data` directory and are only downloaded if
not present. This speeds up repeat benchmark runs on the same host. By default,
`data` is assumed to be relative to the current working directory, but
you can set the environment variable `ARROWBENCH_DATA_DIR` to point to another
-(permanent) base directory.
+base directory. Setting this environment variable has the advantage of being a
+central location for general usage.

Similarly, there is an `ensure_lib()` function called in the `global_setup()`
that supports a list of known `arrow` package versions, which are mapped to
22 changes: 22 additions & 0 deletions man/generate_tpch.Rd
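
For readers skimming this commit, the documented behavior of the newly exported `generate_tpch()` can be sketched roughly as follows. This is a minimal usage sketch, not taken from the package docs: the directory path is an assumed example, and the call is guarded so the snippet does nothing beyond setting the environment variable when arrowbench is not installed.

```r
# Assumed example location; any writable directory would do.
data_dir <- file.path(path.expand("~"), "arrowbench-data")

# Point arrowbench at a central data directory instead of the
# current working directory, as described in the roxygen text.
Sys.setenv(ARROWBENCH_DATA_DIR = data_dir)

# Guarded call: only runs when arrowbench is available. Per the
# documentation, the first run installs a custom duckdb and prints
# a lot of output.
if (requireNamespace("arrowbench", quietly = TRUE)) {
  arrowbench::generate_tpch(scale_factor = 1)
}
```

Per the `@param` description, `scale_factor` is a relative measure of data size in gigabytes, so small values are the safer choice for a first trial.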