Add TPC-H PPL query suite #830
Merged
7 commits
3b466dd Add TPC-H PPL query suite (LantaoJin)
98e5e14 Merge remote-tracking branch 'upstream/main' into issues/806 (LantaoJin)
4697791 fix failure of loading resources (LantaoJin)
c255c52 Merge remote-tracking branch 'upstream/main' into issues/806 (LantaoJin)
5007321 fix data_add() (LantaoJin)
6678fa9 Merge remote-tracking branch 'upstream/main' into issues/806 (LantaoJin)
bfefaca enable q21 and add docs (LantaoJin)
@@ -0,0 +1,102 @@
## TPC-H Benchmark

TPC-H is a decision support benchmark designed to evaluate the performance of database systems in handling complex business-oriented queries and concurrent data modifications. The benchmark utilizes a dataset that is broadly representative of various industries, making it widely applicable. TPC-H simulates a decision support environment where large volumes of data are analyzed, intricate queries are executed, and critical business questions are answered.

### Test PPL Queries

The 22 TPC-H test query statements: [TPCH-Query-PPL](https://github.com/opensearch-project/opensearch-spark/blob/main/integ-test/src/integration/resources/tpch)

### Data Preparation

#### Option 1 - from PyPI

```
# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

pip install tpch-datagen
```

#### Option 2 - from source

```
git clone https://github.com/gizmodata/tpch-datagen

cd tpch-datagen

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel

# Install TPC-H Datagen - in editable mode with client and dev dependencies
pip install --editable .[dev]
```

#### Usage

Here are the options for the tpch-datagen command:
```
tpch-datagen --help
Usage: tpch-datagen [OPTIONS]

Options:
  --version / --no-version        Prints the TPC-H Datagen package version and
                                  exits.  [required]
  --scale-factor INTEGER          The TPC-H Scale Factor to use for data
                                  generation.
  --data-directory TEXT           The target output data directory to put the
                                  files into  [default: data; required]
  --work-directory TEXT           The work directory to use for data
                                  generation.  [default: /tmp; required]
  --overwrite / --no-overwrite    Can we overwrite the target directory if it
                                  already exists...  [default: no-overwrite;
                                  required]
  --num-chunks INTEGER            The number of chunks that will be generated
                                  - more chunks equals smaller memory
                                  requirements, but more files generated.
                                  [default: 10; required]
  --num-processes INTEGER         The maximum number of processes for the
                                  multi-processing pool to use for data
                                  generation.  [default: 10; required]
  --duckdb-threads INTEGER        The number of DuckDB threads to use for data
                                  generation (within each job process).
                                  [default: 1; required]
  --per-thread-output / --no-per-thread-output
                                  Controls whether to write the output to a
                                  single file or multiple files (for each
                                  process).  [default: per-thread-output;
                                  required]
  --compression-method [none|snappy|gzip|zstd]
                                  The compression method to use for the
                                  parquet files generated.  [default: zstd;
                                  required]
  --file-size-bytes TEXT          The target file size for the parquet files
                                  generated.  [default: 100m; required]
  --help                          Show this message and exit.
```

### Generate 1 GB of data with zstd (the default) compression

```
tpch-datagen --scale-factor 1
```

### Generate 10 GB of data with snappy compression

```
tpch-datagen --scale-factor 10 --compression-method snappy
```

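The two invocations above differ only in their flags. As a minimal sketch, assuming `tpch-datagen` is installed and on the `PATH`, a small Python helper can assemble the same command line from the options shown in the help output. The helper name `build_datagen_command` is hypothetical and not part of the tpch-datagen package; only the flags themselves come from the help text.

```python
import shlex

# Compression methods listed in the `tpch-datagen --help` output.
COMPRESSION_METHODS = ("none", "snappy", "gzip", "zstd")


def build_datagen_command(scale_factor, compression_method="zstd", data_directory="data"):
    """Return the argv list for a tpch-datagen run (hypothetical helper)."""
    if compression_method not in COMPRESSION_METHODS:
        raise ValueError(f"unsupported compression method: {compression_method}")
    return [
        "tpch-datagen",
        "--scale-factor", str(scale_factor),
        "--compression-method", compression_method,
        "--data-directory", data_directory,
    ]


if __name__ == "__main__":
    # Only print the command here; pass the list to subprocess.run() to execute it.
    print(shlex.join(build_datagen_command(10, "snappy")))
```

Building the argument list first and validating the compression method up front means a typo fails immediately rather than after a long generation run has started.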
### Query Test

All TPC-H PPL queries are located in the `integ-test/src/integration/resources/tpch` folder.

To test all queries, run `org.opensearch.flint.spark.ppl.tpch.TPCHQueryITSuite`.
@@ -0,0 +1,35 @@
/*
select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '90' day
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus
*/

source = lineitem
| where l_shipdate <= subdate(date('1998-12-01'), 90)
| stats sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count() as count_order
        by l_returnflag, l_linestatus
| sort l_returnflag, l_linestatus
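The PPL filter expresses the SQL condition `date '1998-12-01' - interval '90' day` as `subdate(date('1998-12-01'), 90)`. As a quick cross-check of the cutoff date both forms are expected to produce, the same arithmetic can be done with Python's standard library. The function name `shipdate_cutoff` is illustrative only.

```python
from datetime import date, timedelta


def shipdate_cutoff(base=date(1998, 12, 1), days=90):
    """Subtract a whole number of days from the base date."""
    return base - timedelta(days=days)


if __name__ == "__main__":
    print(shipdate_cutoff().isoformat())  # 1998-09-02
```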
@@ -0,0 +1,45 @@
/*
select
    c_custkey,
    c_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
from
    customer,
    orders,
    lineitem,
    nation
where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate >= date '1993-10-01'
    and o_orderdate < date '1993-10-01' + interval '3' month
    and l_returnflag = 'R'
    and c_nationkey = n_nationkey
group by
    c_custkey,
    c_name,
    c_acctbal,
    c_phone,
    n_name,
    c_address,
    c_comment
order by
    revenue desc
limit 20
*/

source = customer
| join ON c_custkey = o_custkey orders
| join ON l_orderkey = o_orderkey lineitem
| join ON c_nationkey = n_nationkey nation
| where o_orderdate >= date('1993-10-01')
    AND o_orderdate < date_add(date('1993-10-01'), interval 3 month)
    AND l_returnflag = 'R'
| stats sum(l_extendedprice * (1 - l_discount)) as revenue by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
| sort - revenue
| head 20
@@ -0,0 +1,45 @@
/*
select
    ps_partkey,
    sum(ps_supplycost * ps_availqty) as value
from
    partsupp,
    supplier,
    nation
where
    ps_suppkey = s_suppkey
    and s_nationkey = n_nationkey
    and n_name = 'GERMANY'
group by
    ps_partkey having
        sum(ps_supplycost * ps_availqty) > (
            select
                sum(ps_supplycost * ps_availqty) * 0.0001000000
            from
                partsupp,
                supplier,
                nation
            where
                ps_suppkey = s_suppkey
                and s_nationkey = n_nationkey
                and n_name = 'GERMANY'
        )
order by
    value desc
*/

source = partsupp
| join ON ps_suppkey = s_suppkey supplier
| join ON s_nationkey = n_nationkey nation
| where n_name = 'GERMANY'
| stats sum(ps_supplycost * ps_availqty) as value by ps_partkey
| where value > [
    source = partsupp
    | join ON ps_suppkey = s_suppkey supplier
    | join ON s_nationkey = n_nationkey nation
    | where n_name = 'GERMANY'
    | stats sum(ps_supplycost * ps_availqty) as check
    | eval threshold = check * 0.0001000000
    | fields threshold
  ]
| sort - value
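The query above keeps only the parts whose stock value exceeds a fixed fraction (0.0001) of the total stock value, which the PPL version computes in a scalar subquery. A rough Python sketch of that logic follows, assuming the rows have already been joined and restricted to one nation. The function name `important_stock` and the sample rows are made up for illustration and do not come from the TPC-H dataset.

```python
from collections import defaultdict


def important_stock(rows, fraction=0.0001):
    """rows: iterable of (ps_partkey, ps_supplycost, ps_availqty) tuples."""
    # Equivalent of: stats sum(ps_supplycost * ps_availqty) as value by ps_partkey
    value_by_part = defaultdict(float)
    for ps_partkey, ps_supplycost, ps_availqty in rows:
        value_by_part[ps_partkey] += ps_supplycost * ps_availqty

    # Equivalent of the scalar subquery: total value times the fraction.
    threshold = sum(value_by_part.values()) * fraction

    # Equivalent of: where value > threshold | sort - value
    kept = [(key, value) for key, value in value_by_part.items() if value > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up rows: part 2 contributes too little to pass the threshold.
    sample = [(1, 10.0, 100), (2, 0.01, 1)]
    print(important_stock(sample))  # [(1, 1000.0)]
```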
@@ -0,0 +1,42 @@
/*
select
    l_shipmode,
    sum(case
        when o_orderpriority = '1-URGENT'
            or o_orderpriority = '2-HIGH'
            then 1
        else 0
    end) as high_line_count,
    sum(case
        when o_orderpriority <> '1-URGENT'
            and o_orderpriority <> '2-HIGH'
            then 1
        else 0
    end) as low_line_count
from
    orders,
    lineitem
where
    o_orderkey = l_orderkey
    and l_shipmode in ('MAIL', 'SHIP')
    and l_commitdate < l_receiptdate
    and l_shipdate < l_commitdate
    and l_receiptdate >= date '1994-01-01'
    and l_receiptdate < date '1994-01-01' + interval '1' year
group by
    l_shipmode
order by
    l_shipmode
*/

source = orders
| join ON o_orderkey = l_orderkey lineitem
| where l_commitdate < l_receiptdate
    and l_shipdate < l_commitdate
    and l_shipmode in ('MAIL', 'SHIP')
    and l_receiptdate >= date('1994-01-01')
    and l_receiptdate < date_add(date('1994-01-01'), interval 1 year)
| stats sum(case(o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH', 1 else 0)) as high_line_count,
        sum(case(o_orderpriority != '1-URGENT' and o_orderpriority != '2-HIGH', 1 else 0)) as low_line_count
        by l_shipmode
| sort l_shipmode
@@ -0,0 +1,31 @@
/*
select
    c_count,
    count(*) as custdist
from
    (
        select
            c_custkey,
            count(o_orderkey) as c_count
        from
            customer left outer join orders on
                c_custkey = o_custkey
                and o_comment not like '%special%requests%'
        group by
            c_custkey
    ) as c_orders
group by
    c_count
order by
    custdist desc,
    c_count desc
*/

source = [
    source = customer
    | left outer join ON c_custkey = o_custkey AND not like(o_comment, '%special%requests%')
      orders
    | stats count(o_orderkey) as c_count by c_custkey
  ] as c_orders
| stats count() as custdist by c_count
| sort - custdist, - c_count
@@ -0,0 +1,25 @@
/*
select
    100.00 * sum(case
        when p_type like 'PROMO%'
            then l_extendedprice * (1 - l_discount)
        else 0
    end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
    lineitem,
    part
where
    l_partkey = p_partkey
    and l_shipdate >= date '1995-09-01'
    and l_shipdate < date '1995-09-01' + interval '1' month
*/

source = lineitem
| join ON l_partkey = p_partkey
      AND l_shipdate >= date('1995-09-01')
      AND l_shipdate < date_add(date('1995-09-01'), interval 1 month)
  part
| stats sum(case(like(p_type, 'PROMO%'), l_extendedprice * (1 - l_discount) else 0)) as sum1,
        sum(l_extendedprice * (1 - l_discount)) as sum2
| eval promo_revenue = 100.00 * sum1 / sum2 // Stats and Eval commands can combine when issues/819 resolved
| fields promo_revenue
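The PPL version splits the SQL expression into two aggregates (`sum1`, `sum2`) and derives the percentage in a later `eval` step. The arithmetic can be illustrated with a short Python sketch, assuming rows that are already joined and date-filtered. The function name `promo_revenue` mirrors the output column, and the sample rows are invented for the example.

```python
def promo_revenue(rows):
    """rows: iterable of (p_type, l_extendedprice, l_discount) tuples."""
    sum1 = 0.0  # discounted revenue from parts whose type starts with PROMO
    sum2 = 0.0  # discounted revenue from all parts
    for p_type, l_extendedprice, l_discount in rows:
        revenue = l_extendedprice * (1 - l_discount)
        if p_type.startswith("PROMO"):  # same match as like(p_type, 'PROMO%')
            sum1 += revenue
        sum2 += revenue
    # Raises ZeroDivisionError when there are no rows (sum2 == 0).
    return 100.00 * sum1 / sum2


if __name__ == "__main__":
    # Invented rows: 100 of 400 total revenue comes from a PROMO part.
    sample = [
        ("PROMO BRUSHED STEEL", 100.0, 0.0),
        ("STANDARD POLISHED TIN", 300.0, 0.0),
    ]
    print(promo_revenue(sample))  # 25.0
```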
@LantaoJin can you please add a result printout from a run of the TPCHQueryITSuite here?
The current suite only checks that the generated code (Spark whole-stage codegen) for all 22 PPL queries compiles properly. The output shows the codegen time and execution time, which are not stable because they depend on hardware. So no input data is required for this PR; it is just for functionality coverage as a test suite.
Later I will create another issue for benchmarking.