Add TPC-H PPL query suite #830

Merged · 7 commits · Nov 7, 2024
3 changes: 2 additions & 1 deletion build.sbt
@@ -238,7 +238,8 @@ lazy val integtest = (project in file("integ-test"))
inConfig(IntegrationTest)(Defaults.testSettings ++ Seq(
IntegrationTest / javaSource := baseDirectory.value / "src/integration/java",
IntegrationTest / scalaSource := baseDirectory.value / "src/integration/scala",
IntegrationTest / parallelExecution := false,
IntegrationTest / resourceDirectory := baseDirectory.value / "src/integration/resources",
IntegrationTest / parallelExecution := false,
IntegrationTest / fork := true,
)),
inConfig(AwsIntegrationTest)(Defaults.testSettings ++ Seq(
4 changes: 4 additions & 0 deletions docs/ppl-lang/README.md
@@ -104,6 +104,10 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).
### Example PPL Queries
See samples of [PPL queries](PPL-Example-Commands.md)

---
### TPC-H PPL Query Rewriting
See samples of [TPC-H PPL query rewriting](ppl-tpch.md)

---
### Planned PPL Commands

102 changes: 102 additions & 0 deletions docs/ppl-lang/ppl-tpch.md
@@ -0,0 +1,102 @@
## TPC-H Benchmark

TPC-H is a decision support benchmark that evaluates how well database systems handle complex, business-oriented queries and concurrent data modifications. Its dataset is broadly representative of many industries, which makes the benchmark widely applicable. TPC-H simulates a decision support environment in which large volumes of data are analyzed, intricate queries are executed, and critical business questions are answered.

### Test PPL Queries

The 22 TPC-H test query statements: [TPCH-Query-PPL](https://github.com/opensearch-project/opensearch-spark/blob/main/integ-test/src/integration/resources/tpch)

### Data Preparation

#### Option 1 - from PyPi

```
# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

pip install tpch-datagen
```

#### Option 2 - from source

```
git clone https://github.com/gizmodata/tpch-datagen

cd tpch-datagen

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel

# Install TPC-H Datagen - in editable mode with client and dev dependencies
pip install --editable .[dev]
```

#### Usage

Here are the options for the `tpch-datagen` command:
```
tpch-datagen --help
Usage: tpch-datagen [OPTIONS]

Options:
--version / --no-version Prints the TPC-H Datagen package version and
exits. [required]
--scale-factor INTEGER The TPC-H Scale Factor to use for data
generation.
--data-directory TEXT The target output data directory to put the
files into [default: data; required]
--work-directory TEXT The work directory to use for data
generation. [default: /tmp; required]
--overwrite / --no-overwrite Can we overwrite the target directory if it
already exists... [default: no-overwrite;
required]
--num-chunks INTEGER The number of chunks that will be generated
- more chunks equals smaller memory
requirements, but more files generated.
[default: 10; required]
--num-processes INTEGER The maximum number of processes for the
multi-processing pool to use for data
generation. [default: 10; required]
--duckdb-threads INTEGER The number of DuckDB threads to use for data
generation (within each job process).
[default: 1; required]
--per-thread-output / --no-per-thread-output
Controls whether to write the output to a
single file or multiple files (for each
process). [default: per-thread-output;
required]
--compression-method [none|snappy|gzip|zstd]
The compression method to use for the
parquet files generated. [default: zstd;
required]
--file-size-bytes TEXT The target file size for the parquet files
generated. [default: 100m; required]
--help Show this message and exit.
```

### Generate 1 GB of data with zstd compression (the default)

```
tpch-datagen --scale-factor 1
```

### Generate 10 GB of data with snappy compression

```
tpch-datagen --scale-factor 10 --compression-method snappy
```

### Query Test

All TPC-H PPL queries are located in the `integ-test/src/integration/resources/tpch` folder.

To test all queries, run `org.opensearch.flint.spark.ppl.tpch.TPCHQueryITSuite`.
Member
@LantaoJin can you please add a result printout of a TPCHQueryITSuite run here?

Member Author @LantaoJin, Nov 7, 2024

The current suite only checks that the generated code (Spark whole-stage codegen) for all 22 PPL queries compiles properly. The output shows the codegen time and execution time, but those are not stable since they depend on hardware, so no input data is required for this PR. This PR only provides functionality coverage as a test suite.
Later I will create another issue for benchmarking.

35 changes: 35 additions & 0 deletions integ-test/src/integration/resources/tpch/q1.ppl
@@ -0,0 +1,35 @@
/*
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-12-01' - interval '90' day
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus
*/

source = lineitem
| where l_shipdate <= subdate(date('1998-12-01'), 90)
| stats sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count() as count_order
by l_returnflag, l_linestatus
| sort l_returnflag, l_linestatus
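The only non-obvious part of this rewrite is the date filter: SQL's `date '1998-12-01' - interval '90' day` becomes PPL's `subdate(date('1998-12-01'), 90)`. A minimal Python sketch of the day-based arithmetic both expressions perform (illustrative only, not the engine's implementation):

```python
from datetime import date, timedelta

def subdate(d: date, days: int) -> date:
    """Day-based subtraction, mirroring both SQL's interval '90' day
    and PPL's subdate()."""
    return d - timedelta(days=days)

cutoff = subdate(date(1998, 12, 1), 90)
print(cutoff)  # 1998-09-02
```

Both forms subtract exactly 90 calendar days, so the rewrite preserves the filter boundary.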
45 changes: 45 additions & 0 deletions integ-test/src/integration/resources/tpch/q10.ppl
@@ -0,0 +1,45 @@
/*
select
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1993-10-01'
and o_orderdate < date '1993-10-01' + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
limit 20
*/

source = customer
| join ON c_custkey = o_custkey orders
| join ON l_orderkey = o_orderkey lineitem
| join ON c_nationkey = n_nationkey nation
| where o_orderdate >= date('1993-10-01')
AND o_orderdate < date_add(date('1993-10-01'), interval 3 month)
AND l_returnflag = 'R'
| stats sum(l_extendedprice * (1 - l_discount)) as revenue by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
| sort - revenue
| head 20
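Here the month interval is rewritten with `date_add(..., interval 3 month)`, which is calendar-month addition rather than a fixed number of days. A Python sketch of that arithmetic, assuming first-of-month dates as in this query (day clamping for month ends is not handled):

```python
from datetime import date

def add_months(d: date, months: int) -> date:
    """Calendar-month addition for first-of-month dates, mirroring
    SQL's date '1993-10-01' + interval '3' month."""
    total = d.month - 1 + months
    return d.replace(year=d.year + total // 12, month=total % 12 + 1)

print(add_months(date(1993, 10, 1), 3))  # 1994-01-01
```

Note that three months from 1993-10-01 crosses a year boundary, which day-based arithmetic (90 days) would get wrong.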
45 changes: 45 additions & 0 deletions integ-test/src/integration/resources/tpch/q11.ppl
@@ -0,0 +1,45 @@
/*
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
group by
ps_partkey having
sum(ps_supplycost * ps_availqty) > (
select
sum(ps_supplycost * ps_availqty) * 0.0001000000
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
)
order by
value desc
*/

source = partsupp
| join ON ps_suppkey = s_suppkey supplier
| join ON s_nationkey = n_nationkey nation
| where n_name = 'GERMANY'
| stats sum(ps_supplycost * ps_availqty) as value by ps_partkey
| where value > [
source = partsupp
| join ON ps_suppkey = s_suppkey supplier
| join ON s_nationkey = n_nationkey nation
| where n_name = 'GERMANY'
| stats sum(ps_supplycost * ps_availqty) as check
| eval threshold = check * 0.0001000000
| fields threshold
]
| sort - value
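The SQL `HAVING` clause with a scalar subquery becomes a bracketed PPL subsearch whose single `threshold` value feeds the outer `where`. A Python sketch of those semantics on invented toy values (not TPC-H data):

```python
# Toy rows: (ps_partkey, ps_supplycost * ps_availqty) for German suppliers.
# Values are made up for illustration.
rows = [(1, 500.0), (1, 700.0), (2, 30.0), (3, 900.0)]

# Inner subsearch: a single scalar, the global sum scaled by 0.0001.
threshold = sum(v for _, v in rows) * 0.0001000000

# Outer query: per-part sums, kept only when above the threshold,
# sorted by value descending.
by_part: dict[int, float] = {}
for part, v in rows:
    by_part[part] = by_part.get(part, 0.0) + v

result = sorted(
    ((p, v) for p, v in by_part.items() if v > threshold),
    key=lambda pv: -pv[1],
)
print(result)  # [(1, 1200.0), (3, 900.0), (2, 30.0)]
```

The key point is that the subsearch is evaluated once and its single value is compared against every group of the outer aggregation.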
42 changes: 42 additions & 0 deletions integ-test/src/integration/resources/tpch/q12.ppl
@@ -0,0 +1,42 @@
/*
select
l_shipmode,
sum(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('MAIL', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= date '1994-01-01'
and l_receiptdate < date '1994-01-01' + interval '1' year
group by
l_shipmode
order by
l_shipmode
*/

source = orders
| join ON o_orderkey = l_orderkey lineitem
| where l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_shipmode in ('MAIL', 'SHIP')
and l_receiptdate >= date('1994-01-01')
and l_receiptdate < date_add(date('1994-01-01'), interval 1 year)
| stats sum(case(o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH', 1 else 0)) as high_line_count,
sum(case(o_orderpriority != '1-URGENT' and o_orderpriority != '2-HIGH', 1 else 0)) as low_line_count
by l_shipmode
| sort l_shipmode
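PPL's `case(cond, 1 else 0)` inside `sum()` plays the same role as SQL's `CASE` expression here: a conditional count. A small Python illustration on invented priorities:

```python
# Toy order priorities for lineitems that survived the date/shipmode
# filters (values invented for illustration).
priorities = ['1-URGENT', '3-MEDIUM', '2-HIGH', '5-LOW', '1-URGENT']

# sum(case(..., 1 else 0)) is a conditional count.
high_line_count = sum(1 if p in ('1-URGENT', '2-HIGH') else 0 for p in priorities)
low_line_count = sum(1 if p not in ('1-URGENT', '2-HIGH') else 0 for p in priorities)

print(high_line_count, low_line_count)  # 3 2
```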
31 changes: 31 additions & 0 deletions integ-test/src/integration/resources/tpch/q13.ppl
@@ -0,0 +1,31 @@
/*
select
c_count,
count(*) as custdist
from
(
select
c_custkey,
count(o_orderkey) as c_count
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey
) as c_orders
group by
c_count
order by
custdist desc,
c_count desc
*/

source = [
source = customer
| left outer join ON c_custkey = o_custkey AND not like(o_comment, '%special%requests%')
orders
| stats count(o_orderkey) as c_count by c_custkey
] as c_orders
| stats count() as custdist by c_count
| sort - custdist, - c_count
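Folding the `NOT LIKE` predicate into the left-join condition matters: customers with no surviving orders must still appear with a count of 0. A Python sketch of the join-then-aggregate semantics on invented toy rows:

```python
import re

# Toy customers and orders (values invented for illustration);
# customer 3 has no orders, and order 102 is excluded by the comment filter.
customers = [1, 2, 3]
orders = [
    (101, 1, 'carefully packed'),
    (102, 1, 'special handling requests'),
    (103, 2, 'no notes'),
]

# Left outer join with NOT LIKE '%special%requests%' folded into the join
# condition: customers with no matching order still appear, with count 0.
c_count = {c: 0 for c in customers}
for _okey, cust, comment in orders:
    if cust in c_count and not re.search(r'special.*requests', comment):
        c_count[cust] += 1

# Outer aggregation: how many customers have each order count.
custdist: dict[int, int] = {}
for n in c_count.values():
    custdist[n] = custdist.get(n, 0) + 1

print(sorted(custdist.items(), key=lambda kv: (-kv[1], -kv[0])))
# [(1, 2), (0, 1)]
```

Had the comment filter been applied in a `where` after an inner join, customer 3 (and the zero bucket of `custdist`) would be lost.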
25 changes: 25 additions & 0 deletions integ-test/src/integration/resources/tpch/q14.ppl
@@ -0,0 +1,25 @@
/*
select
100.00 * sum(case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= date '1995-09-01'
and l_shipdate < date '1995-09-01' + interval '1' month
*/

source = lineitem
| join ON l_partkey = p_partkey
AND l_shipdate >= date('1995-09-01')
AND l_shipdate < date_add(date('1995-09-01'), interval 1 month)
part
| stats sum(case(like(p_type, 'PROMO%'), l_extendedprice * (1 - l_discount) else 0)) as sum1,
sum(l_extendedprice * (1 - l_discount)) as sum2
| eval promo_revenue = 100.00 * sum1 / sum2 // stats and eval can be combined once issues/819 is resolved
| fields promo_revenue
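Because stats and eval cannot yet be combined, the ratio is computed as two aggregates (`sum1`, `sum2`) followed by an `eval`. A Python sketch of the same computation on invented rows:

```python
# Toy joined lineitem/part rows (invented): (p_type, l_extendedprice, l_discount)
rows = [
    ('PROMO BRUSHED STEEL', 1000.0, 0.10),
    ('STANDARD POLISHED COPPER', 500.0, 0.00),
    ('PROMO ANODIZED TIN', 200.0, 0.05),
]

# sum1: discounted revenue from PROMO parts only (the CASE ... ELSE 0 branch);
# sum2: discounted revenue over all parts.
sum1 = sum(p * (1 - d) for t, p, d in rows if t.startswith('PROMO'))
sum2 = sum(p * (1 - d) for t, p, d in rows)

# The eval step: the promo share as a percentage.
promo_revenue = 100.00 * sum1 / sum2
print(round(promo_revenue, 2))  # 68.55
```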