Run TPC-DS against different databases including Hive, Spark SQL and IBM BigSQL
https://github.com/stanislawbartkowski/mytpcds/wiki
Test results: https://github.com/stanislawbartkowski/mytpcds/wiki/TPC-DS-BigData-results
TPC-DS versions tested:
- v2.13.0rc1
- DSGen-software-code-3.2.0rc1
TPC-DS is an objective tool for measuring and comparing different database systems. The same set of data and non-trivial queries can be loaded and executed, giving an insight into how each database responds to the workload. Having an expected result based on experience, the tool can also be used for testing and tuning a newly installed or upgraded database. However, using TPC-DS is not easy out of the box and requires a number of manual tasks. So I decided to prepare an automated tool for the job: just download the TPC-DS files, configure and run.
Unpack the compressed zip file in a working directory. The following directory structure is created:
- v2.10.1rc3/v2.11.0rc2/v2.13.0rc1
  - answer_sets
  - EULA.txt
  - query_templates
  - query_variants
  - specification
  - tests
  - tools
For the purpose of the test, the following additional directories should be created:
- v2.10.1rc3/v2.11.0rc2
  - work
    - data (will contain the input data)
    - db2queries (queries and results related to the corresponding database)
    - mysqlqueries
    - (etc.)
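These directories can be created, for example, with:
```bash
cd v2.10.1rc3   # or v2.11.0rc2 / v2.13.0rc1
mkdir -p work/data work/db2queries work/mysqlqueries
```
Create one {database}queries directory per database under test.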
cd v2.10.1rc3/tools (or v2.11.0rc2/tools, v2.13.0rc1/tools)
make
Executable files are created. To create an input data set, run the dsdgen command. The -sc parameter sets the scale factor, i.e. the approximate size of the generated data in GB.
./dsdgen -dir ../work/data -sc 100
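dsdgen writes one pipe-delimited text file per TPC-DS table, so a quick sanity check (illustrative) is:
```bash
ls ../work/data                        # expect one .dat file per table
head -1 ../work/data/call_center.dat   # rows are pipe-delimited
```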
The following database servers are supported: PostgreSQL, MySQL (MariaDB), Oracle, DB2, Netezza, Hive, SparkSQL and IBM BigSQL. The tool consists of several simple bash and awk script files and does not require any additional dependencies.
File | Description | Wiki |
---|---|---|
db/db2proc.sh | Implementation for DB2 and BigSQL | https://github.com/stanislawbartkowski/mytpcds/wiki/DB2 https://github.com/stanislawbartkowski/mytpcds/wiki/IBM-BigSQL |
db/hiveproc.sh | Implementation for Hive and SparkSQL Thrive | https://github.com/stanislawbartkowski/mytpcds/wiki/SparkSQL-Thrive |
db/netezzaproc.sh | Implementation for Netezza | https://github.com/stanislawbartkowski/mytpcds/wiki/Netezza |
db/phoenixproc.sh | Implementation for HBase Phoenix (not working) | |
db/mysqlproc.sh | Implementation for MySQL/MariaDB | https://github.com/stanislawbartkowski/mytpcds/wiki/MySQL |
db/oracleproc.sh | Implementation for Oracle | https://github.com/stanislawbartkowski/mytpcds/wiki/Oracle |
db/psqlproc.sh | Implementation for PostgreSQL | https://github.com/stanislawbartkowski/mytpcds/wiki/PostgreSQL |
db/sparksqlproc.sh | Implementation for SparkSQL | https://github.com/stanislawbartkowski/mytpcds/wiki/SparkSQL-Thrive |
db/jsqshproc.sh | Alternative solution for BigSQL, jsqsh | https://github.com/stanislawbartkowski/mytpcds/tree/master/RunQueries |
proc | Several supporting bash and awk scripts | https://github.com/stanislawbartkowski/mytpcds/tree/master/proc |
RunQueries | Java tool to run queries using a JDBC connection | https://github.com/stanislawbartkowski/mytpcds/tree/master/RunQueries |
ptest.sh | Starter for Throughput Test | |
res | Expected result sets for Qualify Test | |
run.sh | Launching script file | |
tpc.sh | Main tpc-ds test runner | |
env | Configuration files for databases | |
transf.awk | AWK script file used to transform the results | |
env.templates | Templates for configuration files | |
sh.templates | Templates for launch scripts | |
res | Reference data set for Qualify Test | |
qualification | Query parameters used to run the Qualify Test | |
The queries are executed by the appropriate command-line tool. However, to pass the Qualify Test, the output must match the reference answer result set. Because every tool comes with its own output format, it is not easy to find a common denominator for all databases. So I developed a simple Java QueryRunner that produces the same output format regardless of the database.
Build the Java QueryRunner:
cd RunQueries
mvn package
The following directory structure should be created.
ll target/
archive-tmp
classes
generated-sources
lib
maven-archiver
maven-status
RunQueries-1.0-SNAPSHOT.jar
RunQueries-1.0-SNAPSHOT-jar-with-dependencies.jar
In order to use QueryRunner in the test, the DBURL and JAVADRIVER configuration parameters must be specified.
Copy the bash scripts from sh.templates to the main mytpcds directory.
cd mytpcds
cp sh.templates/* .
Create a directory for the database connection specifications.
mkdir env
Copy the relevant properties file from env.templates to env and configure it according to your environment.
For instance, to run the Hive TPC-DS test:
cp env.templates/hive.rc env
The common configuration parameters:
Parameter | Value | Example |
---|---|---|
DBNAME | Database name | PERFDB |
DBUSER | Database user, should have full privileges in the database | perf |
DBPASSWORD | Database password | secret |
DBHOST | Database host name | netezza.fyre.ibm.com |
DTYPE | Database identifier, should correspond to db/${DTYPE}proc.sh | netezza, points to db/netezzaproc.sh |
DBURL | JDBC URL to connect to database | jdbc:netezza://$DBHOST:5480/$DBNAME |
JAVADRIVER | JDBC Java driver jar file | /usr/local/nz/lib/nzjdbc3.jar |
DBURL and JAVADRIVER need to be specified only if QueryRunner is going to be used. If only the database client software is used, these parameters can be omitted.
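For example, a Netezza resource file assembled from the sample values above could look as follows (an illustrative sketch; adjust every value to your environment):
```bash
# env/netezza -- example connection specification
export DBNAME=PERFDB
export DBUSER=perf
export DBPASSWORD=secret
export DBHOST=netezza.fyre.ibm.com
export DTYPE=netezza                                # resolves to db/netezzaproc.sh
export DBURL=jdbc:netezza://$DBHOST:5480/$DBNAME    # only needed for QueryRunner
export JAVADRIVER=/usr/local/nz/lib/nzjdbc3.jar     # only needed for QueryRunner
```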
Logs are created in the directory pointed to by the TEMPDIR variable.
export TEMPDIR=/tmp/tpcds
Example:
cat /tmp/tpcds/db2log/mytcpds.log
SQL30082N Security processing failed with reason "24" ("USERNAME AND/OR
PASSWORD INVALID"). SQLSTATE=08001
Cannot connect to DB2
Exit immediately
Not all queries are ready to execute out of the box. The TPC-DS specification allows minor alterations of the queries to make them runnable (section 4.2.3.1):
> It is recognized that implementations require specific adjustments for their operating environment and the syntactic variations of its dialect of the SQL language
To avoid keeping a different version of the queries for every database, I decided to make the amendments on the fly. Most changes are related to date arithmetic (adding or subtracting a number of days) or to table aliases. The changes are limited to basic text substitution without touching the query logic.
The changes are implemented in the https://github.com/stanislawbartkowski/mytpcds/blob/master/tpc.sh script file, in the runsinglequery bash function.
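As an illustration only (the pattern below is hypothetical and not one of the actual rules in runsinglequery), such a substitution could look like this:
```bash
# Hypothetical sketch: rewrite ANSI "+ N days" date arithmetic into MySQL's
# DATE_ADD() before handing the query text to the client tool.
fixquery() {
    case "$DTYPE" in
    mysql)
        sed -E "s/\(cast\('([0-9-]+)' as date\) \+ ([0-9]+) days\)/DATE_ADD('\1', INTERVAL \2 DAY)/g" "$1"
        ;;
    *)
        cat "$1"    # other dialects: pass the query through unchanged
        ;;
    esac
}
```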
After that, I ended up with the following query coverage.
Database | Coverage |
---|---|
DB2 | 100% |
Oracle | 100% |
MySQL/MariaDB | 87% |
PostgreSQL | 97% |
Hive 2.1 | 49% |
SparkSQL 2.3 | 94% |
Netezza/NPS | 95% |
IBM BigSQL | 100% |
The test run is controlled by the following environment variables:
Variable name | Description | Default/sample value |
---|---|---|
TEMPDIR | Temporary directory for log files and temp files | /tmp/mytpcds |
TCPROOT | Root directory for the unpacked TPC-DS payload | /home/sbartkowski/work/v2.10.0rc2 |
ENV | Resource file for a database under test | env/bigsql |
TESTDATA | TPC-DS table used for the test loading phase | call_center |
TESTQUERY | Number of the query used for the single-query test | 04 |
DONOTVERIFY | If empty, run Test Validation (QUALIFY). If not empty, ignore Test Validation | X (not empty) |
QUERYTIMEOUT | Query execution time threshold; passed to the timeout command | 5s (limit is 5 seconds) |
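Put together, using the sample values from the table (a sketch, not a required setup):
```bash
export TEMPDIR=/tmp/mytpcds
export TCPROOT=/home/sbartkowski/work/v2.10.0rc2
export ENV=env/bigsql
export TESTDATA=call_center
export TESTQUERY=04
export DONOTVERIFY=X      # skip Test Validation (QUALIFY)
export QUERYTIMEOUT=5s    # abort any query running longer than 5 seconds
```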
Prepare the server, the client and the connection. https://github.com/stanislawbartkowski/mytpcds/wiki contains a lot of useful information.
In the conn.rc file, uncomment the property file appropriate for the particular database and modify it according to the environment.
https://github.com/stanislawbartkowski/mytpcds/blob/master/run.sh
For instance, for Oracle:
#export ENV=env/db2
#export ENV=env/bigsql
export ENV=env/oracle
#export ENV=env/hive
#export ENV=env/postgresql
#export ENV=env/sparksql
#export ENV=env/thrive
#export ENV=env/phoenix
#export ENV=env/mysql
#export ENV=env/netezza
The qualification database is 1 GB in size.
./dsdgen -dir ../work/data -sc 1
Qualify queries are used to validate the SQL SELECT statements. They contain hardcoded parameter values and, when executed against the qualification data set, should yield the same result set as the reference answer. The run.sh queryqualification task creates a series of qualify queries. Every query template has a corresponding parameter definition in the qualification directory. For instance, the query template query_templates/query4.tpl has the corresponding parameter set qualification/4.par:
YEAR=2001
SELECTONE=t_s_secyear.customer_preferred_cust_flag
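Conceptually, these values are substituted into the placeholders of the template by the qgen2 tool; a rough bash illustration (the actual placeholder syntax and substitution logic belong to qgen2, so the sed below is only an approximation):
```bash
# Approximate the effect of qgen2 for query 4: inject the parameter values
sed -e "s/\[YEAR\]/2001/g" \
    -e "s/\[SELECTONE\]/t_s_secyear.customer_preferred_cust_flag/g" \
    ../query_templates/query4.tpl
```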
To prepare the qualification queries:
- Configure conn.rc and the env/{db} resource file
- In the run.sh file, uncomment the ./tpc.sh queryqualification line and comment out all other ./tpc.sh lines
- Run ./run.sh
qgen2 Query Generator (Version 2.11.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2019
Warning: This scale factor is valid for QUALIFICATION ONLY
Parsed 99 templates
PASSED
Uncomment the ./tpc.sh test line in the run.sh file:
./tpc.sh test
#./tpc.sh removedata
#./tpc.sh createtables
#./tpc.sh loadtest
#./tpc.sh testverify
#./tpc.sh testquery
#./tpc.sh loaddata
#./tpc.sh verifyload
#./tpc.sh runqueries
Execute ./run.sh. If the connection is working, the output should be:
./run.sh
PASSED
If the connection is not configured properly, examine the output log file. The log directory is specified as LOGDIR in the resource file.
tail -f /tmp/mytpcds/oraclelog/mytcp.log
Uncomment the ./tpc.sh createtables line and run the script:
#./tpc.sh test
#./tpc.sh removedata
./tpc.sh createtables
#./tpc.sh loadtest
#./tpc.sh testverify
#./tpc.sh testquery
#./tpc.sh loaddata
#./tpc.sh verifyload
#./tpc.sh runqueries
Uncomment the ./tpc.sh loadtest line and run the script:
export TESTDATA=call_center
#./tpc.sh test
#./tpc.sh removedata
#./tpc.sh createtables
./tpc.sh loadtest
#./tpc.sh testverify
#./tpc.sh testquery
#./tpc.sh loaddata
#./tpc.sh verifyload
#./tpc.sh runqueries
The test compares the number of lines in the input text file with the number of rows in the target table.
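Conceptually, the check boils down to something like this (a sketch assuming a PostgreSQL client; the real logic lives in tpc.sh and the db/*proc.sh scripts):
```bash
# Compare the input line count with the row count of the loaded table
lines=$(wc -l < "$TCPROOT/work/data/call_center.dat")
rows=$(psql -h "$DBHOST" -U "$DBUSER" -d "$DBNAME" -t -A -c "SELECT COUNT(*) FROM call_center")
[ "$lines" -eq "$rows" ] && echo PASSED || echo FAILED
```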
Uncomment the ./tpc.sh testverify line and run the script:
export TESTDATA=call_center
#./tpc.sh test
#./tpc.sh removedata
#./tpc.sh createtables
#./tpc.sh loadtest
./tpc.sh testverify
#./tpc.sh testquery
#./tpc.sh loaddata
#./tpc.sh verifyload
#./tpc.sh runqueries
Uncomment the ./tpc.sh loaddata line and run the script:
#./tpc.sh test
#./tpc.sh removedata
#./tpc.sh createtables
#./tpc.sh loadtest
#./tpc.sh testverify
#./tpc.sh testquery
./tpc.sh loaddata
#./tpc.sh verifyload
#./tpc.sh runqueries
This test compares the number of lines in the input text files against the row counts of the corresponding target tables.
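Extended to all tables, the idea is (again a sketch, under the same PostgreSQL assumption):
```bash
# Verify every loaded table against its input file
for f in "$TCPROOT"/work/data/*.dat; do
    t=$(basename "$f" .dat)
    lines=$(wc -l < "$f")
    rows=$(psql -h "$DBHOST" -U "$DBUSER" -d "$DBNAME" -t -A -c "SELECT COUNT(*) FROM $t")
    printf '%-24s lines=%s rows=%s\n' "$t" "$lines" "$rows"
done
```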
Uncomment the ./tpc.sh verifyload line and run the script:
#./tpc.sh test
#./tpc.sh removedata
#./tpc.sh createtables
#./tpc.sh loadtest
#./tpc.sh testverify
#./tpc.sh testquery
#./tpc.sh loaddata
./tpc.sh verifyload
#./tpc.sh runqueries
Run a single query as a test.
Uncomment the ./tpc.sh testquery line and run the script:
export TESTQUERY=98
#./tpc.sh test
#./tpc.sh removedata
#./tpc.sh createtables
#./tpc.sh loadtest
#./tpc.sh testverify
./tpc.sh testquery
#./tpc.sh loaddata
#./tpc.sh verifyload
#./tpc.sh runqueries
Uncomment the ./tpc.sh runqueries line and run the script:
#./tpc.sh test
#./tpc.sh removedata
#./tpc.sh createtables
#./tpc.sh loadtest
#./tpc.sh testverify
#./tpc.sh testquery
#./tpc.sh loaddata
#./tpc.sh verifyload
./tpc.sh runqueries
The test result (the queries and their execution times) is stored in the file TCPROOT/work/{database}queries/{database}.result.
Important: every runqueries run truncates the result file. In order to keep historical results, make a manual copy of the result file.
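For example (the DB2 path follows the pattern above; a sketch only):
```bash
# Archive the previous result with a timestamp before the next run truncates it
cp "$TCPROOT/work/db2queries/db2.result" \
   "$TCPROOT/work/db2queries/db2.result.$(date +%Y%m%d-%H%M%S)"
```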
- Prepare an appropriate data set using the dsdgen utility
- loaddata
- verifyload (just in case)
- runqueries
- Pick up the test result
(TPC-DS specification: section 7.4.6)
The Throughput Test measures the ability of the database engine to process queries from multiple concurrent users.
Assume the minimal number of concurrent streams: 4.
A separate script, ptest.sh, should be launched to emulate each session. ptest.sh accepts a single parameter, a number 0-3, identifying the stream. For example, for stream number 3:
./ptest.sh 3
Generate the query streams with the dsqgen utility:
./dsqgen -VERBOSE Y -DIRECTORY ../query_templates -INPUT ../query_templates/templates.lst -OUTPUT_DIR ../work/{directory} -DIALECT {dialect} -STREAMS 4 -sc 100
Example for DB2:
./dsqgen -VERBOSE Y -DIRECTORY ../query_templates -INPUT ../query_templates/templates.lst -OUTPUT_DIR ../work/db2queries -DIALECT db2 -STREAMS 4 -sc 100
The command prepares four streams of queries in the ../work/db2queries directory:
query_0.sql
query_1.sql
query_2.sql
query_3.sql
The next step is to launch four ptest.sh sessions, with parameters 0, 1, 2 and 3, running in parallel. Every session runs the appropriate list of queries and produces a separate result report. For instance:
./ptest.sh 2
It executes the query_2.sql query set and writes the result to work/db2queries/db2sql.result2.
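To start all four streams in parallel from a single shell (a minimal sketch):
```bash
# Start one ptest.sh session per stream and wait until all of them finish
for s in 0 1 2 3; do
    ./ptest.sh "$s" &
done
wait
```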
When all sessions complete, evaluate the results in work/db2queries.