ORC-1578: Fix `SparkBenchmark` on `sales` data according to SPARK-40918 #1734

dongjoon-hyun · 2024-01-09T03:00:31Z

What changes were proposed in this pull request?

This PR aims to fix SparkBenchmark according to the requirement of SPARK-40918.

Note that this fixes only the synthetic benchmark on Sales data. The real-life datasets (github and taxi) will be revisited later.

Why are the changes needed?

1. Generate Sales data

$ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 1000000

2. Run Spark Benchmark

$ java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -d sales -f orc
# Run complete. Total time: 00:10:45

Benchmark                                  (compression)  (dataset)  (format)  Mode  Cnt        Score       Error  Units
SparkBenchmark.fullRead                               gz      sales       orc  avgt    5   686792.235 ±  4398.971  us/op
SparkBenchmark.fullRead:bytesPerRecord                gz      sales       orc  avgt    5        0.192                  #
SparkBenchmark.fullRead:ops                           gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                     gz      sales       orc  avgt    5        0.687 ±     0.004  us/op
SparkBenchmark.fullRead:records                       gz      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.fullRead                           snappy      sales       orc  avgt    5   286166.380 ± 19864.429  us/op
SparkBenchmark.fullRead:bytesPerRecord            snappy      sales       orc  avgt    5        0.201                  #
SparkBenchmark.fullRead:ops                       snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                 snappy      sales       orc  avgt    5        0.286 ±     0.020  us/op
SparkBenchmark.fullRead:records                   snappy      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.fullRead                             zstd      sales       orc  avgt    5   384394.233 ± 10057.315  us/op
SparkBenchmark.fullRead:bytesPerRecord              zstd      sales       orc  avgt    5        0.192                  #
SparkBenchmark.fullRead:ops                         zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                   zstd      sales       orc  avgt    5        0.384 ±     0.010  us/op
SparkBenchmark.fullRead:records                     zstd      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                            gz      sales       orc  avgt    5    41683.914 ±  4046.077  us/op
SparkBenchmark.partialRead:bytesPerRecord             gz      sales       orc  avgt    5        0.192                  #
SparkBenchmark.partialRead:ops                        gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord                  gz      sales       orc  avgt    5        0.042 ±     0.004  us/op
SparkBenchmark.partialRead:records                    gz      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                        snappy      sales       orc  avgt    5    23981.054 ± 17874.229  us/op
SparkBenchmark.partialRead:bytesPerRecord         snappy      sales       orc  avgt    5        0.201                  #
SparkBenchmark.partialRead:ops                    snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord              snappy      sales       orc  avgt    5        0.024 ±     0.018  us/op
SparkBenchmark.partialRead:records                snappy      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                          zstd      sales       orc  avgt    5    41433.277 ± 25110.021  us/op
SparkBenchmark.partialRead:bytesPerRecord           zstd      sales       orc  avgt    5        0.192                  #
SparkBenchmark.partialRead:ops                      zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord                zstd      sales       orc  avgt    5        0.041 ±     0.025  us/op
SparkBenchmark.partialRead:records                  zstd      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.pushDown                               gz      sales       orc  avgt    5    23760.997 ±   833.034  us/op
SparkBenchmark.pushDown:bytesPerRecord                gz      sales       orc  avgt    5       19.153                  #
SparkBenchmark.pushDown:ops                           gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                     gz      sales       orc  avgt    5        2.376 ±     0.083  us/op
SparkBenchmark.pushDown:records                       gz      sales       orc  avgt    5    50000.000                  #
SparkBenchmark.pushDown                           snappy      sales       orc  avgt    5    14062.508 ±  1793.691  us/op
SparkBenchmark.pushDown:bytesPerRecord            snappy      sales       orc  avgt    5       20.105                  #
SparkBenchmark.pushDown:ops                       snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                 snappy      sales       orc  avgt    5        1.406 ±     0.179  us/op
SparkBenchmark.pushDown:records                   snappy      sales       orc  avgt    5    50000.000                  #
SparkBenchmark.pushDown                             zstd      sales       orc  avgt    5    15597.651 ±  1307.246  us/op
SparkBenchmark.pushDown:bytesPerRecord              zstd      sales       orc  avgt    5       19.213                  #
SparkBenchmark.pushDown:ops                         zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                   zstd      sales       orc  avgt    5        1.560 ±     0.131  us/op
SparkBenchmark.pushDown:records                     zstd      sales       orc  avgt    5    50000.000                  #
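As a reading aid for the table above, here is a minimal sketch that re-derives the `perRecord` column from the primary `Score` and the `records` counter, and computes the relative size of the `Error` column. The relationship used (`perRecord = Score / (records / Cnt)`, meaning `records` is summed over the 5 measurement iterations) is an inference from the reported numbers, not something taken from the benchmark source. The helper names `parse_row`, `per_record`, and `relative_error` are hypothetical.

```python
# A few primary rows copied from the JMH output above.
ROWS = """\
SparkBenchmark.fullRead     gz      sales  orc  avgt  5  686792.235 ±  4398.971  us/op
SparkBenchmark.fullRead     snappy  sales  orc  avgt  5  286166.380 ± 19864.429  us/op
SparkBenchmark.partialRead  snappy  sales  orc  avgt  5   23981.054 ± 17874.229  us/op
SparkBenchmark.pushDown     gz      sales  orc  avgt  5   23760.997 ±   833.034  us/op
"""

# `records` values copied from the matching secondary rows of the table.
RECORDS = {"fullRead": 5_000_000, "partialRead": 5_000_000, "pushDown": 50_000}


def parse_row(line):
    """Split one primary JMH result row into its named fields."""
    parts = line.split()
    return {
        "name": parts[0].split(".")[1],  # e.g. "fullRead"
        "compression": parts[1],
        "cnt": int(parts[5]),            # number of measurement iterations
        "score": float(parts[6]),        # average time in us/op
        "error": float(parts[8]),        # parts[7] is the "±" sign
    }


def per_record(row):
    """Assumed relationship: records is summed over all `cnt` iterations."""
    records_per_op = RECORDS[row["name"]] / row["cnt"]
    return row["score"] / records_per_op


def relative_error(row):
    """Error margin as a fraction of the score."""
    return row["error"] / row["score"]


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
for r in rows:
    print(
        f"{r['name']:12s} {r['compression']:7s} "
        f"perRecord={per_record(r):.3f} us  relErr={relative_error(r):.1%}"
    )
```

Under that assumption the computed values agree with the reported `perRecord` rows (0.687, 0.286, 0.024, and 2.376 us/op). The relative error also shows that the `partialRead` snappy measurement is noisy, with an error margin of roughly 75% of its score, so small differences between codecs in that group should be read with care.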

How was this patch tested?

Pass the CIs.

Closes #1734 from dongjoon-hyun/ORC-1578.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit fbe49d7)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

ORC-1578: Fix SparkBenchmark according to SPARK-40918

20cb0b7

github-actions bot added the JAVA label Jan 9, 2024

dongjoon-hyun changed the title ~~ORC-1578: Fix SparkBenchmark according to SPARK-40918~~ ORC-1578: Fix SparkBenchmark on sales data according to SPARK-40918 Jan 9, 2024

dongjoon-hyun closed this in fbe49d7 Jan 9, 2024

dongjoon-hyun added this to the 1.7.11 milestone Jan 9, 2024

This was referenced Jan 9, 2024

ORC-1578: Fix SparkBenchmark on sales data according to SPARK-40918 #1737

Closed

ORC-1578: Fix SparkBenchmark on sales data according to SPARK-40918 #1736

Closed

ORC-1578: Fix SparkBenchmark on sales data according to SPARK-40918 #1735

Closed
