ORC-1578: Fix SparkBenchmark on sales data according to SPARK-40918 #1734

Closed
wants to merge 1 commit into from

@dongjoon-hyun dongjoon-hyun commented Jan 9, 2024

What changes were proposed in this pull request?

This PR aims to fix SparkBenchmark according to the requirement of SPARK-40918.

Note that this fixes only the synthetic benchmark on Sales data. The real-life datasets (github and taxi) will be revisited later.

Why are the changes needed?

  1. Generate Sales data
$ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 1000000
  2. Run Spark Benchmark
$ java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -d sales -f orc
# Run complete. Total time: 00:10:45

Benchmark                                  (compression)  (dataset)  (format)  Mode  Cnt        Score       Error  Units
SparkBenchmark.fullRead                               gz      sales       orc  avgt    5   686792.235 ±  4398.971  us/op
SparkBenchmark.fullRead:bytesPerRecord                gz      sales       orc  avgt    5        0.192                  #
SparkBenchmark.fullRead:ops                           gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                     gz      sales       orc  avgt    5        0.687 ±     0.004  us/op
SparkBenchmark.fullRead:records                       gz      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.fullRead                           snappy      sales       orc  avgt    5   286166.380 ± 19864.429  us/op
SparkBenchmark.fullRead:bytesPerRecord            snappy      sales       orc  avgt    5        0.201                  #
SparkBenchmark.fullRead:ops                       snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                 snappy      sales       orc  avgt    5        0.286 ±     0.020  us/op
SparkBenchmark.fullRead:records                   snappy      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.fullRead                             zstd      sales       orc  avgt    5   384394.233 ± 10057.315  us/op
SparkBenchmark.fullRead:bytesPerRecord              zstd      sales       orc  avgt    5        0.192                  #
SparkBenchmark.fullRead:ops                         zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                   zstd      sales       orc  avgt    5        0.384 ±     0.010  us/op
SparkBenchmark.fullRead:records                     zstd      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                            gz      sales       orc  avgt    5    41683.914 ±  4046.077  us/op
SparkBenchmark.partialRead:bytesPerRecord             gz      sales       orc  avgt    5        0.192                  #
SparkBenchmark.partialRead:ops                        gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord                  gz      sales       orc  avgt    5        0.042 ±     0.004  us/op
SparkBenchmark.partialRead:records                    gz      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                        snappy      sales       orc  avgt    5    23981.054 ± 17874.229  us/op
SparkBenchmark.partialRead:bytesPerRecord         snappy      sales       orc  avgt    5        0.201                  #
SparkBenchmark.partialRead:ops                    snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord              snappy      sales       orc  avgt    5        0.024 ±     0.018  us/op
SparkBenchmark.partialRead:records                snappy      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                          zstd      sales       orc  avgt    5    41433.277 ± 25110.021  us/op
SparkBenchmark.partialRead:bytesPerRecord           zstd      sales       orc  avgt    5        0.192                  #
SparkBenchmark.partialRead:ops                      zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord                zstd      sales       orc  avgt    5        0.041 ±     0.025  us/op
SparkBenchmark.partialRead:records                  zstd      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.pushDown                               gz      sales       orc  avgt    5    23760.997 ±   833.034  us/op
SparkBenchmark.pushDown:bytesPerRecord                gz      sales       orc  avgt    5       19.153                  #
SparkBenchmark.pushDown:ops                           gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                     gz      sales       orc  avgt    5        2.376 ±     0.083  us/op
SparkBenchmark.pushDown:records                       gz      sales       orc  avgt    5    50000.000                  #
SparkBenchmark.pushDown                           snappy      sales       orc  avgt    5    14062.508 ±  1793.691  us/op
SparkBenchmark.pushDown:bytesPerRecord            snappy      sales       orc  avgt    5       20.105                  #
SparkBenchmark.pushDown:ops                       snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                 snappy      sales       orc  avgt    5        1.406 ±     0.179  us/op
SparkBenchmark.pushDown:records                   snappy      sales       orc  avgt    5    50000.000                  #
SparkBenchmark.pushDown                             zstd      sales       orc  avgt    5    15597.651 ±  1307.246  us/op
SparkBenchmark.pushDown:bytesPerRecord              zstd      sales       orc  avgt    5       19.213                  #
SparkBenchmark.pushDown:ops                         zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                   zstd      sales       orc  avgt    5        1.560 ±     0.131  us/op
SparkBenchmark.pushDown:records                     zstd      sales       orc  avgt    5    50000.000                  #
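
As a quick sanity check on the table above, the `perRecord` rows appear to equal the `Score` (us/op) divided by the number of records per iteration, where records per iteration is the `records` counter divided by `Cnt`. That relationship is inferred from the reported numbers only; it is an assumption, not taken from the benchmark source. The class and method names below are hypothetical and exist only for this sketch.

```java
import java.util.Locale;

public class PerRecordCheck {
    /** Records per iteration, ASSUMING the `records` counter is a total over `cnt` iterations. */
    public static double recordsPerIteration(double totalRecords, int cnt) {
        return totalRecords / cnt;
    }

    /** Average microseconds per record: score (us/op) divided by records per iteration. */
    public static double perRecordMicros(double scoreMicrosPerOp, double totalRecords, int cnt) {
        return scoreMicrosPerOp / recordsPerIteration(totalRecords, cnt);
    }

    public static void main(String[] args) {
        // Inputs copied from the gz rows of the table above.
        double fullRead = perRecordMicros(686792.235, 5_000_000, 5);
        double pushDown = perRecordMicros(23760.997, 50_000, 5);
        System.out.println(String.format(Locale.ROOT, "fullRead gz: %.3f us/record", fullRead));
        System.out.println(String.format(Locale.ROOT, "pushDown gz: %.3f us/record", pushDown));
        // Ratio of the two `records` counters: pushDown reports far fewer records than fullRead.
        System.out.println(String.format(Locale.ROOT, "records ratio: %.2f", 50_000.0 / 5_000_000.0));
    }
}
```

Under that assumption the computed values round to the 0.687 and 2.376 us/op shown in the gz `perRecord` rows, and the `records` counters indicate that `pushDown` returns about 1% of the records that `fullRead` does.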

How was this patch tested?

Pass the CIs.

@github-actions github-actions bot added the JAVA label Jan 9, 2024
@dongjoon-hyun dongjoon-hyun changed the title from "ORC-1578: Fix SparkBenchmark according to SPARK-40918" to "ORC-1578: Fix SparkBenchmark on sales data according to SPARK-40918" Jan 9, 2024
dongjoon-hyun added a commit that referenced this pull request Jan 9, 2024
Closes #1734 from dongjoon-hyun/ORC-1578.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit fbe49d7)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun added this to the 1.7.11 milestone Jan 9, 2024
cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024