
Commit 8f898a7

Add h2o window benchmark (#16003)

* h2o-window benchmark
* Review: clarify h2o-window is an extended benchmark

1 parent 4efbeee commit 8f898a7

File tree

6 files changed: +218 -39 lines changed

benchmarks/README.md

Lines changed: 32 additions & 28 deletions

@@ -545,9 +545,16 @@ cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '...
 ```
 
 
-## h2o benchmarks for groupby
+## h2o.ai benchmarks
+The h2o.ai benchmarks are a set of performance tests for groupby and join operations. Beyond the standard h2o benchmark, there is also an extended benchmark for window functions. These benchmarks use synthetic data with configurable sizes (small: 1e7 rows, medium: 1e8 rows, big: 1e9 rows) to evaluate DataFusion's performance across different data scales.
 
-### Generate data for h2o benchmarks
+Reference:
+- [H2O AI Benchmark](https://duckdb.org/2023/04/14/h2oai.html)
+- [Extended window benchmark](https://duckdb.org/2024/06/26/benchmarks-over-time.html#window-functions-benchmark)
+
+### h2o benchmarks for groupby
+
+#### Generate data for h2o benchmarks
 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.
 
 1. Generate small data (1e7 rows)
@@ -567,7 +574,7 @@ There are three options for generating data for h2o benchmarks: `small`, `medium
 ./bench.sh data h2o_big
 ```
 
-### Run h2o benchmarks
+#### Run h2o benchmarks
 There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
 1. Run small data benchmark
 ```bash
@@ -591,49 +598,46 @@ For example, to run query 1 with the small data generated above:
 cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --query 1
 ```
 
-## h2o benchmarks for join
+### h2o benchmarks for join
 
-### Generate data for h2o benchmarks
 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.
 
-1. Generate small data (4 table files, the largest is 1e7 rows)
+Here is an example that generates the `small` dataset and runs the benchmark. To run another
+dataset size configuration, adjust the command as in the previous example.
+
 ```bash
+# Generate small data (4 table files, the largest is 1e7 rows)
 ./bench.sh data h2o_small_join
+
+# Run the benchmark
+./bench.sh run h2o_small_join
 ```
 
+To run a specific query with specific join data paths, note that the data paths must include 4 table files.
 
-2. Generate medium data (4 table files, the largest is 1e8 rows)
+For example, to run query 1 with the small data generated above:
 ```bash
-./bench.sh data h2o_medium_join
+cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
 ```
 
-3. Generate large data (4 table files, the largest is 1e9 rows)
-```bash
-./bench.sh data h2o_big_join
-```
+### Extended h2o benchmarks for window
 
-### Run h2o benchmarks
-There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
-1. Run small data benchmark
-```bash
-./bench.sh run h2o_small_join
-```
+This benchmark extends the h2o benchmark suite to evaluate window function performance. The h2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`.
 
-2. Run medium data benchmark
-```bash
-./bench.sh run h2o_medium_join
-```
+Here is an example that generates the `small` dataset and runs the benchmark. To run another
+dataset size configuration, adjust the command as in the previous example.
 
-3. Run large data benchmark
 ```bash
-./bench.sh run h2o_big_join
+# Generate small data
+./bench.sh data h2o_small_window
+
+# Run the benchmark
+./bench.sh run h2o_small_window
 ```
 
-4. Run a specific query with a specific join data paths, the data paths are including 4 table files.
+To run a specific query with specific window data paths, note that the data paths must include 4 table files (the same as the h2o-join dataset).
 
 For example, to run query 1 with the small data generated above:
 ```bash
-cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
+cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/window.sql --query 1
 ```
-[1]: http://www.tpc.org/tpch/
-[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
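The `--join-paths` argument above takes the four h2o tables as a single comma-separated list, and order matters. A minimal Python sketch of how that list maps to table names (the x/small/medium/large ordering is an assumption inferred from the `J1_*` file-name convention, not something documented here):

```python
# Hypothetical sketch: split the --join-paths value into the four h2o
# tables. The x/small/medium/large order is assumed from the file names.
join_paths = (
    "./benchmarks/data/h2o/J1_1e7_NA_0.csv,"
    "./benchmarks/data/h2o/J1_1e7_1e1_0.csv,"
    "./benchmarks/data/h2o/J1_1e7_1e4_0.csv,"
    "./benchmarks/data/h2o/J1_1e7_1e7_NA.csv"
)
tables = dict(zip(["x", "small", "medium", "large"], join_paths.split(",")))
print(tables["x"])  # ./benchmarks/data/h2o/J1_1e7_NA_0.csv
```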

benchmarks/bench.sh

Lines changed: 38 additions & 4 deletions

@@ -87,6 +87,9 @@ h2o_big: h2oai benchmark with large dataset (1e9 rows) for groupb
 h2o_small_join: h2oai benchmark with small dataset (1e7 rows) for join, default file format is csv
 h2o_medium_join: h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
 h2o_big_join: h2oai benchmark with large dataset (1e9 rows) for join, default file format is csv
+h2o_small_window: Extended h2oai benchmark with small dataset (1e7 rows) for window, default file format is csv
+h2o_medium_window: Extended h2oai benchmark with medium dataset (1e8 rows) for window, default file format is csv
+h2o_big_window: Extended h2oai benchmark with large dataset (1e9 rows) for window, default file format is csv
 imdb: Join Order Benchmark (JOB) using the IMDB dataset converted to parquet
 
 **********
@@ -205,6 +208,16 @@ main() {
         h2o_big_join)
             data_h2o_join "BIG" "CSV"
             ;;
+        # h2o window benchmark uses the same data as the h2o join
+        h2o_small_window)
+            data_h2o_join "SMALL" "CSV"
+            ;;
+        h2o_medium_window)
+            data_h2o_join "MEDIUM" "CSV"
+            ;;
+        h2o_big_window)
+            data_h2o_join "BIG" "CSV"
+            ;;
         external_aggr)
             # same data as for tpch
             data_tpch "1"
@@ -315,6 +328,15 @@ main() {
         h2o_big_join)
             run_h2o_join "BIG" "CSV" "join"
             ;;
+        h2o_small_window)
+            run_h2o_window "SMALL" "CSV" "window"
+            ;;
+        h2o_medium_window)
+            run_h2o_window "MEDIUM" "CSV" "window"
+            ;;
+        h2o_big_window)
+            run_h2o_window "BIG" "CSV" "window"
+            ;;
         external_aggr)
             run_external_aggr
             ;;
@@ -801,6 +823,7 @@ data_h2o_join() {
     deactivate
 }
 
+# Runner for h2o groupby benchmark
 run_h2o() {
     # Default values for size and data format
     SIZE=${1:-"SMALL"}
@@ -843,7 +866,8 @@ run_h2o() {
         -o "${RESULTS_FILE}"
 }
 
-run_h2o_join() {
+# Utility function to run the h2o join/window benchmarks
+h2o_runner() {
     # Default values for size and data format
     SIZE=${1:-"SMALL"}
     DATA_FORMAT=${2:-"CSV"}
@@ -852,10 +876,10 @@ run_h2o_join() {
 
     # Data directory and results file path
     H2O_DIR="${DATA_DIR}/h2o"
-    RESULTS_FILE="${RESULTS_DIR}/h2o_join.json"
+    RESULTS_FILE="${RESULTS_DIR}/h2o_${RUN_Type}.json"
 
     echo "RESULTS_FILE: ${RESULTS_FILE}"
-    echo "Running h2o join benchmark..."
+    echo "Running h2o ${RUN_Type} benchmark..."
 
     # Set the file name based on the size
     case "$SIZE" in
@@ -883,7 +907,7 @@ run_h2o_join() {
             ;;
     esac
 
-    # Set the query file name based on the RUN_Type
+    # Set the query file name based on the RUN_Type
     QUERY_FILE="${SCRIPT_DIR}/queries/h2o/${RUN_Type}.sql"
 
     $CARGO_COMMAND --bin dfbench -- h2o \
@@ -893,6 +917,16 @@ run_h2o_join() {
         -o "${RESULTS_FILE}"
 }
 
+# Runner for h2o join benchmark
+run_h2o_join() {
+    h2o_runner "$1" "$2" "join"
+}
+
+# Runner for h2o window benchmark
+run_h2o_window() {
+    h2o_runner "$1" "$2" "window"
+}
+
 # Runs the external aggregation benchmark
 run_external_aggr() {
     # Use TPC-H SF1 dataset
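The refactor above collapses the copy-pasted join runner into one `h2o_runner` parameterized by run type, so the join and window benchmarks share all logic and differ only in which query file and results file they use. A hedged Python sketch of the same dispatch pattern (path layout mirrors `QUERY_FILE` and `RESULTS_FILE` in `bench.sh`; the function and key names here are illustrative only):

```python
# Illustrative sketch of the h2o_runner dispatch pattern, not the real
# harness: derive per-run-type paths the way bench.sh does with ${RUN_Type}.
def h2o_runner(size: str, data_format: str, run_type: str) -> dict:
    return {
        "query_file": f"queries/h2o/{run_type}.sql",
        "results_file": f"results/h2o_{run_type}.json",
        "size": size,
        "format": data_format,
    }

def run_h2o_join(size="SMALL", data_format="CSV"):
    return h2o_runner(size, data_format, "join")

def run_h2o_window(size="SMALL", data_format="CSV"):
    return h2o_runner(size, data_format, "window")

print(run_h2o_window()["query_file"])  # queries/h2o/window.sql
```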

benchmarks/queries/h2o/groupby.sql

Lines changed: 10 additions & 1 deletion

@@ -1,10 +1,19 @@
 SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1;
+
 SELECT id1, id2, SUM(v1) AS v1 FROM x GROUP BY id1, id2;
+
 SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3;
+
 SELECT id4, AVG(v1) AS v1, AVG(v2) AS v2, AVG(v3) AS v3 FROM x GROUP BY id4;
+
 SELECT id6, SUM(v1) AS v1, SUM(v2) AS v2, SUM(v3) AS v3 FROM x GROUP BY id6;
+
 SELECT id4, id5, MEDIAN(v3) AS median_v3, STDDEV(v3) AS sd_v3 FROM x GROUP BY id4, id5;
+
 SELECT id3, MAX(v1) - MIN(v2) AS range_v1_v2 FROM x GROUP BY id3;
+
 SELECT id6, largest2_v3 FROM (SELECT id6, v3 AS largest2_v3, ROW_NUMBER() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2;
+
 SELECT id2, id4, POWER(CORR(v1, v2), 2) AS r2 FROM x GROUP BY id2, id4;
-SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count FROM x GROUP BY id1, id2, id3, id4, id5, id6;
+
+SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count FROM x GROUP BY id1, id2, id3, id4, id5, id6;
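For a quick sanity check of what these queries compute, the first groupby query can be run against a tiny in-memory table. This sketch uses SQLite as a stand-in engine (not DataFusion) and only the two columns the query touches:

```python
import sqlite3

# Tiny stand-in for the h2o "x" table: just id1 and v1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id1 TEXT, v1 INTEGER)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [("id001", 1), ("id001", 2), ("id002", 5)],
)

# Groupby query 1 from the file above (ORDER BY added for a stable result).
rows = conn.execute(
    "SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1 ORDER BY id1"
).fetchall()
print(rows)  # [('id001', 3), ('id002', 5)]
```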

benchmarks/queries/h2o/join.sql

Lines changed: 5 additions & 1 deletion

@@ -1,5 +1,9 @@
 SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
+
 SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x INNER JOIN medium ON x.id2 = medium.id2;
+
 SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id2 = medium.id2;
+
 SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x JOIN medium ON x.id5 = medium.id5;
-SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN large ON x.id3 = large.id3;
+
+SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN large ON x.id3 = large.id3;
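The first join query can likewise be sanity-checked on tiny stand-in tables (SQLite here, not DataFusion; only the columns the join condition and projection need):

```python
import sqlite3

# Tiny stand-ins for the h2o "x" and "small" tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id1 TEXT, v1 REAL)")
conn.execute("CREATE TABLE small (id1 TEXT, v2 REAL)")
conn.executemany("INSERT INTO x VALUES (?, ?)", [("id1", 1.0), ("id2", 2.0)])
conn.executemany("INSERT INTO small VALUES (?, ?)", [("id1", 10.0)])

# Trimmed-down version of join query 1: x INNER JOIN small ON id1.
# The unmatched x row ("id2") is dropped by the inner join.
rows = conn.execute(
    "SELECT x.id1, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1"
).fetchall()
print(rows)  # [('id1', 1.0, 10.0)]
```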

benchmarks/queries/h2o/window.sql

Lines changed: 112 additions & 0 deletions

@@ -0,0 +1,112 @@
+-- Basic Window
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER () AS window_basic
+FROM large;
+
+-- Sorted Window
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    first_value(v2) OVER (ORDER BY id3) AS first_order_by,
+    row_number() OVER (ORDER BY id3) AS row_number_order_by
+FROM large;
+
+-- PARTITION BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (PARTITION BY id1) AS sum_by_id1,
+    sum(v2) OVER (PARTITION BY id2) AS sum_by_id2,
+    sum(v2) OVER (PARTITION BY id3) AS sum_by_id3
+FROM large;
+
+-- PARTITION BY ORDER BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    first_value(v2) OVER (PARTITION BY id2 ORDER BY id3) AS first_by_id2_ordered_by_id3
+FROM large;
+
+-- Lead and Lag
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    first_value(v2) OVER (ORDER BY id3 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS my_lag,
+    first_value(v2) OVER (ORDER BY id3 ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS my_lead
+FROM large;
+
+-- Moving Averages
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    avg(v2) OVER (ORDER BY id3 ROWS BETWEEN 100 PRECEDING AND CURRENT ROW) AS my_moving_average
+FROM large;
+
+-- Rolling Sum
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (ORDER BY id3 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS my_rolling_sum
+FROM large;
+
+-- RANGE BETWEEN
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (ORDER BY v2 RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS my_range_between
+FROM large;
+
+-- First PARTITION BY ROWS BETWEEN
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    first_value(v2) OVER (PARTITION BY id2 ORDER BY id3 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS my_lag_by_id2,
+    first_value(v2) OVER (PARTITION BY id2 ORDER BY id3 ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS my_lead_by_id2
+FROM large;
+
+-- Moving Averages PARTITION BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    avg(v2) OVER (PARTITION BY id2 ORDER BY id3 ROWS BETWEEN 100 PRECEDING AND CURRENT ROW) AS my_moving_average_by_id2
+FROM large;
+
+-- Rolling Sum PARTITION BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (PARTITION BY id2 ORDER BY id3 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS my_rolling_sum_by_id2
+FROM large;
+
+-- RANGE BETWEEN PARTITION BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (PARTITION BY id2 ORDER BY v2 RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS my_range_between_by_id2
+FROM large;
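The "Lead and Lag" queries above emulate `lag()`/`lead()` with `first_value` over a frame of exactly one preceding (or following) row. A quick sketch confirming the equivalence, using SQLite (>= 3.25 for window function support) as a stand-in engine, not DataFusion:

```python
import sqlite3  # bundled SQLite must be >= 3.25 for window functions

# Tiny stand-in for the h2o "large" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large (id3 INTEGER, v2 INTEGER)")
conn.executemany("INSERT INTO large VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# first_value over a one-row preceding frame matches the built-in lag():
# for the first row the frame is empty, so both yield NULL.
rows = conn.execute("""
    SELECT id3,
           first_value(v2) OVER (ORDER BY id3
               ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS my_lag,
           lag(v2) OVER (ORDER BY id3) AS builtin_lag
    FROM large
""").fetchall()
print(rows)  # [(1, None, None), (2, 10, 10), (3, 20, 20)]
```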
