
Commit 8f898a7

Add h2o window benchmark (#16003)

* h2o-window benchmark
* Review: clarify h2o-window is an extended benchmark

1 parent 4efbeee commit 8f898a7

File tree

6 files changed: +218 -39 lines changed

benchmarks/README.md

Lines changed: 32 additions & 28 deletions

@@ -545,9 +545,16 @@ cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '...
 ```
 
 
-## h2o benchmarks for groupby
+## h2o.ai benchmarks
+The h2o.ai benchmarks are a set of performance tests for groupby and join operations. Beyond the standard h2o benchmark, there is also an extended benchmark for window functions. These benchmarks use synthetic data with configurable sizes (small: 1e7 rows, medium: 1e8 rows, big: 1e9 rows) to evaluate DataFusion's performance across different data scales.
 
-### Generate data for h2o benchmarks
+Reference:
+- [H2O AI Benchmark](https://duckdb.org/2023/04/14/h2oai.html)
+- [Extended window benchmark](https://duckdb.org/2024/06/26/benchmarks-over-time.html#window-functions-benchmark)
+
+### h2o benchmarks for groupby
+
+#### Generate data for h2o benchmarks
 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.
 
 1. Generate small data (1e7 rows)
@@ -567,7 +574,7 @@ There are three options for generating data for h2o benchmarks: `small`, `medium
 ./bench.sh data h2o_big
 ```
 
-### Run h2o benchmarks
+#### Run h2o benchmarks
 There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
 1. Run small data benchmark
 ```bash
@@ -591,49 +598,46 @@ For example, to run query 1 with the small data generated above:
 cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --query 1
 ```
 
-## h2o benchmarks for join
+### h2o benchmarks for join
 
-### Generate data for h2o benchmarks
 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.
 
-1. Generate small data (4 table files, the largest is 1e7 rows)
+Here is an example that generates the `small` dataset and runs the benchmark. To run another
+dataset size configuration, adjust the command as in the previous example.
+
 ```bash
+# Generate small data (4 table files, the largest is 1e7 rows)
 ./bench.sh data h2o_small_join
+
+# Run the benchmark
+./bench.sh run h2o_small_join
 ```
 
+To run a specific query with specific join data paths, note that the data paths must include 4 table files.
 
-2. Generate medium data (4 table files, the largest is 1e8 rows)
+For example, to run query 1 with the small data generated above:
 ```bash
-./bench.sh data h2o_medium_join
+cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
 ```
 
-3. Generate large data (4 table files, the largest is 1e9 rows)
-```bash
-./bench.sh data h2o_big_join
-```
+### Extended h2o benchmarks for window
 
-### Run h2o benchmarks
-There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
-1. Run small data benchmark
-```bash
-./bench.sh run h2o_small_join
-```
+This benchmark extends the h2o benchmark suite to evaluate window function performance. The h2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`.
 
-2. Run medium data benchmark
-```bash
-./bench.sh run h2o_medium_join
-```
+Here is an example that generates the `small` dataset and runs the benchmark. To run another
+dataset size configuration, adjust the command as in the previous example.
 
-3. Run large data benchmark
 ```bash
-./bench.sh run h2o_big_join
+# Generate small data
+./bench.sh data h2o_small_window
+
+# Run the benchmark
+./bench.sh run h2o_small_window
 ```
 
-4. Run a specific query with a specific join data paths, the data paths are including 4 table files.
+To run a specific query with specific window data paths, note that the data paths must include 4 table files (the same as the h2o-join dataset).
 
 For example, to run query 1 with the small data generated above:
 ```bash
-cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
+cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/window.sql --query 1
 ```
-[1]: http://www.tpc.org/tpch/
-[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
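The `--join-paths` argument above takes the four h2o tables as a single comma-separated list, and order matters. A minimal Python sketch of how that list maps to table names (the x/small/medium/large ordering is an assumption inferred from the `J1_*` file-name convention, not something documented here):

```python
# Hypothetical sketch: split the --join-paths value into the four h2o
# tables. The x/small/medium/large order is assumed from the file names.
join_paths = (
    "./benchmarks/data/h2o/J1_1e7_NA_0.csv,"
    "./benchmarks/data/h2o/J1_1e7_1e1_0.csv,"
    "./benchmarks/data/h2o/J1_1e7_1e4_0.csv,"
    "./benchmarks/data/h2o/J1_1e7_1e7_NA.csv"
)
tables = dict(zip(["x", "small", "medium", "large"], join_paths.split(",")))
print(tables["x"])  # ./benchmarks/data/h2o/J1_1e7_NA_0.csv
```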

benchmarks/bench.sh

Lines changed: 38 additions & 4 deletions

@@ -87,6 +87,9 @@ h2o_big: h2oai benchmark with large dataset (1e9 rows) for groupb
 h2o_small_join: h2oai benchmark with small dataset (1e7 rows) for join, default file format is csv
 h2o_medium_join: h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
 h2o_big_join: h2oai benchmark with large dataset (1e9 rows) for join, default file format is csv
+h2o_small_window: Extended h2oai benchmark with small dataset (1e7 rows) for window, default file format is csv
+h2o_medium_window: Extended h2oai benchmark with medium dataset (1e8 rows) for window, default file format is csv
+h2o_big_window: Extended h2oai benchmark with large dataset (1e9 rows) for window, default file format is csv
 imdb: Join Order Benchmark (JOB) using the IMDB dataset converted to parquet
 
 **********
@@ -205,6 +208,16 @@ main() {
         h2o_big_join)
             data_h2o_join "BIG" "CSV"
             ;;
+        # h2o window benchmark uses the same data as the h2o join
+        h2o_small_window)
+            data_h2o_join "SMALL" "CSV"
+            ;;
+        h2o_medium_window)
+            data_h2o_join "MEDIUM" "CSV"
+            ;;
+        h2o_big_window)
+            data_h2o_join "BIG" "CSV"
+            ;;
         external_aggr)
             # same data as for tpch
             data_tpch "1"
@@ -315,6 +328,15 @@ main() {
         h2o_big_join)
             run_h2o_join "BIG" "CSV" "join"
             ;;
+        h2o_small_window)
+            run_h2o_window "SMALL" "CSV" "window"
+            ;;
+        h2o_medium_window)
+            run_h2o_window "MEDIUM" "CSV" "window"
+            ;;
+        h2o_big_window)
+            run_h2o_window "BIG" "CSV" "window"
+            ;;
         external_aggr)
             run_external_aggr
             ;;
@@ -801,6 +823,7 @@ data_h2o_join() {
     deactivate
 }
 
+# Runner for h2o groupby benchmark
 run_h2o() {
     # Default values for size and data format
     SIZE=${1:-"SMALL"}
@@ -843,7 +866,8 @@ run_h2o() {
         -o "${RESULTS_FILE}"
 }
 
-run_h2o_join() {
+# Utility function to run the h2o join/window benchmarks
+h2o_runner() {
     # Default values for size and data format
     SIZE=${1:-"SMALL"}
     DATA_FORMAT=${2:-"CSV"}
@@ -852,10 +876,10 @@ run_h2o_join() {
 
     # Data directory and results file path
     H2O_DIR="${DATA_DIR}/h2o"
-    RESULTS_FILE="${RESULTS_DIR}/h2o_join.json"
+    RESULTS_FILE="${RESULTS_DIR}/h2o_${RUN_Type}.json"
 
     echo "RESULTS_FILE: ${RESULTS_FILE}"
-    echo "Running h2o join benchmark..."
+    echo "Running h2o ${RUN_Type} benchmark..."
 
     # Set the file name based on the size
     case "$SIZE" in
@@ -883,7 +907,7 @@ run_h2o_join() {
             ;;
     esac
 
-    # Set the query file name based on the RUN_Type
+    # Set the query file name based on the RUN_Type
     QUERY_FILE="${SCRIPT_DIR}/queries/h2o/${RUN_Type}.sql"
 
     $CARGO_COMMAND --bin dfbench -- h2o \
@@ -893,6 +917,16 @@ run_h2o_join() {
         -o "${RESULTS_FILE}"
 }
 
+# Runner for h2o join benchmark
+run_h2o_join() {
+    h2o_runner "$1" "$2" "join"
+}
+
+# Runner for h2o window benchmark
+run_h2o_window() {
+    h2o_runner "$1" "$2" "window"
+}
+
 # Runs the external aggregation benchmark
 run_external_aggr() {
     # Use TPC-H SF1 dataset
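The refactor above collapses the copy-pasted join runner into one `h2o_runner` parameterized by run type, so the join and window benchmarks share all logic and differ only in which query file and results file they use. A hedged Python sketch of the same dispatch pattern (path layout mirrors `QUERY_FILE` and `RESULTS_FILE` in `bench.sh`; the function and key names here are illustrative only):

```python
# Illustrative sketch of the h2o_runner dispatch pattern, not the real
# harness: derive per-run-type paths the way bench.sh does with ${RUN_Type}.
def h2o_runner(size: str, data_format: str, run_type: str) -> dict:
    return {
        "query_file": f"queries/h2o/{run_type}.sql",
        "results_file": f"results/h2o_{run_type}.json",
        "size": size,
        "format": data_format,
    }

def run_h2o_join(size="SMALL", data_format="CSV"):
    return h2o_runner(size, data_format, "join")

def run_h2o_window(size="SMALL", data_format="CSV"):
    return h2o_runner(size, data_format, "window")

print(run_h2o_window()["query_file"])  # queries/h2o/window.sql
```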

benchmarks/queries/h2o/groupby.sql

Lines changed: 10 additions & 1 deletion

@@ -1,10 +1,19 @@
 SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1;
+
 SELECT id1, id2, SUM(v1) AS v1 FROM x GROUP BY id1, id2;
+
 SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3;
+
 SELECT id4, AVG(v1) AS v1, AVG(v2) AS v2, AVG(v3) AS v3 FROM x GROUP BY id4;
+
 SELECT id6, SUM(v1) AS v1, SUM(v2) AS v2, SUM(v3) AS v3 FROM x GROUP BY id6;
+
 SELECT id4, id5, MEDIAN(v3) AS median_v3, STDDEV(v3) AS sd_v3 FROM x GROUP BY id4, id5;
+
 SELECT id3, MAX(v1) - MIN(v2) AS range_v1_v2 FROM x GROUP BY id3;
+
 SELECT id6, largest2_v3 FROM (SELECT id6, v3 AS largest2_v3, ROW_NUMBER() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2;
+
 SELECT id2, id4, POWER(CORR(v1, v2), 2) AS r2 FROM x GROUP BY id2, id4;
-SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count FROM x GROUP BY id1, id2, id3, id4, id5, id6;
+
+SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count FROM x GROUP BY id1, id2, id3, id4, id5, id6;
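For a quick sanity check of what these queries compute, the first groupby query can be run against a tiny in-memory table. This sketch uses SQLite as a stand-in engine (not DataFusion) and only the two columns the query touches:

```python
import sqlite3

# Tiny stand-in for the h2o "x" table: just id1 and v1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id1 TEXT, v1 INTEGER)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [("id001", 1), ("id001", 2), ("id002", 5)],
)

# Groupby query 1 from the file above (ORDER BY added for a stable result).
rows = conn.execute(
    "SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1 ORDER BY id1"
).fetchall()
print(rows)  # [('id001', 3), ('id002', 5)]
```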

benchmarks/queries/h2o/join.sql

Lines changed: 5 additions & 1 deletion

@@ -1,5 +1,9 @@
 SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
+
 SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x INNER JOIN medium ON x.id2 = medium.id2;
+
 SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id2 = medium.id2;
+
 SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x JOIN medium ON x.id5 = medium.id5;
-SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN large ON x.id3 = large.id3;
+
+SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN large ON x.id3 = large.id3;
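The first join query can likewise be sanity-checked on tiny stand-in tables (SQLite here, not DataFusion; only the columns the join condition and projection need):

```python
import sqlite3

# Tiny stand-ins for the h2o "x" and "small" tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id1 TEXT, v1 REAL)")
conn.execute("CREATE TABLE small (id1 TEXT, v2 REAL)")
conn.executemany("INSERT INTO x VALUES (?, ?)", [("id1", 1.0), ("id2", 2.0)])
conn.executemany("INSERT INTO small VALUES (?, ?)", [("id1", 10.0)])

# Trimmed-down version of join query 1: x INNER JOIN small ON id1.
# The unmatched x row ("id2") is dropped by the inner join.
rows = conn.execute(
    "SELECT x.id1, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1"
).fetchall()
print(rows)  # [('id1', 1.0, 10.0)]
```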

benchmarks/queries/h2o/window.sql

Lines changed: 112 additions & 0 deletions

@@ -0,0 +1,112 @@
+-- Basic Window
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER () AS window_basic
+FROM large;
+
+-- Sorted Window
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    first_value(v2) OVER (ORDER BY id3) AS first_order_by,
+    row_number() OVER (ORDER BY id3) AS row_number_order_by
+FROM large;
+
+-- PARTITION BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (PARTITION BY id1) AS sum_by_id1,
+    sum(v2) OVER (PARTITION BY id2) AS sum_by_id2,
+    sum(v2) OVER (PARTITION BY id3) AS sum_by_id3
+FROM large;
+
+-- PARTITION BY ORDER BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    first_value(v2) OVER (PARTITION BY id2 ORDER BY id3) AS first_by_id2_ordered_by_id3
+FROM large;
+
+-- Lead and Lag
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    first_value(v2) OVER (ORDER BY id3 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS my_lag,
+    first_value(v2) OVER (ORDER BY id3 ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS my_lead
+FROM large;
+
+-- Moving Averages
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    avg(v2) OVER (ORDER BY id3 ROWS BETWEEN 100 PRECEDING AND CURRENT ROW) AS my_moving_average
+FROM large;
+
+-- Rolling Sum
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (ORDER BY id3 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS my_rolling_sum
+FROM large;
+
+-- RANGE BETWEEN
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (ORDER BY v2 RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS my_range_between
+FROM large;
+
+-- First PARTITION BY ROWS BETWEEN
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    first_value(v2) OVER (PARTITION BY id2 ORDER BY id3 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS my_lag_by_id2,
+    first_value(v2) OVER (PARTITION BY id2 ORDER BY id3 ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS my_lead_by_id2
+FROM large;
+
+-- Moving Averages PARTITION BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    avg(v2) OVER (PARTITION BY id2 ORDER BY id3 ROWS BETWEEN 100 PRECEDING AND CURRENT ROW) AS my_moving_average_by_id2
+FROM large;
+
+-- Rolling Sum PARTITION BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (PARTITION BY id2 ORDER BY id3 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS my_rolling_sum_by_id2
+FROM large;
+
+-- RANGE BETWEEN PARTITION BY
+SELECT
+    id1,
+    id2,
+    id3,
+    v2,
+    sum(v2) OVER (PARTITION BY id2 ORDER BY v2 RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS my_range_between_by_id2
+FROM large;
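The "Lead and Lag" queries above emulate `lag()`/`lead()` with `first_value` over a frame of exactly one preceding (or following) row. A quick sketch confirming the equivalence, using SQLite (>= 3.25 for window function support) as a stand-in engine, not DataFusion:

```python
import sqlite3  # bundled SQLite must be >= 3.25 for window functions

# Tiny stand-in for the h2o "large" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large (id3 INTEGER, v2 INTEGER)")
conn.executemany("INSERT INTO large VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# first_value over a one-row preceding frame matches the built-in lag():
# for the first row the frame is empty, so both yield NULL.
rows = conn.execute("""
    SELECT id3,
           first_value(v2) OVER (ORDER BY id3
               ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS my_lag,
           lag(v2) OVER (ORDER BY id3) AS builtin_lag
    FROM large
""").fetchall()
print(rows)  # [(1, None, None), (2, 10, 10), (3, 20, 20)]
```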
