The h2o.ai benchmarks are a set of performance tests for groupby and join operations. Beyond the standard h2o benchmark, there is also an extended benchmark for window functions. These benchmarks use synthetic data with configurable sizes (small: 1e7 rows, medium: 1e8 rows, big: 1e9 rows) to evaluate DataFusion's performance across different data scales.

-### Generate data for h2o benchmarks
+Reference:
+- [H2O AI Benchmark](https://duckdb.org/2023/04/14/h2oai.html)
3. Generate large data (4 table files, the largest is 1e9 rows)
-```bash
-./bench.sh data h2o_big_join
-```
+### Extended h2o benchmarks for window

-### Run h2o benchmarks
-There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
-1. Run small data benchmark
-```bash
-./bench.sh run h2o_small_join
-```
+This benchmark extends the h2o benchmark suite to evaluate window function performance. The h2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`.

-2. Run medium data benchmark
-```bash
-./bench.sh run h2o_medium_join
-```
+Here is an example that generates the `small` dataset and runs the benchmark. To run other
+dataset size configurations, change the commands as in the previous example.

-3. Run large data benchmark
```bash
-./bench.sh run h2o_big_join
+# Generate small data
+./bench.sh data h2o_small_window
+
+# Run the benchmark
+./bench.sh run h2o_small_window
```
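
The small-dataset commands above follow a naming pattern that could be applied to the other sizes. The following is a minimal sketch, assuming the `medium` and `big` window targets are named `h2o_medium_window` and `h2o_big_window` by analogy with `h2o_small_window` (this naming is an assumption, not something the text above confirms). The helpers `h2o_target` and `h2o_window_commands` are hypothetical; they only print the commands so they can be reviewed before anything is run.

```shell
#!/usr/bin/env bash

# Build a bench.sh target name from a dataset size and a suite name.
# Assumption: targets follow the h2o_<size>_<suite> pattern seen in
# h2o_small_window and h2o_big_join.
h2o_target() {
  local size="$1" suite="$2"
  case "$size" in
    small|medium|big) printf 'h2o_%s_%s\n' "$size" "$suite" ;;
    *) echo "unknown size: $size" >&2; return 1 ;;
  esac
}

# Print (dry run) the data-generation and run commands for one size.
h2o_window_commands() {
  local target
  target="$(h2o_target "$1" window)" || return 1
  echo "./bench.sh data ${target}"
  echo "./bench.sh run ${target}"
}

# Show the two commands for the medium dataset.
h2o_window_commands medium
```

Printing instead of executing keeps the sketch safe to try outside the benchmarks directory; the printed lines can be copied into a shell once the target names are confirmed against `bench.sh`.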

-4. Run a specific query with a specific join data paths, the data paths are including 4 table files.
+To run a specific query with specific window data paths, specify the paths to the 4 table files (the same as the h2o-join dataset).

For example, to run query 1 with the small data generated above:
SELECT id1, id2, SUM(v1) AS v1 FROM x GROUP BY id1, id2;
+
SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3;
+
SELECT id4, AVG(v1) AS v1, AVG(v2) AS v2, AVG(v3) AS v3 FROM x GROUP BY id4;
+
SELECT id6, SUM(v1) AS v1, SUM(v2) AS v2, SUM(v3) AS v3 FROM x GROUP BY id6;
+
SELECT id4, id5, MEDIAN(v3) AS median_v3, STDDEV(v3) AS sd_v3 FROM x GROUP BY id4, id5;
+
SELECT id3, MAX(v1) - MIN(v2) AS range_v1_v2 FROM x GROUP BY id3;
+
SELECT id6, largest2_v3 FROM (SELECT id6, v3 AS largest2_v3, ROW_NUMBER() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2;
+
SELECT id2, id4, POWER(CORR(v1, v2), 2) AS r2 FROM x GROUP BY id2, id4;
-SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count FROM x GROUP BY id1, id2, id3, id4, id5, id6;
+
+SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count FROM x GROUP BY id1, id2, id3, id4, id5, id6;