Commit 8e4b4cd

Author: zhangli20

release version 3.0.0

Refactor join implementations to support existence joins and BHJ building its hash map on the driver side. Supports spark333 batch shuffle reading. Updates rust-toolchain to the latest nightly version. Other minor improvements. Updates docs.

1 parent: 173607e
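The headline change is the join refactor: an existence join keeps every probe-side row and annotates it with a flag saying whether a matching key exists on the build side, and a broadcast hash join (BHJ) can now, per the commit message, build its hash map once on the driver rather than rebuilding it per executor. A minimal Rust sketch of existence-join semantics (names and types here are illustrative, not Blaze's actual API):

```rust
use std::collections::HashSet;

// Existence join: every left (probe) row survives, tagged with a boolean
// telling whether its key appears on the right (build) side. The HashSet
// plays the role of the hash map a BHJ builds from the broadcast relation.
fn existence_join(left: &[(i32, &str)], right_keys: &[i32]) -> Vec<(i32, String, bool)> {
    // Build side: hash all join keys once (in Blaze 3.0.0 this build step
    // can happen on the driver and be shipped to executors).
    let build: HashSet<i32> = right_keys.iter().copied().collect();
    left.iter()
        .map(|&(k, v)| (k, v.to_string(), build.contains(&k)))
        .collect()
}

fn main() {
    let rows = existence_join(&[(1, "a"), (2, "b"), (3, "c")], &[1, 3]);
    // Keys 1 and 3 exist on the build side; key 2 does not.
    assert_eq!(rows[0], (1, "a".to_string(), true));
    assert_eq!(rows[1], (2, "b".to_string(), false));
    assert_eq!(rows[2], (3, "c".to_string(), true));
    println!("{rows:?}");
}
```

Unlike a semi or anti join, no probe rows are dropped; the boolean column is what downstream operators (e.g. `EXISTS` subquery rewrites) consume.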

File tree: 73 files changed, +4750 −3592 lines


.github/workflows/build-ce7-releases.yml (1 addition, 1 deletion)

```diff
@@ -12,7 +12,7 @@ jobs:
     strategy:
       matrix:
         sparkver: [spark303, spark333]
-        blazever: [2.0.9.1]
+        blazever: [3.0.0]

     steps:
       - uses: actions/checkout@v4
```

.github/workflows/tpcds.yml (7 additions, 8 deletions)

```diff
@@ -34,19 +34,18 @@ jobs:
         with: {version: "21.7"}

       - uses: actions-rust-lang/setup-rust-toolchain@v1
-        with: {rustflags: --allow warnings -C target-cpu=native}
+        with:
+          toolchain: nightly
+          rustflags: --allow warnings -C target-feature=+aes
+          components:
+            cargo
+            rustfmt

       - name: Rustfmt Check
         uses: actions-rust-lang/rustfmt@v1

-      ## - name: Rust Clippy Check
-      ##   uses: actions-rs/clippy-check@v1
-      ##   with:
-      ##     token: ${{ secrets.GITHUB_TOKEN }}
-      ##     args: --all-features
-
       - name: Cargo test
-        run: cargo test --workspace --all-features
+        run: cargo +nightly test --workspace --all-features

       - name: Build Spark303
         run: mvn package -Ppre -Pspark303
```
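The workflow now selects `toolchain: nightly` explicitly and invokes `cargo +nightly test`, matching the commit message's "update rust-toolchain to latest nightly version". Repositories usually pin this for local builds too via a `rust-toolchain.toml` at the repo root; a hypothetical sketch (the exact channel and component list are assumptions, not taken from this commit):

```toml
# rust-toolchain.toml (illustrative): rustup picks this toolchain up
# automatically for every cargo invocation inside the repository.
[toolchain]
channel = "nightly"
components = ["cargo", "rustfmt"]
```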

Cargo.lock (26 additions, 25 deletions)

Generated file; diff not rendered.

Cargo.toml (18 additions, 18 deletions)

```diff
@@ -64,26 +64,26 @@ serde_json = { version = "1.0.96" }

 [patch.crates-io]
 # datafusion: branch=v36-blaze
-datafusion = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-common = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-execution = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-optimizer = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-physical-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
+datafusion = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-common = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-execution = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-optimizer = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-physical-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}

 # arrow: branch=v50-blaze
-arrow = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-arith = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-array = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-buffer = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-cast = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-data = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-ord = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-row = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-schema = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-select = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-string = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-parquet = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
+arrow = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-arith = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-array = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-buffer = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-cast = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-data = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-ord = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-row = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-schema = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-select = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-string = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+parquet = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}

 # serde_json: branch=v1.0.96-blaze
 serde_json = { git = "https://github.com/blaze-init/json", branch = "v1.0.96-blaze" }
```
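This change only bumps the pinned git revisions of the forked datafusion and arrow-rs crates. Cargo's `[patch.crates-io]` section transparently redirects every dependency on those crate names, including transitive ones, from crates.io to the listed git sources, which is why each member crate must be patched individually at the same rev. A minimal sketch of the mechanism (crate and version chosen for illustration):

```toml
[dependencies]
arrow = "50"   # resolved from crates.io by default

[patch.crates-io]
# All uses of `arrow` anywhere in the dependency graph now come from this
# fork at a pinned revision instead of the registry release.
arrow = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499" }
```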

README.md (13 additions, 9 deletions)

````diff
@@ -73,7 +73,7 @@ Blaze._

 ```shell
 SHIM=spark333 # or spark303
-MODE=release # or dev
+MODE=release # or pre
 mvn package -P"${SHIM}" -P"${MODE}"
 ```

@@ -94,11 +94,16 @@ This section describes how to submit and configure a Spark Job with Blaze support

 1. move blaze jar package to spark client classpath (normally `spark-xx.xx.xx/jars/`).

 2. add the follow confs to spark configuration in `spark-xx.xx.xx/conf/spark-default.conf`:
+
 ```properties
+spark.blaze.enable true
 spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
 spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
+spark.memory.offHeap.enabled false

-# other blaze confs defined in spark-extension/src/main/java/org/apache/spark/sql/blaze/BlazeConf.java
+# suggested executor memory configuration
+spark.executor.memory 4g
+spark.executor.memoryOverhead 4096
 ```

 3. submit a query with spark-sql, or other tools like spark-thriftserver:

@@ -108,16 +113,15 @@ spark-sql -f tpcds/q01.sql

 ## Performance

-Check [Benchmark Results](./benchmark-results/20240202.md) with the latest date for the performance
-comparison with vanilla Spark on TPC-DS 1TB dataset. The benchmark result shows that Blaze saved
-~55% query time and ~60% cluster resources in average. ~6x performance achieved for the best case (q06).
+Check [Benchmark Results](./benchmark-results/20240701-blaze300.md) with the latest date for the performance
+comparison with vanilla Spark 3.3.3. The benchmark result shows that Blaze save about 50% time on TPC-DS/TPC-H 1TB datasets.
 Stay tuned and join us for more upcoming thrilling numbers.

-Query time:
-![20240202-query-time](./benchmark-results/blaze-query-time-comparison-20240202.png)
+TPC-DS Query time:
+![20240701-query-time-tpcds](./benchmark-results/spark333-vs-blaze300-query-time-20240701.png)

-Cluster resources:
-![20240202-resources](./benchmark-results/blaze-cluster-resources-cost-comparison-20240202.png)
+TPC-H Query time:
+![20240701-query-time-tpch](./benchmark-results/spark333-vs-blaze300-query-time-20240701-tpch.png)

 We also encourage you to benchmark Blaze and share the results with us. 🤗
````
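The new README properties can also be supplied per job on the command line rather than in `spark-default.conf`. A sketch assembled from the confs in the diff above (it assumes a Spark client with the Blaze jar already on its classpath):

```shell
spark-sql \
  --conf spark.blaze.enable=true \
  --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
  --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager \
  --conf spark.memory.offHeap.enabled=false \
  --conf spark.executor.memory=4g \
  --conf spark.executor.memoryOverhead=4096 \
  -f tpcds/q01.sql
```

Note the suggested split: off-heap is disabled on the Spark side, while `spark.executor.memoryOverhead` reserves room for Blaze's native memory usage.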

RELEASES.md (9 additions, 6 deletions)

```diff
@@ -1,12 +1,15 @@
-# blaze-v2.0.9.1
+# blaze-v3.0.0

 ## Features
-* Supports failing-back nondeterministic expressions.
-* Supports "$[].xxx" jsonpath syntax in get_json_object().
+* Supports using spark.io.compression.codec for shuffle/broadcast compression
+* Supports date type casting
+* Refactor join implementations to support existence joins and BHJ building hash map on driver side

 ## Performance
-* Supports adaptive batch size in ParquetScan, improving vectorized reading performance.
-* Supports directly spill to disk file when on-heap memory is full.
+* Fixed performance issues when running on spark3 with default configurations
+* Use cached parquet metadata
+* Refactor native broadcast to avoid duplicated broadcast jobs
+* Supports spark333 batch shuffle reading

 ## Bugfix
-* Fix incorrect parquet rowgroup pruning with files containing deprecated min/max values.
+* Fix in_list conversion in from_proto.rs
```
