[GLUTEN-8385][VL] Support write compatible-hive bucket table for Spark3.4 and Spark3.5 #8386
Conversation
Run Gluten Clickhouse CI on x86
@JkSelf @jackylee-ch @zhouyuan Could you please take a look when you have time? Thanks!
cc @ulysses-you
.getOrElse("__hive_compatible_bucketed_table_insertion__", "false") | ||
.equals("true") | ||
// Support hive compatible bucket and partition table. | ||
if (bucketSpec.isEmpty || (isHiveCompatibleBucketTable && isPartitionedTable)) { |
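For context, a minimal, hypothetical example of the kind of table this condition admits to the native write path; the table and column names (`sales`, `src`) are illustrative only, not from this PR:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical illustration: a Hive-compatible table that is both bucketed and
// partitioned. With __hive_compatible_bucketed_table_insertion__ set by the
// Hive insert path, this write satisfies the condition above; a bucketed but
// non-partitioned table would not.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("""
  CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (id) INTO 8 BUCKETS
  STORED AS PARQUET
""")
spark.sql("INSERT INTO sales PARTITION (dt = '2024-01-01') SELECT id, amount FROM src")
```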
Why does it require a partitioned table?
It seems that Velox currently only supports writing to bucketed and partitioned tables; see this code. I'm not sure of the reason behind it.
@yikf It seems Velox already supports bucketed writes to non-partitioned tables (facebookincubator/velox#9740). Can you help verify? Thanks.
@JkSelf It seems to only support the CTAS case.
I tested it with the latest code, and it failed at this branch.
I guess that branch isn't reached when the table doesn't exist.
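To make the distinction concrete, a hedged sketch of the two write paths being discussed, assuming a `spark` session as in the earlier sketch; the table names (`ctas_bucketed`, `existing_bucketed`, `src`) are hypothetical, and the exact DDL that exercises each path in Velox may differ:

```scala
// CTAS: the target bucketed table does not exist yet. This is the case that
// facebookincubator/velox#9740 appears to cover.
spark.sql("""
  CREATE TABLE ctas_bucketed
  USING PARQUET
  CLUSTERED BY (id) INTO 4 BUCKETS
  AS SELECT id FROM src
""")

// INSERT INTO an existing, non-partitioned bucketed table. This is the case
// reported above to fail at the linked Velox branch.
spark.sql("INSERT INTO existing_bucketed SELECT id FROM src")
```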
```cpp
return std::make_shared<connector::hive::LocationHandle>(
    targetDirectory, writeDirectory.value_or(targetDirectory), tableType, targetFileName);
```
I think we should only use Hive-style file names for bucketed writes; for non-bucketed tables, keep the previous behavior.
Thank you for your suggestion.
For non-bucketed tables, we can keep the previous approach.
For bucketed tables, the bucket ID can only be computed during the write itself, so the final file name has to be determined by Velox.
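For reference, a rough sketch of why the name must come from the writer: Hive-compatible bucket files encode the bucket id in a zero-padded file-name prefix, and that id is only known once rows are hashed during the write. The helper below is purely illustrative, not Gluten or Velox code:

```scala
// Hypothetical helper: Hive-style bucket file names look like "000003_0",
// where the zero-padded prefix is the bucket id and the suffix is an attempt
// id. The bucket id comes from hashing row values at write time, so the final
// name cannot be fixed up front by Spark.
def hiveBucketFileName(bucketId: Int, attemptId: Int = 0): String =
  f"$bucketId%06d_$attemptId"

// hiveBucketFileName(3) == "000003_0"
```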
```diff
@@ -28,6 +28,7 @@ changed `Unbounded` in `WindowFunction` into `Unbounded_Preceding` and `Unbounde
 * Added `WriteRel` ([#3690](https://github.com/apache/incubator-gluten/pull/3690)).
 * Added `TopNRel` ([#5409](https://github.com/apache/incubator-gluten/pull/5409)).
 * Added `ref` field in window bound `Preceding` and `Following` ([#5626](https://github.com/apache/incubator-gluten/pull/5626)).
+* Added `BucketSpec` field in `WriteRel` ([](https://github.com/apache/incubator-gluten/pull/))
```
Missing PR number.
What changes were proposed in this pull request?
This PR aims to support writing Hive-compatible bucketed tables for Spark 3.4 and Spark 3.5. Fixes #8385.
How was this patch tested?
Unit tests.