[GLUTEN-8385][VL]Support write compatible-hive bucket table for Spark3.4 and Spark3.5. #8386

Open · wants to merge 3 commits into main
Conversation

@yikf (Contributor) commented Jan 1, 2025

What changes were proposed in this pull request?

This PR aims to support writing hive-compatible bucketed tables for Spark 3.4 and Spark 3.5. Fixes #8385.

How was this patch tested?

Unit tests.

github-actions bot added labels CORE (works for Gluten Core), VELOX, CLICKHOUSE, DOCS on Jan 1, 2025

github-actions bot commented Jan 1, 2025

#8385


github-actions bot commented Jan 1, 2025

Run Gluten Clickhouse CI on x86

@yikf (Contributor, Author) commented Jan 1, 2025

@JkSelf @jackylee-ch @zhouyuan Could you please take a look when you have time? Thanks!

@JkSelf (Contributor) commented Jan 2, 2025

cc @ulysses-you

.getOrElse("__hive_compatible_bucketed_table_insertion__", "false")
.equals("true")
// Support hive compatible bucket and partition table.
if (bucketSpec.isEmpty || (isHiveCompatibleBucketTable && isPartitionedTable)) {
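The condition above gates when the write is offloaded. As a minimal sketch of that boolean logic, using plain hypothetical booleans rather than Gluten's actual types:

```cpp
#include <cassert>

// Sketch of the gating condition quoted above: offload the write when the
// table is unbucketed, or when it is a hive-compatible bucketed table that
// is also partitioned. Parameter names are illustrative.
bool canOffloadWrite(
    bool hasBucketSpec,
    bool hiveCompatibleBucket,
    bool partitioned) {
  return !hasBucketSpec || (hiveCompatibleBucket && partitioned);
}
```

The partition requirement is exactly what the reviewer questions below.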
Contributor:

Why does it require a partitioned table?

Contributor Author:

It seems that Velox currently only supports writing to bucketed tables that are also partitioned; see this code. I'm not sure of the reason behind it.

Contributor:

@yikf It seems Velox already supports bucketed writes for non-partitioned tables (facebookincubator/velox#9740). Can you help verify? Thanks.

Contributor Author:

@JkSelf It seems to only support the CTAS case.
[screenshot]

I tested it with the latest code; it failed due to this branch.
[screenshot]

I guess this PR won't reach this branch if the table doesn't exist.
[screenshot]

return std::make_shared<connector::hive::LocationHandle>(
targetDirectory, writeDirectory.value_or(targetDirectory), tableType, targetFileName);
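The snippet above constructs Velox's `LocationHandle`, which distinguishes the final target directory from an optional staging write directory. A simplified mock of the idea, with hypothetical names rather than the real Velox API:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Simplified mock of the LocationHandle concept: data is first written to
// writeDirectory (a staging location, if one is given) and later moved to
// targetDirectory. Not the actual Velox class.
struct LocationHandle {
  std::string targetDirectory;
  std::string writeDirectory;
};

LocationHandle makeLocationHandle(
    const std::string& target,
    const std::optional<std::string>& staging) {
  // Mirrors writeDirectory.value_or(targetDirectory) in the quoted code:
  // with no staging directory, writes go straight to the target.
  return {target, staging.value_or(target)};
}
```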
Contributor:

I think we should only use the hive-style file name for bucketed writes, and keep the previous behavior for non-bucketed tables.

Contributor Author:

Thank you for your suggestion.

For non-bucketed tables, we can keep the previous approach.

For bucketed tables, the bucketId can only be computed during the write itself, and it determines the final file name; that part is controlled by Velox.
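To illustrate why the file name depends on the write: in Hive's convention the bucket id is derived from a hash of the bucket columns, and hive-compatible bucket files are named with the zero-padded bucket id (e.g. `000000_0`). A minimal sketch, assuming Hive's scheme where an int column hashes to its own value and the bucket id is `(hash & Integer.MAX_VALUE) % numBuckets`; function names here are illustrative, not Gluten's or Velox's API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hive-style bucket id: hash the bucket columns (for a single int column,
// Hive's hash is the value itself), mask the sign bit, then take the value
// modulo the bucket count.
int hiveBucketId(int32_t hash, int numBuckets) {
  return (hash & 0x7fffffff) % numBuckets;
}

// Hive-compatible bucket files are named "<6-digit bucket id>_<attempt>",
// e.g. bucket 3 -> "000003_0".
std::string hiveBucketFileName(int bucketId) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%06d_0", bucketId);
  return std::string(buf);
}
```

Because the bucket id is only known once each row's bucket columns are hashed at write time, the writer (Velox) is the natural place to pick the file name.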

@@ -28,6 +28,7 @@ changed `Unbounded` in `WindowFunction` into `Unbounded_Preceding` and `Unbounde
* Added `WriteRel` ([#3690](https://github.com/apache/incubator-gluten/pull/3690)).
* Added `TopNRel` ([#5409](https://github.com/apache/incubator-gluten/pull/5409)).
* Added `ref` field in window bound `Preceding` and `Following` ([#5626](https://github.com/apache/incubator-gluten/pull/5626)).
* Added `BucketSpec` field in `WriteRel`([](https://github.com/apache/incubator-gluten/pull/))
Contributor:

Missing PR number.


github-actions bot commented Jan 3, 2025

Run Gluten Clickhouse CI on x86


github-actions bot commented Jan 3, 2025

Run Gluten Clickhouse CI on x86


github-actions bot commented Jan 4, 2025

Run Gluten Clickhouse CI on x86

Successfully merging this pull request may close these issues.

[VL] Support write compatible-hive bucket table for Spark3.4 and Spark3.5.
3 participants