[GLUTEN-8385][VL]Support write compatible-hive bucket table for Spark3.4 and Spark3.5. #8386

Open · wants to merge 3 commits into main
Conversation

@yikf (Contributor) commented Jan 1, 2025

What changes were proposed in this pull request?

This PR aims to support writing hive-compatible bucketed tables for Spark 3.4 and Spark 3.5. Fixes #8385.

How was this patch tested?

Unit tests.

github-actions bot added labels CORE (works for Gluten Core), VELOX, CLICKHOUSE, DOCS on Jan 1, 2025

github-actions bot commented Jan 1, 2025

#8385


github-actions bot commented Jan 1, 2025

Run Gluten Clickhouse CI on x86

@yikf (Contributor, Author) commented Jan 1, 2025

@JkSelf @jackylee-ch @zhouyuan Could you please take a look when you have time? Thanks!

@JkSelf (Contributor) commented Jan 2, 2025

cc @ulysses-you

.getOrElse("__hive_compatible_bucketed_table_insertion__", "false")
.equals("true")
// Support hive compatible bucket and partition table.
if (bucketSpec.isEmpty || (isHiveCompatibleBucketTable && isPartitionedTable)) {
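The condition above gates when the write is offloaded. As a minimal sketch of that boolean logic, using plain hypothetical booleans rather than Gluten's actual types:

```cpp
#include <cassert>

// Sketch of the gating condition quoted above: offload the write when the
// table is unbucketed, or when it is a hive-compatible bucketed table that
// is also partitioned. Parameter names are illustrative.
bool canOffloadWrite(
    bool hasBucketSpec,
    bool hiveCompatibleBucket,
    bool partitioned) {
  return !hasBucketSpec || (hiveCompatibleBucket && partitioned);
}
```

The partition requirement is exactly what the reviewer questions below.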
Contributor:

Why does it require a partitioned table?

Contributor Author:

It seems that Velox currently only supports writing to bucketed tables that are also partitioned; see this code. I'm not sure of the reason behind it.

Contributor:

@yikf It seems Velox already supports bucketed writes for non-partitioned tables (facebookincubator/velox#9740). Can you help verify? Thanks.

Contributor Author:

@JkSelf It seems to only support the CTAS case.
[screenshot]

I tested it with the latest code; it failed due to this branch.
[screenshot]

I guess this PR won't reach this branch if the table doesn't exist.
[screenshot]

return std::make_shared<connector::hive::LocationHandle>(
targetDirectory, writeDirectory.value_or(targetDirectory), tableType, targetFileName);
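The snippet above constructs Velox's `LocationHandle`, which distinguishes the final target directory from an optional staging write directory. A simplified mock of the idea, with hypothetical names rather than the real Velox API:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Simplified mock of the LocationHandle concept: data is first written to
// writeDirectory (a staging location, if one is given) and later moved to
// targetDirectory. Not the actual Velox class.
struct LocationHandle {
  std::string targetDirectory;
  std::string writeDirectory;
};

LocationHandle makeLocationHandle(
    const std::string& target,
    const std::optional<std::string>& staging) {
  // Mirrors writeDirectory.value_or(targetDirectory) in the quoted code:
  // with no staging directory, writes go straight to the target.
  return {target, staging.value_or(target)};
}
```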
Contributor:

I think we should only use the hive-style file name for bucketed writes, and keep the previous behavior for non-bucketed tables.

Contributor Author:

Thank you for your suggestion.

For non-bucketed tables, we can keep the previous approach.

For bucketed tables, the bucketId can only be computed during the write itself, and it determines the final file name; that part is controlled by Velox.
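To illustrate why the file name depends on the write: in Hive's convention the bucket id is derived from a hash of the bucket columns, and hive-compatible bucket files are named with the zero-padded bucket id (e.g. `000000_0`). A minimal sketch, assuming Hive's scheme where an int column hashes to its own value and the bucket id is `(hash & Integer.MAX_VALUE) % numBuckets`; function names here are illustrative, not Gluten's or Velox's API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hive-style bucket id: hash the bucket columns (for a single int column,
// Hive's hash is the value itself), mask the sign bit, then take the value
// modulo the bucket count.
int hiveBucketId(int32_t hash, int numBuckets) {
  return (hash & 0x7fffffff) % numBuckets;
}

// Hive-compatible bucket files are named "<6-digit bucket id>_<attempt>",
// e.g. bucket 3 -> "000003_0".
std::string hiveBucketFileName(int bucketId) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%06d_0", bucketId);
  return std::string(buf);
}
```

Because the bucket id is only known once each row's bucket columns are hashed at write time, the writer (Velox) is the natural place to pick the file name.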

@@ -28,6 +28,7 @@ changed `Unbounded` in `WindowFunction` into `Unbounded_Preceding` and `Unbounde
* Added `WriteRel` ([#3690](https://github.com/apache/incubator-gluten/pull/3690)).
* Added `TopNRel` ([#5409](https://github.com/apache/incubator-gluten/pull/5409)).
* Added `ref` field in window bound `Preceding` and `Following` ([#5626](https://github.com/apache/incubator-gluten/pull/5626)).
* Added `BucketSpec` field in `WriteRel`([](https://github.com/apache/incubator-gluten/pull/))
Contributor:

Missing PR number.


github-actions bot commented Jan 3, 2025

Run Gluten Clickhouse CI on x86


github-actions bot commented Jan 3, 2025

Run Gluten Clickhouse CI on x86


github-actions bot commented Jan 4, 2025

Run Gluten Clickhouse CI on x86

Successfully merging this pull request may close these issues.

[VL] Support write compatible-hive bucket table for Spark3.4 and Spark3.5.
3 participants