[SDP] [SPARK-52576] Drop/recreate on full refresh and MV update #51280

sryza · 2025-06-25T17:51:21Z

What changes were proposed in this pull request?

Some pipeline runs result in wiping out and replacing all the data for a table:

- Every run of a materialized view
- Runs of streaming tables that have the "full refresh" flag

In the current implementation, this "wipe out and replace" is implemented by:

- Truncating the table
- Altering the table to drop/update/add columns that don't match the columns in the DataFrame for the current run

The reason we originally wanted to truncate + alter instead of drop/recreate is that dropping has some undesirable effects. For example, it interrupts readers of the table and wipes away things like ACLs.

However, we discovered that not all catalogs support dropping columns (e.g. Hive does not), and there’s no way to tell whether a catalog supports dropping columns or not. So this PR changes the implementation to drop/recreate the table instead of truncate/alter.
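
To make the new rule concrete, here is a minimal sketch of the decision of when an existing table gets dropped and recreated. It mirrors the condition in the `DatasetManager` diff from this PR, but `TableState`, `RefreshPolicy`, and `shouldDropAndRecreate` are hypothetical names for illustration only, and no catalog calls are made.

```scala
// Hypothetical, simplified model of the drop/recreate decision.
// This is not the actual DatasetManager code; names are assumptions.
final case class TableState(isStreamingTable: Boolean, existsInCatalog: Boolean)

object RefreshPolicy {
  /**
   * Returns true when the existing table should be dropped and recreated:
   * either the run is a full refresh, or the table is a materialized view
   * (i.e. not a streaming table). A table that does not exist yet has
   * nothing to drop.
   */
  def shouldDropAndRecreate(table: TableState, isFullRefresh: Boolean): Boolean =
    (isFullRefresh || !table.isStreamingTable) && table.existsInCatalog
}
```

Under this sketch, a materialized view that already exists is always dropped and recreated, while a streaming table is only dropped when the full refresh flag is set.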

Why are the changes needed?

See section above.

Does this PR introduce any user-facing change?

Yes, see section above. No releases contained the old behavior.

How was this patch tested?

- Tests in MaterializeTablesSuite
- Ran the tests in MaterializeTablesSuite with Hive instead of the default catalog

Was this patch authored or co-authored using generative AI tooling?

No

szehon-ho

Makes sense, can change if we add support for drop Column for HMS in the V2SessionCatalog

szehon-ho · 2025-06-25T21:04:12Z

sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/DatasetManager.scala

+    val dropTable = (isFullRefresh || !table.isStreamingTableOpt.get) && existingTableOpt.isDefined
+    if (dropTable) {
+      catalog.dropTable(identifier)
+//      context.spark.sql(s"DROP TABLE ${table.identifier.quotedString}")


nit: remove? Optionally add comment about why not truncate/alter?

gengliangwang · 2025-06-25T22:03:12Z

sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/DatasetManager.scala

+    val dropTable = (isFullRefresh || !table.isStreamingTableOpt.get) && existingTableOpt.isDefined
+    if (dropTable) {
+      catalog.dropTable(identifier)
+//      context.spark.sql(s"DROP TABLE ${table.identifier.quotedString}")


shall we remove this line?

gengliangwang · 2025-06-25T22:31:54Z

sql/pipelines/src/test/scala/org/apache/spark/sql/pipelines/graph/MaterializeTablesSuite.scala

@@ -446,8 +446,9 @@ class MaterializeTablesSuite extends BaseCoreExecutionTest {

    val table2 = catalog.loadTable(identifier)
    assert(
-      table2.columns() sameElements CatalogV2Util
-        .structTypeToV2Columns(new StructType().add("y", IntegerType).add("x", BooleanType))
+      table2.columns().toSet == CatalogV2Util


why do we need this change?

The ordering of columns does not appear to be deterministic (at least across different catalog implementations). Is that unexpected?

for a table, the column order matters. I think we should keep the test as it is and fix the issues we found.
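
The difference between the two assertions discussed in this thread can be shown with plain Scala collections. This is an illustrative sketch only: `Col` is a hypothetical stand-in for a catalog column, not the Spark `Column` type, and the type names are plain strings.

```scala
// Hypothetical stand-in for a catalog column; not the Spark V2 Column type.
final case class Col(name: String, dataType: String)

object ColumnOrderDemo {
  // Expected column order, as in the original test: y first, then x.
  val expected: Array[Col] = Array(Col("y", "int"), Col("x", "boolean"))
  // Same columns, returned in a different order.
  val reordered: Array[Col] = Array(Col("x", "boolean"), Col("y", "int"))

  // sameElements compares position by position, so ordering differences fail.
  val orderSensitive: Boolean = reordered.sameElements(expected)

  // Converting to sets discards ordering, so the same comparison passes.
  val orderInsensitive: Boolean = reordered.toSet == expected.toSet
}
```

The set-based comparison would hide a column ordering change, which is why keeping `sameElements` preserves the stricter check requested in the review.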

drop on full refresh

1a00003

sryza requested review from gengliangwang and cloud-fan June 25, 2025 17:51

github-actions bot added the SQL label Jun 25, 2025

sryza changed the title ~~[SDP] Drop/recreate on full refresh and MV update~~ [SDP] [SPARK-52576] Drop/recreate on full refresh and MV update Jun 25, 2025

szehon-ho approved these changes Jun 25, 2025


gengliangwang reviewed Jun 25, 2025


remove commented-out code

60f7918

sryza requested review from gengliangwang and szehon-ho June 26, 2025 15:11
