[SDP] [SPARK-52576] Drop/recreate on full refresh and MV update #51280
Conversation
Makes sense; we can change this if we add support for dropping columns for HMS in the V2SessionCatalog.
```scala
val dropTable = (isFullRefresh || !table.isStreamingTableOpt.get) && existingTableOpt.isDefined
if (dropTable) {
  catalog.dropTable(identifier)
  // context.spark.sql(s"DROP TABLE ${table.identifier.quotedString}")
```
nit: remove? Optionally add comment about why not truncate/alter?
```scala
val dropTable = (isFullRefresh || !table.isStreamingTableOpt.get) && existingTableOpt.isDefined
if (dropTable) {
  catalog.dropTable(identifier)
  // context.spark.sql(s"DROP TABLE ${table.identifier.quotedString}")
```
shall we remove this line?
```diff
@@ -446,8 +446,9 @@ class MaterializeTablesSuite extends BaseCoreExecutionTest {
     val table2 = catalog.loadTable(identifier)
     assert(
-      table2.columns() sameElements CatalogV2Util
+      table2.columns().toSet == CatalogV2Util
         .structTypeToV2Columns(new StructType().add("y", IntegerType).add("x", BooleanType))
```
why do we need this change?
The ordering of columns does not appear to be deterministic (at least across different catalog implementations). Is that unexpected?
for a table, the column order matters. I think we should keep the test as it is and fix the issues we found.
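For illustration, here is a self-contained sketch (not from the PR) contrasting the two assertion styles under discussion; `NamedColumn` is a hypothetical stand-in for the catalog's `Column` type:

```scala
// Hypothetical stand-in for the catalog Column type, used only in this sketch.
case class NamedColumn(name: String)

object ColumnAssertionSketch {
  def main(args: Array[String]): Unit = {
    val expected = Array(NamedColumn("y"), NamedColumn("x"))
    val actual   = Array(NamedColumn("x"), NamedColumn("y")) // same columns, different order

    // Order-sensitive comparison (the original `sameElements` assertion):
    // fails when the catalog returns the columns in a different order.
    assert(!(actual sameElements expected))

    // Order-insensitive comparison (the `.toSet ==` style in this PR):
    // passes as long as the same set of columns is present.
    assert(actual.toSet == expected.toSet)
  }
}
```

As noted above, relaxing the test to the set comparison hides any column-ordering regression, which is why the reviewer prefers keeping the order-sensitive assertion and fixing the underlying issue instead.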
What changes were proposed in this pull request?
Some pipeline runs result in wiping out and replacing all the data for a table:
- full refreshes
- updates to materialized views (non-streaming tables)
In the current implementation, this "wipe out and replace" is implemented by:
- truncating the existing table
- altering its schema to match the new definition
The reason we originally wanted to truncate + alter instead of drop/recreate is that dropping has some undesirable effects, e.g. it interrupts readers of the table and wipes away things like ACLs.
However, we discovered that not all catalogs support dropping columns (e.g. Hive does not), and there’s no way to tell whether a catalog supports dropping columns or not. So this PR changes the implementation to drop/recreate the table instead of truncate/alter.
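As a rough illustration of the flow described above, here is a minimal sketch; `SimpleCatalog` and `TableSpec` are hypothetical stand-ins for this example, not Spark's actual TableCatalog API:

```scala
// Hypothetical stand-ins for this sketch only; not Spark's TableCatalog API.
case class TableSpec(name: String, isStreamingTable: Boolean)

trait SimpleCatalog {
  def tableExists(name: String): Boolean
  def dropTable(name: String): Unit
  def createTable(spec: TableSpec): Unit
}

object FullRefreshSketch {
  def materialize(catalog: SimpleCatalog, table: TableSpec, isFullRefresh: Boolean): Unit = {
    // Drop the existing table when this run is a full refresh, or when the
    // table is a materialized view (i.e. not a streaming table), mirroring
    // the condition in the diff quoted earlier.
    val dropExisting =
      (isFullRefresh || !table.isStreamingTable) && catalog.tableExists(table.name)

    if (dropExisting) {
      catalog.dropTable(table.name)
    }

    if (!catalog.tableExists(table.name)) {
      // Recreate with the latest definition instead of truncating and ALTERing
      // the old table, since some catalogs (e.g. Hive) cannot drop columns.
      catalog.createTable(table)
    }
  }
}
```

The tradeoff, as noted above, is that dropping the table interrupts readers and loses table-level metadata such as ACLs.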
Why are the changes needed?
See section above.
Does this PR introduce any user-facing change?
Yes, see section above. No releases contained the old behavior.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No