Spark: Fix Puffin suffix for DV files #11986
Conversation
Working on a good place to add tests for this... We generally don't test against actual file paths in Iceberg since they're not necessary for correctness, but in this case the wrong path naturally causes confusion for users, so I think it's good to have some assertions on the actual file path.
(Resolved review thread on spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java, now outdated)
Force-pushed c354dbe to 21a6ebb
Force-pushed 21a6ebb to e13f67d
```diff
@@ -723,7 +727,6 @@ public DeleteGranularity deleteGranularity() {
   }

   public boolean useDVs() {
-    TableOperations ops = ((HasTableOperations) table).operations();
-    return ops.current().formatVersion() >= 3;
+    return !(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3;
```
I changed this because it's possible to pass metadata tables into SparkWriteConf, and previously the useDVs check would fail in cases like RewritePositionDeletes, which passes in the PositionMetadataTable. Metadata tables are by definition read-only, so I don't think it makes sense for the writeConf.useDVs() API to return true for them.
RewritePositionDeletes also uses metadataTable.deleteFileFormat() to determine the file format to write with, but I didn't want to change anything additional on the deleteFileFormat() path since that's used in quite a few more places...
I also refactored to use the new util, which additionally handles the case where the table is a SerializableTable.
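For illustration, a minimal sketch of the guarded check described above (`writesDVs` is a hypothetical helper name; the actual change inlines this expression in SparkWriteConf):

```java
import org.apache.iceberg.BaseMetadataTable;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableUtil;

// Metadata tables such as PositionMetadataTable are read-only, so they never
// report DV support; other tables defer to TableUtil.formatVersion, which
// also handles SerializableTable wrappers.
static boolean writesDVs(Table table) {
  return !(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3;
}
```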
(Resolved review thread on spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java, now outdated)
Force-pushed e13f67d to b342c7d
```java
public boolean useDVs() {
  TableOperations ops = ((HasTableOperations) table).operations();
  return ops.current().formatVersion() >= 3;
}
```
I ended up removing this since I realized it's not yet an officially released public API on SparkWriteConf. We can just update deleteFileFormat() to return PUFFIN and then use that in SparkPositionDeltaWrite; that simplifies the conf while solving the original issue of not surfacing the right format to the output file factory.
SGTM, I had a similar concern here: https://github.com/apache/iceberg/pull/11588/files#r1916175700
```diff
@@ -214,6 +214,10 @@ private boolean fanoutWriterEnabled(boolean defaultValue) {
   }

+  public FileFormat deleteFileFormat() {
+    if (!(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3) {
```
See #11986 (comment) for why I changed this.
We may not need `table instanceof BaseMetadataTable` once we add support for minor DV compaction. It looks good now.
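Putting the pieces together, a simplified sketch of the final shape of the method (`configuredDeleteFileFormat` is a hypothetical stand-in for the existing option/table-property lookup, which typically defaults to the data file format):

```java
public FileFormat deleteFileFormat() {
  // V3 non-metadata tables write deletes as Puffin DVs, so callers get the
  // format that is actually written and filenames get the right suffix.
  if (!(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3) {
    return FileFormat.PUFFIN;
  }
  // hypothetical stand-in for the existing config resolution
  return configuredDeleteFileFormat();
}
```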
Late +1 from me too.
Fixes #11968
Currently, when writing DVs in Spark, an incorrect file suffix is used. This is because the file format passed to the output file factory, which produces the actual filename, is wrong: it uses the default conf delete file format (typically the same as the data file format).
Note, actual Puffin DVs were always being written when the table format is V3; the files just had the wrong suffix, so users see "delete-foo.parquet" even though the file is really a Puffin file with DVs.
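For context, a rough sketch of why the suffix follows the format handed to the factory (the builder calls are from the public OutputFileFactory API, and the `suffix` option is assumed here as used for DV writes; the exact call site in SparkPositionDeltaWrite may differ):

```java
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.OutputFileFactory;

// The factory derives the generated filename's suffix from the FileFormat it
// is built with. Before the fix, the default delete file format (often
// PARQUET) was passed even when Puffin DVs were being written, producing
// names like "...-deletes.parquet" for files that are really Puffin.
static OutputFileFactory dvFileFactory(Table table, int partitionId, long taskId) {
  return OutputFileFactory.builderFor(table, partitionId, taskId)
      .format(FileFormat.PUFFIN) // matches what is actually written
      .suffix("deletes")
      .build();
}
```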