
Spark: Fix Puffin suffix for DV files #11986

Merged 1 commit into apache:main from fix-dv-suffix on Jan 17, 2025

Conversation

@amogh-jahagirdar (Contributor) commented Jan 17, 2025

Fixes #11968

Currently, when writing DVs in Spark, the wrong file suffix is used. The file format passed to the output file factory (which produces the actual filename) is incorrect: it falls back to the default delete file format from the conf, which is typically the same as the data file format.

Note that the actual Puffin DVs were always being written when the table format is V3; the files just had the wrong suffix, so users see "delete-foo.parquet" even though the file is really a Puffin file containing DVs.
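For illustration, the suffix mismatch comes down to which FileFormat value reaches the filename-generating factory. A minimal sketch (the enum and its addExtension method here are simplified stand-ins, not Iceberg's actual classes):

```java
// Simplified stand-in for a file format enum: the extension appended to
// generated file names is determined by the format value handed to the
// output file factory.
enum FileFormat {
    PARQUET("parquet"),
    PUFFIN("puffin");

    private final String ext;

    FileFormat(String ext) {
        this.ext = ext;
    }

    // Append the format's suffix unless the name already ends with it.
    String addExtension(String filename) {
        String suffix = "." + ext;
        return filename.endsWith(suffix) ? filename : filename + suffix;
    }
}

public class SuffixDemo {
    public static void main(String[] args) {
        // Before the fix: the factory received the default delete file
        // format (PARQUET), so a Puffin DV file got a ".parquet" suffix.
        System.out.println(FileFormat.PARQUET.addExtension("delete-foo"));
        // After the fix: PUFFIN is passed through, yielding ".puffin".
        System.out.println(FileFormat.PUFFIN.addExtension("delete-foo"));
    }
}
```

The file contents were always correct Puffin; only the name produced by this suffixing step was misleading.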

@github-actions bot added the spark label Jan 17, 2025
@amogh-jahagirdar (Contributor, Author) commented Jan 17, 2025

Working on a good place to add tests for this. We generally don't test against actual file paths in Iceberg since they're not necessary for correctness, but in this case the wrong suffix naturally causes confusion for users, so I think it's good to have some assertions on the actual file path.

@amogh-jahagirdar force-pushed the fix-dv-suffix branch 2 times, most recently from c354dbe to 21a6ebb, January 17, 2025 03:36
@amogh-jahagirdar marked this pull request as ready for review January 17, 2025 03:37
@amogh-jahagirdar added this to the Iceberg 1.8.0 milestone Jan 17, 2025
@@ -723,7 +727,6 @@ public DeleteGranularity deleteGranularity() {
   }

   public boolean useDVs() {
-    TableOperations ops = ((HasTableOperations) table).operations();
-    return ops.current().formatVersion() >= 3;
+    return !(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3;
@amogh-jahagirdar (Contributor, Author) commented Jan 17, 2025
I changed this because it is possible to pass metadata tables into SparkWriteConf, and previously the useDVs check would fail in cases like RewritePositionDeletes, which passes in the PositionMetadataTable. Metadata tables are by definition read-only, so I don't think it makes sense for the writeConf.useDVs() API to return true for them.

RewritePositionDeletes also uses metadataTable.deleteFileFormat() to determine the file format to write with, but I didn't want to change anything additional on the deleteFileFormat() path since that's used in quite a few more places.

I also refactored to use the new util, which also handles the case where the table is a SerializableTable.
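The guard described above can be sketched with stand-in types (the Table interface and the BaseTable/BaseMetadataTable classes below are simplified placeholders, not Iceberg's actual classes):

```java
public class UseDvsDemo {
    // Hypothetical stand-ins for the real table abstractions: both
    // report format version 3, but one is a read-only metadata table.
    interface Table {
        int formatVersion();
    }

    static class BaseTable implements Table {
        public int formatVersion() { return 3; }
    }

    static class BaseMetadataTable implements Table {
        public int formatVersion() { return 3; }
    }

    // Metadata tables are read-only, so DVs must never be enabled for
    // them even when the underlying table is format version 3.
    static boolean useDVs(Table table) {
        return !(table instanceof BaseMetadataTable) && table.formatVersion() >= 3;
    }

    public static void main(String[] args) {
        System.out.println(useDVs(new BaseTable()));         // true
        System.out.println(useDVs(new BaseMetadataTable())); // false
    }
}
```

Without the instanceof guard, a PositionMetadataTable passed in by RewritePositionDeletes would report DVs as enabled purely because the underlying table is V3.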

Comment on lines -725 to -728:

  public boolean useDVs() {
    TableOperations ops = ((HasTableOperations) table).operations();
    return ops.current().formatVersion() >= 3;
  }
@amogh-jahagirdar (Contributor, Author) commented Jan 17, 2025
I ended up removing this since I realized it's not yet an officially released public API on SparkWriteConf. We can just update deleteFileFormat() to return PUFFIN and then use that in SparkPositionDeltaWrite; that simplifies the conf while still solving the original issue of not surfacing the right format to the output file factory.
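A minimal sketch of that simplified approach, using hypothetical stand-in fields (isMetadataTable and formatVersion are placeholders for the real table checks, not Iceberg's API):

```java
// Hedged sketch, not the actual Iceberg sources: instead of a separate
// useDVs() accessor, deleteFileFormat() itself returns PUFFIN for v3+
// tables, so the output file factory naturally produces the ".puffin"
// suffix for DV files.
enum FileFormat { PARQUET, PUFFIN }

public class DeleteFormatDemo {
    // Hypothetical table state for the sketch.
    static boolean isMetadataTable = false; // stand-in for instanceof check
    static int formatVersion = 3;           // stand-in for TableUtil.formatVersion

    static FileFormat deleteFileFormat() {
        // DVs are Puffin files and only apply to format version >= 3;
        // read-only metadata tables keep the default delete file format.
        if (!isMetadataTable && formatVersion >= 3) {
            return FileFormat.PUFFIN;
        }
        return FileFormat.PARQUET; // default conf delete file format
    }

    public static void main(String[] args) {
        System.out.println(deleteFileFormat()); // PUFFIN for a v3 table
    }
}
```

Routing the decision through deleteFileFormat() means every consumer of the conf, including the output file factory, sees the same answer, rather than one path consulting a separate useDVs() flag.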


@@ -214,6 +214,10 @@ private boolean fanoutWriterEnabled(boolean defaultValue) {
   }

   public FileFormat deleteFileFormat() {
+    if (!(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3) {
@amogh-jahagirdar (Contributor, Author) commented:
See #11986 (comment) for why I changed this

A reviewer (Contributor) commented:

We may not need the table instanceof BaseMetadataTable check once we add support for minor DV compaction. It looks good for now.

@Fokko (Contributor) left a comment:

LGTM, thanks for fixing this.

cc @nastra, as this overlaps with #11588

@amogh-jahagirdar (Contributor, Author) commented:

Thanks for reviewing @nastra @Fokko!

@amogh-jahagirdar merged commit 4d0f40c into apache:main Jan 17, 2025; 31 checks passed.
@aokolnychyi (Contributor) left a comment:

Late +1 from me too.

Successfully merging this pull request may close these issues.

Creating Delete Vectors using Java API or Spark