
Spark: Fix Puffin suffix for DV files #11986

Merged 1 commit into apache:main from fix-dv-suffix on Jan 17, 2025

Conversation

@amogh-jahagirdar (Contributor) commented Jan 17, 2025

Fixes #11968

Currently, when writing DVs in Spark, the wrong file suffix is used. The file format passed to the output file factory (which produces the actual filename) is incorrect: it falls back to the default delete file format from the conf, which is typically the same as the data file format.

Note that the actual Puffin DVs were always being written when the table format is V3; the files just had the wrong suffix, so users see "delete-foo.parquet" even though the file is really a Puffin file containing DVs.
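For illustration, the suffix mismatch comes down to which FileFormat value reaches the filename-generating factory. A minimal sketch (the enum and its addExtension method here are simplified stand-ins, not Iceberg's actual classes):

```java
// Simplified stand-in for a file format enum: the extension appended to
// generated file names is determined by the format value handed to the
// output file factory.
enum FileFormat {
    PARQUET("parquet"),
    PUFFIN("puffin");

    private final String ext;

    FileFormat(String ext) {
        this.ext = ext;
    }

    // Append the format's suffix unless the name already ends with it.
    String addExtension(String filename) {
        String suffix = "." + ext;
        return filename.endsWith(suffix) ? filename : filename + suffix;
    }
}

public class SuffixDemo {
    public static void main(String[] args) {
        // Before the fix: the factory received the default delete file
        // format (PARQUET), so a Puffin DV file got a ".parquet" suffix.
        System.out.println(FileFormat.PARQUET.addExtension("delete-foo"));
        // After the fix: PUFFIN is passed through, yielding ".puffin".
        System.out.println(FileFormat.PUFFIN.addExtension("delete-foo"));
    }
}
```

The file contents were always correct Puffin; only the name produced by this suffixing step was misleading.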

@github-actions bot added the spark label Jan 17, 2025
@amogh-jahagirdar (Contributor, Author) commented Jan 17, 2025

Working on a good place to add tests for this. We generally don't test against actual file paths in Iceberg since they're not necessary for correctness, but in this case the wrong suffix naturally causes confusion for users, so I think it's good to have some assertions on the actual file path.

@amogh-jahagirdar force-pushed the fix-dv-suffix branch 2 times, most recently from c354dbe to 21a6ebb, January 17, 2025 03:36
@amogh-jahagirdar marked this pull request as ready for review January 17, 2025 03:37
@amogh-jahagirdar added this to the Iceberg 1.8.0 milestone Jan 17, 2025
@@ -723,7 +727,6 @@ public DeleteGranularity deleteGranularity() {
   }

   public boolean useDVs() {
-    TableOperations ops = ((HasTableOperations) table).operations();
-    return ops.current().formatVersion() >= 3;
+    return !(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3;
@amogh-jahagirdar (Contributor, Author) commented Jan 17, 2025
I changed this because it is possible to pass metadata tables into SparkWriteConf, and previously the useDVs check would fail in cases like RewritePositionDeletes, which passes in the PositionMetadataTable. Metadata tables are by definition read-only, so I don't think it makes sense for the writeConf.useDVs() API to return true for them.

RewritePositionDeletes also uses metadataTable.deleteFileFormat() to determine the file format to write with, but I didn't want to change anything additional on the deleteFileFormat() path since that's used in quite a few more places.

I also refactored to use the new util, which also handles the case where the table is a SerializableTable.
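The guard described above can be sketched with stand-in types (the Table interface and the BaseTable/BaseMetadataTable classes below are simplified placeholders, not Iceberg's actual classes):

```java
public class UseDvsDemo {
    // Hypothetical stand-ins for the real table abstractions: both
    // report format version 3, but one is a read-only metadata table.
    interface Table {
        int formatVersion();
    }

    static class BaseTable implements Table {
        public int formatVersion() { return 3; }
    }

    static class BaseMetadataTable implements Table {
        public int formatVersion() { return 3; }
    }

    // Metadata tables are read-only, so DVs must never be enabled for
    // them even when the underlying table is format version 3.
    static boolean useDVs(Table table) {
        return !(table instanceof BaseMetadataTable) && table.formatVersion() >= 3;
    }

    public static void main(String[] args) {
        System.out.println(useDVs(new BaseTable()));         // true
        System.out.println(useDVs(new BaseMetadataTable())); // false
    }
}
```

Without the instanceof guard, a PositionMetadataTable passed in by RewritePositionDeletes would report DVs as enabled purely because the underlying table is V3.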

Comment on lines -725 to -728:

  public boolean useDVs() {
    TableOperations ops = ((HasTableOperations) table).operations();
    return ops.current().formatVersion() >= 3;
  }
@amogh-jahagirdar (Contributor, Author) commented Jan 17, 2025
I ended up removing this since I realized it's not yet an officially released public API on SparkWriteConf. We can just update deleteFileFormat() to return PUFFIN and then use that in SparkPositionDeltaWrite; that simplifies the conf while still solving the original issue of not surfacing the right format to the output file factory.
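A minimal sketch of that simplified approach, using hypothetical stand-in fields (isMetadataTable and formatVersion are placeholders for the real table checks, not Iceberg's API):

```java
// Hedged sketch, not the actual Iceberg sources: instead of a separate
// useDVs() accessor, deleteFileFormat() itself returns PUFFIN for v3+
// tables, so the output file factory naturally produces the ".puffin"
// suffix for DV files.
enum FileFormat { PARQUET, PUFFIN }

public class DeleteFormatDemo {
    // Hypothetical table state for the sketch.
    static boolean isMetadataTable = false; // stand-in for instanceof check
    static int formatVersion = 3;           // stand-in for TableUtil.formatVersion

    static FileFormat deleteFileFormat() {
        // DVs are Puffin files and only apply to format version >= 3;
        // read-only metadata tables keep the default delete file format.
        if (!isMetadataTable && formatVersion >= 3) {
            return FileFormat.PUFFIN;
        }
        return FileFormat.PARQUET; // default conf delete file format
    }

    public static void main(String[] args) {
        System.out.println(deleteFileFormat()); // PUFFIN for a v3 table
    }
}
```

Routing the decision through deleteFileFormat() means every consumer of the conf, including the output file factory, sees the same answer, rather than one path consulting a separate useDVs() flag.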


@@ -214,6 +214,10 @@ private boolean fanoutWriterEnabled(boolean defaultValue) {
   }

   public FileFormat deleteFileFormat() {
+    if (!(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3) {
@amogh-jahagirdar (Contributor, Author) commented:
See #11986 (comment) for why I changed this

A reviewer (Contributor) commented:

We may not need the table instanceof BaseMetadataTable check once we add support for minor DV compaction. It looks good for now.

@Fokko (Contributor) left a comment:

LGTM, thanks for fixing this.

cc @nastra, as this overlaps with #11588

@amogh-jahagirdar (Contributor, Author) commented:

Thanks for reviewing @nastra @Fokko!

@amogh-jahagirdar merged commit 4d0f40c into apache:main Jan 17, 2025; 31 checks passed.
@aokolnychyi (Contributor) left a comment:

Late +1 from me too.

Successfully merging this pull request may close these issues.

Creating Delete Vectors using Java API or Spark