
Spark: Test metadata tables with format-version=3 #12135

Open · wants to merge 1 commit into base: main
Conversation

@nastra (Contributor) commented Jan 30, 2025

No description provided.

@github-actions bot added the spark label Jan 30, 2025
private int formatVersion;

@Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}, formatVersion = {3}")
protected static Object[][] parameters() {
@nastra (Author) commented:

As defined in CatalogTestBase, the default catalogs that were previously tested were Hadoop/Hive/Spark/REST. I'm limiting this here to Spark + REST, since we're adding testing for format versions 2 and 3.
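
For illustration, a minimal sketch of the reduced parameter matrix described above: the Spark and REST catalogs, each paired with format versions 2 and 3. The SPARK and REST constant names on SparkCatalogConfig are assumed here and may not match the actual change.

      // Sketch only: SPARK and REST constant names are assumed for illustration.
      {
        SparkCatalogConfig.SPARK.catalogName(),
        SparkCatalogConfig.SPARK.implementation(),
        SparkCatalogConfig.SPARK.properties(),
        2
      },
      {
        SparkCatalogConfig.SPARK.catalogName(),
        SparkCatalogConfig.SPARK.implementation(),
        SparkCatalogConfig.SPARK.properties(),
        3
      },
      {
        SparkCatalogConfig.REST.catalogName(),
        SparkCatalogConfig.REST.implementation(),
        SparkCatalogConfig.REST.properties(),
        2
      },
      {
        SparkCatalogConfig.REST.catalogName(),
        SparkCatalogConfig.REST.implementation(),
        SparkCatalogConfig.REST.properties(),
        3
      },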

A reviewer (Contributor) commented:

Does this mean we drop coverage for the Hadoop/Hive catalogs with format version 2 while adding format version 3 for the Spark and REST catalogs?

@nastra (Author) replied:

Yes, but the important piece here is to test V2 and V3. I don't think we want to test all possible combinations of catalogs with V2 + V3, as that would be a large test matrix.

The reviewer (Contributor) replied:

I think we can just add the following lines to keep the original v2 coverage and then selectively enable v3 for catalogs that are ready. Since many Iceberg users are still in the process of adopting REST-based catalogs and v2 is our default spec at the moment, keeping the current coverage in place to test metadata table changes would be essential.

      {
        SparkCatalogConfig.HIVE.catalogName(),
        SparkCatalogConfig.HIVE.implementation(),
        SparkCatalogConfig.HIVE.properties(),
        2
      },
      {
        SparkCatalogConfig.HADOOP.catalogName(),
        SparkCatalogConfig.HADOOP.implementation(),
        SparkCatalogConfig.HADOOP.properties(),
        2
      },

@@ -84,6 +84,9 @@ public InternalRow next() {
        rowValues.add(deleteFile.contentOffset());
      } else if (fieldId == MetadataColumns.CONTENT_SIZE_IN_BYTES_COLUMN_ID) {
        rowValues.add(ScanTaskUtil.contentSizeInBytes(deleteFile));
      } else if (fieldId == MetadataColumns.DELETE_FILE_ROW_FIELD_ID) {
        // DVs don't track the row that was deleted
        rowValues.add(null);
@nastra (Author) commented:

This fixes an issue when reading from the .position_deletes table when the underlying table is V3. By default, the schema includes MetadataColumns.DELETE_FILE_ROW_FIELD_ID, which we don't track for DVs, so we return null here.
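
For context, a rough sketch (not taken from the PR) of how a test might verify this behavior, assuming the position_deletes table exposes file_path, pos, and row columns and using the sql/assertThat helpers seen elsewhere in these tests:

      // Sketch only: for a V3 table using deletion vectors, the "row" column of
      // position_deletes is expected to be null because DVs don't track the deleted row.
      List<Object[]> positionDeletes =
          sql("SELECT file_path, pos, row FROM %s.position_deletes", tableName);
      assertThat(positionDeletes).isNotEmpty();
      assertThat(positionDeletes).allSatisfy(r -> assertThat(r[2]).isNull());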

Comment on lines +210 to +213
assertThat(sql("SELECT * FROM %s.delete_files", tableName)).hasSize(1);

// check position_deletes table
assertThat(sql("SELECT * FROM %s.position_deletes", tableName)).hasSize(2);
A reviewer (Contributor) commented:

Should we assert the contents of the rows? I can see how it's a bit of a pain between v2/v3, but it feels like a stronger assertion.
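
A minimal sketch of what such a content assertion might look like, assuming a pos column on position_deletes; the expected positions are illustrative placeholders, not values from the PR:

      // Sketch only: expected positions (0L, 1L) are placeholders for illustration.
      List<Object[]> deletes =
          sql("SELECT pos FROM %s.position_deletes ORDER BY pos", tableName);
      assertThat(deletes).hasSize(2);
      assertThat(deletes.get(0)[0]).isEqualTo(0L);
      assertThat(deletes.get(1)[0]).isEqualTo(1L);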
