feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) #1777
What issue does this PR close?
Partially addresses #1749.
Rationale for this change
Background: This issue was discovered when running Iceberg Java's test suite against our experimental DataFusion Comet branch that uses iceberg-rust. Many failures occurred in `TestMigrateTableAction.java`, which tests reading Parquet files from migrated tables (e.g., from Hive or Spark) that lack embedded field ID metadata.

Problem: The Rust ArrowReader was unable to read these files, while Iceberg Java handles them using a position-based fallback: top-level field ID N maps to top-level Parquet column position N-1, and entire columns (including nested content) are projected.
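To illustrate the mapping rule, here is a minimal sketch in Rust; `fallback_positions` is a hypothetical helper written for this description, not code from the PR:

```rust
// Position-based fallback: top-level field ID N maps to top-level
// Parquet column position N - 1. `fallback_positions` is a hypothetical
// helper for illustration only.
fn fallback_positions(projected_field_ids: &[i32]) -> Vec<usize> {
    projected_field_ids
        .iter()
        .map(|field_id| (*field_id - 1) as usize)
        .collect()
}

fn main() {
    // Projecting field IDs [1, 3] selects Parquet root columns [0, 2].
    assert_eq!(fallback_positions(&[1, 3]), vec![0, 2]);
}
```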
What changes are included in this PR?
This PR implements position-based column projection for Parquet files without field IDs, enabling iceberg-rust to read migrated tables.
Solution: Implemented fallback projection in `ArrowReader::get_arrow_projection_mask_fallback()` that matches Java's `ParquetSchemaUtil.pruneColumnsFallback()` behavior:

- Maps top-level field ID N to top-level Parquet column position N-1
- Uses `ProjectionMask::roots()` to project entire columns, including nested content (structs, lists, maps)
- Leaves schema reconciliation, such as filling missing columns with NULLs, to the existing `RecordBatchTransformer` (no changes to `RecordBatchTransformer` were needed)

This implementation now matches Iceberg Java's behavior for reading migrated tables, enabling interoperability with Java-based tooling and workflows. A sketch of the mask construction follows.
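As a rough sketch (not the PR's actual code), a fallback mask could be built with the parquet crate's `ProjectionMask::roots()`; the helper name and signature here are assumptions for illustration:

```rust
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

/// Hypothetical helper: build a fallback projection mask for a file with no
/// field IDs. Each projected top-level Iceberg field ID N selects Parquet
/// root column N - 1; `ProjectionMask::roots` keeps the whole column, so
/// nested content (structs, lists, maps) comes along with it.
fn fallback_projection_mask(
    parquet_schema: &SchemaDescriptor,
    projected_field_ids: &[i32],
) -> ProjectionMask {
    let num_roots = parquet_schema.root_schema().get_fields().len();
    let root_indices = projected_field_ids
        .iter()
        .map(|field_id| (*field_id - 1) as usize)
        // Sketch-level simplification: skip field IDs with no physical
        // column (schema evolution); per the PR's tests, such missing
        // columns are filled with NULLs downstream.
        .filter(|&pos| pos < num_roots);
    ProjectionMask::roots(parquet_schema, root_indices)
}
```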
Are these changes tested?
Yes, comprehensive unit tests were added to verify the fallback path works correctly:
- `test_read_parquet_file_without_field_ids` - Basic projection with primitive columns using position-based mapping
- `test_read_parquet_without_field_ids_partial_projection` - Project a subset of columns
- `test_read_parquet_without_field_ids_schema_evolution` - Handle missing columns with NULL values
- `test_read_parquet_without_field_ids_multiple_row_groups` - Verify behavior across row group boundaries
- `test_read_parquet_without_field_ids_with_struct` - Project structs with nested fields (the entire top-level column)
- `test_read_parquet_without_field_ids_filter_eliminates_all_rows` - Comet saw a panic when all row groups were filtered out; this reproduces that scenario
- `test_read_parquet_without_field_ids_schema_evolution_add_column_in_middle` - Schema evolution with a column added in the middle caused a panic at one point

All tests verify that the behavior matches Iceberg Java's `pruneColumnsFallback()` implementation in `/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java`.
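For context, a minimal sketch of how a test can produce such a "migrated" file: the Arrow fields carry no `PARQUET:field_id` metadata, so the written Parquet schema has no embedded field IDs and the reader must fall back to position-based projection. The helper name and column layout are illustrative, not taken from the PR's test code:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

/// Hypothetical test helper: write a Parquet file whose schema carries no
/// embedded field IDs, mimicking a table migrated from Hive or Spark.
fn write_file_without_field_ids(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // No "PARQUET:field_id" metadata on these fields, so the resulting
    // Parquet schema has no embedded field IDs.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false), // position 0 <- field ID 1
        Field::new("name", DataType::Utf8, true), // position 1 <- field ID 2
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef,
            Arc::new(StringArray::from(vec![Some("a"), None, Some("c")])) as ArrayRef,
        ],
    )?;
    let file = std::fs::File::create(path)?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```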