Implement column projection #1443

gabeiglio · 2024-12-18T20:26:02Z

This is a fix for issue #1401, in which table scans need to infer partition columns by following the column projection rules.

Fixes #1401

kevinjqliu

Added a few comments, please take a look! The PR looks great already. Thanks for working on this!

…tion logic to helper method, changed test to use high-level table scan

…test

kevinjqliu

generally LGTM! I added a few nit comments and some clarifying questions on testing.

thanks for working on this!

kevinjqliu · 2025-01-20T18:57:15Z

pyiceberg/io/pyarrow.py

+        project_schema_diff = projected_field_ids.difference(file_project_schema.field_ids)
+        should_project_columns = len(project_schema_diff) > 0
+
+        projected_missing_fields = {}
+
+        if should_project_columns and partition_spec is not None:
+            projected_missing_fields = _get_column_projection_values(
+                task.file, projected_schema, project_schema_diff, partition_spec
+            )


Nit: wdyt about structuring the code like this?

Suggested change

-        project_schema_diff = projected_field_ids.difference(file_project_schema.field_ids)
-        should_project_columns = len(project_schema_diff) > 0
-
-        projected_missing_fields = {}
-
-        if should_project_columns and partition_spec is not None:
-            projected_missing_fields = _get_column_projection_values(
-                task.file, projected_schema, project_schema_diff, partition_spec
-            )
+        should_project_columns, projected_missing_fields = _get_column_projection_values(
+            task.file, projected_schema, partition_spec
+        )

and in _get_column_projection_values, move the rest of the logic

def _get_column_projection_values(...):
    project_schema_diff = projected_field_ids.difference(file_project_schema.field_ids)
    should_project_columns = len(project_schema_diff) > 0
    projected_missing_fields = {}
    if not should_project_columns:
        return False, {}
    ...
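For illustration only, the shape of that refactor could look roughly like the sketch below. It does not use the real pyiceberg types: `IdentityField`, the plain dict of partition values, and the choice to key the result by source field id are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Any, Dict, Set, Tuple


@dataclass(frozen=True)
class IdentityField:
    """Hypothetical stand-in for an identity-transform partition field."""

    source_id: int  # field id of the source column in the table schema
    name: str  # name of the partition field


def get_column_projection_values(
    partition_values: Dict[str, Any],
    projected_field_ids: Set[int],
    file_field_ids: Set[int],
    identity_fields: Tuple[IdentityField, ...],
) -> Tuple[bool, Dict[int, Any]]:
    """Return (should_project, values keyed by source field id).

    The caller no longer computes the schema diff; the helper does it
    and reports whether any projection is needed at all.
    """
    missing = projected_field_ids - file_field_ids
    if not missing:
        return False, {}

    # Only identity-partitioned fields that are missing from the file
    # can be filled from the partition values.
    values = {
        field.source_id: partition_values[field.name]
        for field in identity_fields
        if field.source_id in missing and field.name in partition_values
    }
    return True, values
```

The call site then reduces to a single tuple unpacking, as in the suggested change above.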

kevinjqliu · 2025-01-20T19:04:16Z

tests/io/test_pyarrow.py

+    partition_spec = PartitionSpec(
+        PartitionField(2, 1000, VoidTransform(), "void_partition_id"),
+        PartitionField(2, 1001, IdentityTransform(), "partition_id"),
+    )


I think we'd want to test multiple IdentityTransforms here.

I'm thinking about a case with multiple levels of Hive-style partitioning:

s3://my_table/a=100/b=foo/...parquet

I think _get_column_projection_values might not support this right now.
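To make the expected behavior concrete, here is a rough sketch of what a read should produce when two identity partition columns (`a` and `b`) are both absent from the data file. It uses plain dicts rather than pyiceberg or pyarrow objects; `fill_identity_partition_columns` and its parameters are hypothetical names, not part of the library.

```python
from typing import Any, Dict, List


def fill_identity_partition_columns(
    rows: List[Dict[str, Any]],
    partition_record: Dict[str, Any],
    requested_columns: List[str],
) -> List[Dict[str, Any]]:
    """Fill each requested column that the file rows lack with the
    constant value from the data file's partition record."""
    filled_rows = []
    for row in rows:
        filled = {}
        for column in requested_columns:
            if column in row:
                filled[column] = row[column]
            else:
                # Falls back to None when the column is in neither the
                # file nor the partition record.
                filled[column] = partition_record.get(column)
        filled_rows.append(filled)
    return filled_rows


# Rows as stored in the file: the partition columns a and b are not written.
rows = [{"other_field": "x"}, {"other_field": "y"}]
# Partition values for a file laid out under a=100/b=foo.
partition_record = {"a": 100, "b": "foo"}

result = fill_identity_partition_columns(rows, partition_record, ["other_field", "a", "b"])
```

A test with two IdentityTransform fields would assert that every row comes back with both partition values filled in, not just the first one.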

Gabriel Igliozzi and others added 3 commits December 18, 2024 12:01

Initial commit for fix

f814ee1

Add test and commit lint changes

cf36660

Merge branch 'apache:main' into specPartitionIdentity

2fb6a16

Fokko self-requested a review December 18, 2024 20:38

Gabriel Igliozzi added 3 commits December 19, 2024 00:19

default-value bug fixes and adding more tests

7982465

Add continue, check file_schema before using it, group steps of projection together

e4d5882

Fix lint issues, reorder partition spec to be of higher importance than initial-default

694a52d

gabeiglio marked this pull request as ready for review December 19, 2024 15:12

kevinjqliu reviewed Dec 19, 2024

Fokko reviewed Dec 20, 2024
