Implement column projection #1443
Added a few comments, please take a look! The PR looks great already. Thanks for working on this!
…tion logic to helper method, changed test to use high-level table scan
Generally LGTM! I added a few nit comments and some clarifying questions on testing.
Thanks for working on this!
project_schema_diff = projected_field_ids.difference(file_project_schema.field_ids)
should_project_columns = len(project_schema_diff) > 0

projected_missing_fields = {}

if should_project_columns and partition_spec is not None:
    projected_missing_fields = _get_column_projection_values(
        task.file, projected_schema, project_schema_diff, partition_spec
    )
Nit: wdyt about structuring the code like this?
- project_schema_diff = projected_field_ids.difference(file_project_schema.field_ids)
- should_project_columns = len(project_schema_diff) > 0
- projected_missing_fields = {}
- if should_project_columns and partition_spec is not None:
-     projected_missing_fields = _get_column_projection_values(
-         task.file, projected_schema, project_schema_diff, partition_spec
-     )
+ should_project_columns, projected_missing_fields = _get_column_projection_values(
+     task.file, projected_schema, partition_spec
+ )
and in _get_column_projection_values, move the rest of the logic:
def _get_column_projection_values(...):
    project_schema_diff = projected_field_ids.difference(file_project_schema.field_ids)
    should_project_columns = len(project_schema_diff) > 0
    projected_missing_fields = {}
    if not should_project_columns:
        return False, {}
    ...
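To make the shape of that suggestion concrete, here is a minimal, self-contained sketch of a helper that returns both the flag and the projected values. It is not the PR's implementation: the function name without the leading underscore, the plain `set`/`dict` arguments, and keying the result by field ID are all assumptions made so the example runs without PyIceberg.

```python
from typing import Any, Dict, Set, Tuple


def get_column_projection_values(
    projected_field_ids: Set[int],
    file_field_ids: Set[int],
    partition_values: Dict[int, Any],
) -> Tuple[bool, Dict[int, Any]]:
    """Hypothetical stand-in for the helper discussed above.

    Returns (should_project_columns, projected_missing_fields).
    """
    # Field IDs requested by the scan but absent from the data file.
    project_schema_diff = projected_field_ids.difference(file_field_ids)
    if not project_schema_diff:
        # Nothing is missing, so the caller can skip projection entirely.
        return False, {}

    # Assumption: partition_values maps a source field ID to the value
    # stored in the file's partition metadata.
    projected_missing_fields = {
        field_id: partition_values[field_id]
        for field_id in sorted(project_schema_diff)
        if field_id in partition_values
    }
    return True, projected_missing_fields
```

With this shape the call site collapses to a single tuple unpacking, and the early return keeps the "nothing to project" path cheap.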
partition_spec = PartitionSpec(
    PartitionField(2, 1000, VoidTransform(), "void_partition_id"),
    PartitionField(2, 1001, IdentityTransform(), "partition_id"),
)
I think we'd want to test multiple IdentityTransforms here.
I'm thinking about a case with multiple levels of Hive-style partitioning, e.g.
s3://my_table/a=100/b=foo/...parquet
I think _get_column_projection_values might not support this right now.
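A rough sketch of the multi-level case, assuming one identity partition field per directory level (a and b in the path above). `SketchPartitionField` and `project_identity_partitions` are hypothetical stand-ins, not PyIceberg classes; the point is only that every identity field with a missing source column should contribute a value, not just the first one.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set


@dataclass(frozen=True)
class SketchPartitionField:
    # Hypothetical stand-in for a partition field: the source column ID,
    # the partition field ID, a transform name, and the partition name.
    source_id: int
    field_id: int
    transform: str  # "identity" or "void" in this sketch
    name: str


def project_identity_partitions(
    fields: List[SketchPartitionField],
    partition_record: List[Any],
    missing_source_ids: Set[int],
) -> Dict[str, Any]:
    """Collect a value for every identity-partitioned column missing from the file."""
    projected: Dict[str, Any] = {}
    for position, field in enumerate(fields):
        # Only identity transforms carry the original column value.
        if field.transform == "identity" and field.source_id in missing_source_ids:
            projected[field.name] = partition_record[position]
    return projected


# Two partition levels, mirroring a=100/b=foo.
spec = [
    SketchPartitionField(1, 1000, "identity", "a"),
    SketchPartitionField(2, 1001, "identity", "b"),
]
result = project_identity_partitions(spec, [100, "foo"], {1, 2})
```

A test along these lines would fail if the helper only resolved a single partition field.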
This is a fix for issue #1401, in which table scans need to infer partition columns by following the column projection rules.
Fixes #1401