-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove initial_change
when dealing with table updates
#950
Conversation
pyiceberg/table/__init__.py
Outdated
@@ -1129,12 +1128,22 @@ def _(update: SetSnapshotRefUpdate, base_metadata: TableMetadata, context: _Tabl | |||
return base_metadata.model_copy(update=metadata_updates) | |||
|
|||
|
|||
@_apply_table_update.register(RemoveSnapshotRefUpdate) | |||
def _(update: RemoveSnapshotRefUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kevinjqliu, I've already implemented _apply_table_update
for RemoveSnapshotRefUpdate
and the public facing remove branch / tag apis in #822, which is waiting on #758.
If you need to add this functionality urgently, please feel free to use the code in the linked PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! Thanks for the heads up. I can rebase once those PRs are in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it'll be helpful to enforce any new TableUpdate class to have a corresponding update function?
Like in #952
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would helpful! In case a feature is not implemented, like RemoveSnapshotRefUpdate
and RemoveSnapshotsUpdate
, the test should atleast print a message saying so.
It would help keep track of which features are / are not implemented, and won't be surprise to the end user or us.
3bb0bbf
to
90bacbe
Compare
initial_change
when dealing with table updates
@HonahX do you mind taking a look at this when you get a chance? |
90bacbe
to
02d46d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being late. Thanks for the PR! The initial_change
was a workaround when createTableTransaction
was added. It will be great if we can find a better way to handle the case without this additional parameter.
@@ -104,8 +102,6 @@ class AddPartitionSpecUpdate(IcebergBaseModel): | |||
action: Literal["add-spec"] = Field(default="add-spec") | |||
spec: PartitionSpec | |||
|
|||
initial_change: bool = Field(default=False, exclude=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of removing it directly, shall we go through a deprecation process given this is a public class? We could add a deprecation message (via field validator?) when this field is set explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 makes sense!
context.add_update(update) | ||
if update.spec.spec_id == INITIAL_PARTITION_SPEC_ID: | ||
# no op | ||
return base_metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to cause problem if I want to create a partitioned table from beginning. For example,
iceberg_schema = Schema(*[NestedField(field_id=1, name="a", field_type=StringType())])
iceberg_spec = PartitionSpec(*[PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name='test1')])
sort_order = SortOrder(*[SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC)])
txn = catalog.create_table_transaction(identifier=identifier, schema=iceberg_schema, partition_spec=iceberg_spec, sort_order=sort_order)
txn.commit_transaction()
tbl = catalog.load_table(identifier)
print("=====Schemas====")
print(tbl.schemas())
print("=====Specs====")
print(tbl.specs())
print("=====SortOrders====")
print(tbl.sort_orders())
=====Schemas====
{0: Schema(NestedField(field_id=1, name='a', field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])}
=====Specs====
{0: PartitionSpec(spec_id=0)}
=====SortOrders====
{0: SortOrder(order_id=0), 1: SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), order_id=1)}
Although iceberg_spec
is given, the table is still created with UNPARTITIONED_PARTITION_SPEC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the example.
on a meta level, this is the type of bug I'm afraid of when refactoring... how can we ensure other cases like this are captured
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe adding more tests for this specific case would be helpful. While we have extensive coverage for the logic of updating existing metadata, there are very few tests for create_table_transaction
, where updates are applied to an empty metadata set to generate the entire metadata. Since this logic is unique to create_table_transaction
, errors related to it are not detected by the current tests.
@_apply_table_update.register(AddSortOrderUpdate) | ||
def _(update: AddSortOrderUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: | ||
context.add_update(update) | ||
if update.sort_order == UNSORTED_SORT_ORDER: | ||
# no op | ||
return base_metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As shown in the example above, if I specify a SortOrder in the beginning, I end up getting a table with an additional empty SortOrder (UNSORTED)
=====SortOrders====
{0: SortOrder(order_id=0), 1: SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), order_id=1)}
Hi @kevinjqliu I ran a few experiments and found that removing I’ve implemented this approach and created a draft PR: #1219. I’d love to hear your thoughts on it! |
Closing this in favor of #1219 |
Closes #864
Identified in #864,
TableMetadata
is initialized with the default Pydantic object for schema, partition_spec, and sort_order, which does not play well with table updates. Specifically, theinitial_change
field is an implementation detail of pyiceberg and does not play well when interacting with the REST API. Table update objects from the REST API does not understand this field.We can safely remove
initial_change
by modifying the logic for dealing with table updates.