Remove `initial_change` when dealing with table updates #950

kevinjqliu · 2024-07-21T18:05:40Z

Closes #864

Identified in #864, TableMetadata is initialized with the default Pydantic object for schema, partition_spec, and sort_order, which does not play well with table updates. Specifically, the initial_change field is an implementation detail of pyiceberg and does not play well when interacting with the REST API. Table update objects from the REST API do not understand this field.
We can safely remove initial_change by modifying the logic for dealing with table updates.

chinmay-bhat · 2024-07-23T14:00:13Z

pyiceberg/table/__init__.py

@@ -1129,12 +1128,22 @@ def _(update: SetSnapshotRefUpdate, base_metadata: TableMetadata, context: _Tabl
    return base_metadata.model_copy(update=metadata_updates)


+@_apply_table_update.register(RemoveSnapshotRefUpdate)
+def _(update: RemoveSnapshotRefUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata:


Hi @kevinjqliu, I've already implemented _apply_table_update for RemoveSnapshotRefUpdate and the public facing remove branch / tag apis in #822, which is waiting on #758.

If you need to add this functionality urgently, please feel free to use the code in the linked PRs.

Cool! Thanks for the heads up. I can rebase once those PRs are in.

Do you think it'll be helpful to enforce any new TableUpdate class to have a corresponding update function?
Like in #952

I think that would be helpful! In case a feature is not implemented, like RemoveSnapshotRefUpdate and RemoveSnapshotsUpdate, the test should at least print a message saying so.
It would help keep track of which features are / are not implemented, and it won't be a surprise to the end user or to us.
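A minimal sketch of such a coverage check, using only the standard library. The class names below are stand-ins for the real update classes, `apply_table_update` is a hypothetical analogue of `_apply_table_update`, and metadata is modelled as a plain dict. This is an illustration of the idea, not the test proposed in #952.

```python
from functools import singledispatch


class TableUpdate:
    """Hypothetical stand-in for the base class of all table updates."""


class AddSchemaUpdate(TableUpdate):
    pass


class RemoveSnapshotRefUpdate(TableUpdate):
    pass


@singledispatch
def apply_table_update(update, base_metadata):
    # Fallback used when no handler is registered for the update type.
    raise NotImplementedError(f"Unsupported table update: {type(update).__name__}")


@apply_table_update.register(AddSchemaUpdate)
def _(update, base_metadata):
    # Toy handler: count how many schemas have been added.
    return {**base_metadata, "schemas": base_metadata.get("schemas", 0) + 1}


def unimplemented_updates():
    """Return names of direct TableUpdate subclasses that hit the fallback."""
    fallback = apply_table_update.dispatch(object)
    return sorted(
        cls.__name__
        for cls in TableUpdate.__subclasses__()
        if apply_table_update.dispatch(cls) is fallback
    )


print("Not implemented:", unimplemented_updates())
```

A test could either assert that this list is empty or, as suggested above, print it so that missing handlers are visible without failing the build.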

kevinjqliu · 2024-09-11T17:32:24Z

@HonahX do you mind taking a look at this when you get a chance?

HonahX

Sorry for being late. Thanks for the PR! The initial_change was a workaround when createTableTransaction was added. It would be great if we can find a better way to handle the case without this additional parameter.

HonahX · 2024-09-30T08:37:15Z

pyiceberg/table/update/__init__.py

@@ -104,8 +102,6 @@ class AddPartitionSpecUpdate(IcebergBaseModel):
    action: Literal["add-spec"] = Field(default="add-spec")
    spec: PartitionSpec

-    initial_change: bool = Field(default=False, exclude=True)


Instead of removing it directly, shall we go through a deprecation process given this is a public class? We could add a deprecation message (via field validator?) when this field is set explicitly.

+1 makes sense!
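One way the deprecation could look, sketched with a standard-library dataclass standing in for the Pydantic model. The `None` sentinel default and the warning text are assumptions made for illustration; the real class is an `IcebergBaseModel`, where a field validator could play the role of `__post_init__`.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddPartitionSpecUpdate:
    """Hypothetical stdlib stand-in for the Pydantic model of the same name."""

    spec: str
    # None means "not set by the caller"; the value no longer affects behavior.
    initial_change: Optional[bool] = None

    def __post_init__(self):
        # Warn only when the caller passed the deprecated field explicitly.
        if self.initial_change is not None:
            warnings.warn(
                "initial_change is deprecated and will be removed in a future release",
                DeprecationWarning,
                stacklevel=3,  # point at the code that called the constructor
            )
```

Constructing the update without `initial_change` stays silent, so existing callers that never set the field are unaffected.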

HonahX · 2024-09-30T08:48:35Z

pyiceberg/table/update/__init__.py

+    context.add_update(update)
+    if update.spec.spec_id == INITIAL_PARTITION_SPEC_ID:
+        # no op
+        return base_metadata


This seems to cause a problem if I want to create a partitioned table from the beginning. For example:

    iceberg_schema = Schema(*[NestedField(field_id=1, name="a", field_type=StringType())])
    iceberg_spec = PartitionSpec(*[PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name='test1')])
    sort_order = SortOrder(*[SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC)])
    txn = catalog.create_table_transaction(identifier=identifier, schema=iceberg_schema, partition_spec=iceberg_spec, sort_order=sort_order)
    txn.commit_transaction()
    tbl = catalog.load_table(identifier)
    print("=====Schemas====")
    print(tbl.schemas())
    print("=====Specs====")
    print(tbl.specs())
    print("=====SortOrders====")
    print(tbl.sort_orders())

Output:

    =====Schemas====
    {0: Schema(NestedField(field_id=1, name='a', field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])}
    =====Specs====
    {0: PartitionSpec(spec_id=0)}
    =====SortOrders====
    {0: SortOrder(order_id=0), 1: SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), order_id=1)}

Although iceberg_spec is given, the table is still created with UNPARTITIONED_PARTITION_SPEC.
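The failure mode can be reproduced with a toy model of the no-op check. Everything here is a simplification: metadata and specs are plain dicts, `apply_add_spec` is a hypothetical stand-in for the registered handler, and it is assumed that the user-supplied spec carries the same id as the default unpartitioned spec, which the `spec_id=0` in the output above suggests.

```python
INITIAL_PARTITION_SPEC_ID = 0


def empty_metadata():
    # Mimics metadata that is pre-populated with a default, unpartitioned spec.
    return {"specs": [{"spec_id": INITIAL_PARTITION_SPEC_ID, "fields": []}]}


def apply_add_spec(metadata, spec):
    # Toy version of the check under review: treat spec id 0 as a no-op.
    if spec["spec_id"] == INITIAL_PARTITION_SPEC_ID:
        return metadata
    return {**metadata, "specs": metadata["specs"] + [spec]}


# A user-supplied spec with one partition field, but the same id as the default.
requested = {"spec_id": 0, "fields": ["test1"]}
result = apply_add_spec(empty_metadata(), requested)

# The requested partition field is silently dropped.
print(result["specs"])
```

Comparing on the id alone cannot distinguish the default spec from a real spec that happens to reuse that id, which is why the requested field never reaches the metadata in this sketch.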

Thanks for the example.

On a meta level, this is the type of bug I'm afraid of when refactoring. How can we ensure other cases like this are captured?

I believe adding more tests for this specific case would be helpful. While we have extensive coverage for the logic of updating existing metadata, there are very few tests for create_table_transaction, where updates are applied to an empty metadata set to generate the entire metadata. Since this logic is unique to create_table_transaction, errors related to it are not detected by the current tests.

HonahX · 2024-09-30T08:51:40Z

pyiceberg/table/update/__init__.py

 @_apply_table_update.register(AddSortOrderUpdate)
 def _(update: AddSortOrderUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata:
    context.add_update(update)
+    if update.sort_order == UNSORTED_SORT_ORDER:
+        # no op
+        return base_metadata


As shown in the example above, if I specify a SortOrder in the beginning, I end up getting a table with an additional empty SortOrder (UNSORTED)

    =====SortOrders====
    {0: SortOrder(order_id=0), 1: SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), order_id=1)}

HonahX · 2024-10-06T08:48:08Z

Hi @kevinjqliu I ran a few experiments and found that removing initial_change would be challenging unless we can temporarily disable Pydantic’s validators during update_table_metadata. Fortunately, I found that Pydantic’s model_construct can help bypass the validators in this context.

I’ve implemented this approach and created a draft PR: #1219. I’d love to hear your thoughts on it!

kevinjqliu · 2024-10-29T22:19:11Z

Closing this in favor of #1219

This was referenced Jul 21, 2024

[🐞] Collection of a few bugs #864

Closed

Add test to ensure every table update has corresponding _apply_table_update function #952

Open

chinmay-bhat reviewed Jul 23, 2024

kevinjqliu force-pushed the kevinjqliu/iceberg-rest-catalog branch from 3bb0bbf to 90bacbe Compare September 10, 2024 17:38

kevinjqliu marked this pull request as ready for review September 10, 2024 18:05

kevinjqliu changed the title ~~[wip] Table updates~~ Remove initial_change when dealing with table updates Sep 11, 2024

kevinjqliu requested review from Fokko, sungwy and HonahX September 11, 2024 17:32

remove initial_change

02d46d5

kevinjqliu force-pushed the kevinjqliu/iceberg-rest-catalog branch from 90bacbe to 02d46d5 Compare September 13, 2024 19:11

HonahX reviewed Sep 30, 2024

HonahX mentioned this pull request Oct 6, 2024

Remove initial_change when CreateTableTransaction apply table updates on an empty metadata #1219

Merged

kevinjqliu closed this Oct 29, 2024

kevinjqliu deleted the kevinjqliu/iceberg-rest-catalog branch October 29, 2024 22:19
