Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove initial_change when dealing with table updates #950

Closed

Conversation

kevinjqliu
Copy link
Contributor

@kevinjqliu kevinjqliu commented Jul 21, 2024

Closes #864

Identified in #864, TableMetadata is initialized with the default Pydantic object for schema, partition_spec, and sort_order, which does not play well with table updates. Specifically, the initial_change field is an implementation detail of pyiceberg and does not play well when interacting with the REST API. Table update objects from the REST API does not understand this field.
We can safely remove initial_change by modifying the logic for dealing with table updates.

@@ -1129,12 +1128,22 @@ def _(update: SetSnapshotRefUpdate, base_metadata: TableMetadata, context: _Tabl
return base_metadata.model_copy(update=metadata_updates)


@_apply_table_update.register(RemoveSnapshotRefUpdate)
def _(update: RemoveSnapshotRefUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata:
Copy link
Contributor

@chinmay-bhat chinmay-bhat Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kevinjqliu, I've already implemented _apply_table_update for RemoveSnapshotRefUpdate and the public facing remove branch / tag apis in #822, which is waiting on #758.

If you need to add this functionality urgently, please feel free to use the code in the linked PRs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Thanks for the heads up. I can rebase once those PRs are in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it'll be helpful to enforce any new TableUpdate class to have a corresponding update function?
Like in #952

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would helpful! In case a feature is not implemented, like RemoveSnapshotRefUpdate and RemoveSnapshotsUpdate, the test should atleast print a message saying so.
It would help keep track of which features are / are not implemented, and won't be surprise to the end user or us.

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/iceberg-rest-catalog branch from 3bb0bbf to 90bacbe Compare September 10, 2024 17:38
@kevinjqliu kevinjqliu marked this pull request as ready for review September 10, 2024 18:05
@kevinjqliu kevinjqliu changed the title [wip] Table updates Remove initial_change when dealing with table updates Sep 11, 2024
@kevinjqliu
Copy link
Contributor Author

@HonahX do you mind taking a look at this when you get a chance?

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/iceberg-rest-catalog branch from 90bacbe to 02d46d5 Compare September 13, 2024 19:11
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being late. Thanks for the PR! The initial_change was a workaround when createTableTransaction was added. It will be great if we can find a better way to handle the case without this additional parameter.

@@ -104,8 +102,6 @@ class AddPartitionSpecUpdate(IcebergBaseModel):
action: Literal["add-spec"] = Field(default="add-spec")
spec: PartitionSpec

initial_change: bool = Field(default=False, exclude=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of removing it directly, shall we go through a deprecation process given this is a public class? We could add a deprecation message (via field validator?) when this field is set explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 makes sense!

context.add_update(update)
if update.spec.spec_id == INITIAL_PARTITION_SPEC_ID:
# no op
return base_metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to cause problem if I want to create a partitioned table from beginning. For example,

    iceberg_schema = Schema(*[NestedField(field_id=1, name="a", field_type=StringType())])
    iceberg_spec = PartitionSpec(*[PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name='test1')])
    sort_order = SortOrder(*[SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC)])

    txn = catalog.create_table_transaction(identifier=identifier, schema=iceberg_schema, partition_spec=iceberg_spec, sort_order=sort_order)
    txn.commit_transaction()

    tbl = catalog.load_table(identifier)

    print("=====Schemas====")
    print(tbl.schemas())
    print("=====Specs====")
    print(tbl.specs())
    print("=====SortOrders====")
    print(tbl.sort_orders())


=====Schemas====
{0: Schema(NestedField(field_id=1, name='a', field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])}
=====Specs====
{0: PartitionSpec(spec_id=0)}
=====SortOrders====
{0: SortOrder(order_id=0), 1: SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), order_id=1)}

Although iceberg_spec is given, the table is still created with UNPARTITIONED_PARTITION_SPEC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the example.

on a meta level, this is the type of bug I'm afraid of when refactoring... how can we ensure other cases like this are captured

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe adding more tests for this specific case would be helpful. While we have extensive coverage for the logic of updating existing metadata, there are very few tests for create_table_transaction, where updates are applied to an empty metadata set to generate the entire metadata. Since this logic is unique to create_table_transaction, errors related to it are not detected by the current tests.

@_apply_table_update.register(AddSortOrderUpdate)
def _(update: AddSortOrderUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata:
context.add_update(update)
if update.sort_order == UNSORTED_SORT_ORDER:
# no op
return base_metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As shown in the example above, if I specify a SortOrder in the beginning, I end up getting a table with an additional empty SortOrder (UNSORTED)

=====SortOrders====
{0: SortOrder(order_id=0), 1: SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), order_id=1)}

@HonahX
Copy link
Contributor

HonahX commented Oct 6, 2024

Hi @kevinjqliu I ran a few experiments and found that removing initial_change would be challenging unless we can temporarily disable Pydantic’s validators during update_table_metadata. Fortunately, I found that Pydantic’s model_construct can help bypass the validators in this context.

I’ve implemented this approach and created a draft PR: #1219. I’d love to hear your thoughts on it!

@kevinjqliu
Copy link
Contributor Author

Closing this in favor of #1219

@kevinjqliu kevinjqliu closed this Oct 29, 2024
@kevinjqliu kevinjqliu deleted the kevinjqliu/iceberg-rest-catalog branch October 29, 2024 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[🐞] Collection of a few bugs
3 participants