-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug Fix: Rest Catalog update partition spec and sort order when fresh schema is created #392
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know if this is captured by the REST integration test? would be a good testcase if not
Yes, I'm in the process of writing it up now :) Thank you for the suggestion @kevinjqliu |
btw, i have a toy REST catalog here. its easier than spinning up the integration docker image every time.
Points to this the only thing in the catalog now is from the "getting started" guide |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments on equality check.
Also, it looks like we are reassigning id
s from 1. I'm not familiar with this process so bare with me. Is there a reason why we can't create the new table with the original schema, starting at id 4 & 5?
PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name="col_uuid"), spec_id=0 | ||
) | ||
expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform())) | ||
assert tbl.schema() == expected_schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, should we also check tbl.schema().schema_id
since Schema
's __eq__
doesn't check for that
iceberg-python/pyiceberg/schema.py
Lines 104 to 118 in cec051f
def __eq__(self, other: Any) -> bool: | |
"""Return the equality of two instances of the Schema class.""" | |
if not other: | |
return False | |
if not isinstance(other, Schema): | |
return False | |
if len(self.columns) != len(other.columns): | |
return False | |
identifier_field_ids_is_equal = self.identifier_field_ids == other.identifier_field_ids | |
schema_is_equal = all(lhs == rhs for lhs, rhs in zip(self.columns, other.columns)) | |
return identifier_field_ids_is_equal and schema_is_equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. What do you think about just asserting that the schema_id is 0? Since it will always be 0 on creation?
) | ||
expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform())) | ||
assert tbl.schema() == expected_schema | ||
assert tbl.spec() == expected_spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PartitionSpec
's __eq__
checks for the spec_id
iceberg-python/pyiceberg/partitioning.py
Lines 110 to 119 in cec051f
def __eq__(self, other: Any) -> bool: | |
""" | |
Produce a boolean to return True if two objects are considered equal. | |
Note: | |
Equality of PartitionSpec is determined by spec_id and partition fields only. | |
""" | |
if not isinstance(other, PartitionSpec): | |
return False | |
return self.spec_id == other.spec_id and self.fields == other.fields |
expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform())) | ||
assert tbl.schema() == expected_schema | ||
assert tbl.spec() == expected_spec | ||
assert tbl.sort_order() == expected_sort_order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SortOrder
doesn't seem to have a __eq__
function defined.
iceberg-python/pyiceberg/table/sorting.py
Lines 127 to 164 in cec051f
class SortOrder(IcebergBaseModel): | |
"""Describes how the data is sorted within the table. | |
Users can sort their data within partitions by columns to gain performance. | |
The order of the sort fields within the list defines the order in which the sort is applied to the data. | |
Args: | |
fields (List[SortField]): The fields how the table is sorted. | |
Keyword Args: | |
order_id (int): An unique id of the sort-order of a table. | |
""" | |
order_id: int = Field(alias="order-id", default=INITIAL_SORT_ORDER_ID) | |
fields: List[SortField] = Field(default_factory=list) | |
def __init__(self, *fields: SortField, **data: Any): | |
if fields: | |
data["fields"] = fields | |
super().__init__(**data) | |
@property | |
def is_unsorted(self) -> bool: | |
return len(self.fields) == 0 | |
def __str__(self) -> str: | |
"""Return the string representation of the SortOrder class.""" | |
result_str = "[" | |
if self.fields: | |
result_str += "\n " + "\n ".join([str(field) for field in self.fields]) + "\n" | |
result_str += "]" | |
return result_str | |
def __repr__(self) -> str: | |
"""Return the string representation of the SortOrder class.""" | |
fields = f"{', '.join(repr(column) for column in self.fields)}, " if self.fields else "" | |
return f"SortOrder({fields}order_id={self.order_id})" |
What is the behavior here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it would invoke the eq of IcebergBaseModel -> BaseModel, which checks the equality of all of the fields of the BaseModel
I think we inherit this logic from the java code. When a new table is created, simply a new table is created - so we just assign IDs in pre-order traversal order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @syun64 for working on this and @kevinjqliu for reviewing.
Also, it looks like we are reassigning ids from 1. I'm not familiar with this process so bare with me. Is there a reason why we can't create the new table with the original schema, starting at id 4 & 5?
Adding to @syun64's explanation, doing this in RestCatalog Client is safer option in case the server implementation does not do this by default: #327 (review)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch @syun64 thanks for fixing this 👍
Right now, if fresh schema IDs are assigned on the Iceberg table schema that doesn't map exactly to the schema that was originally passed in, the partition and sort order may map to a different field ID.
When fresh schema ID is assigned, the partition spec and sort order must also be generated based on the fresh schema before CreateTableRequest is sent to the rest catalog.
Thank you @jqin61 for identifying the issue!