Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Fix: Rest Catalog update partition spec and sort order when fresh schema is created #392

Merged
merged 4 commits into from
Feb 8, 2024

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented Feb 7, 2024

Right now, if fresh schema IDs are assigned on the Iceberg table schema that doesn't map exactly to the schema that was originally passed in, the partition and sort order may map to a different field ID.

When fresh schema ID is assigned, the partition spec and sort order must also be generated based on the fresh schema before CreateTableRequest is sent to the rest catalog.

Thank you @jqin61 for identifying the issue!

@sungwy sungwy requested a review from Fokko February 7, 2024 20:57
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if this is captured by the REST integration test? would be a good testcase if not

@sungwy
Copy link
Collaborator Author

sungwy commented Feb 7, 2024

do you know if this is captured by the REST integration test? would be a good testcase if not

Yes, I'm in the process of writing it up now :)

Thank you for the suggestion @kevinjqliu

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Feb 7, 2024

btw, i have a toy REST catalog here. its easier than spinning up the integration docker image every time.

~/.pyiceberg.yaml

catalog:
  default:
    uri: https://iceberg-rest-image.fly.dev

Points to this
https://iceberg-rest-image.fly.dev/v1/namespaces/default/tables/taxi_dataset

the only thing in the catalog now is from the "getting started" guide

@sungwy sungwy requested a review from kevinjqliu February 7, 2024 21:24
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments on equality check.

Also, it looks like we are reassigning ids from 1. I'm not familiar with this process so bare with me. Is there a reason why we can't create the new table with the original schema, starting at id 4 & 5?

PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name="col_uuid"), spec_id=0
)
expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))
assert tbl.schema() == expected_schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, should we also check tbl.schema().schema_id since Schema's __eq__ doesn't check for that

def __eq__(self, other: Any) -> bool:
"""Return the equality of two instances of the Schema class."""
if not other:
return False
if not isinstance(other, Schema):
return False
if len(self.columns) != len(other.columns):
return False
identifier_field_ids_is_equal = self.identifier_field_ids == other.identifier_field_ids
schema_is_equal = all(lhs == rhs for lhs, rhs in zip(self.columns, other.columns))
return identifier_field_ids_is_equal and schema_is_equal

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. What do you think about just asserting that the schema_id is 0? Since it will always be 0 on creation?

)
expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))
assert tbl.schema() == expected_schema
assert tbl.spec() == expected_spec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PartitionSpec's __eq__ checks for the spec_id

def __eq__(self, other: Any) -> bool:
"""
Produce a boolean to return True if two objects are considered equal.
Note:
Equality of PartitionSpec is determined by spec_id and partition fields only.
"""
if not isinstance(other, PartitionSpec):
return False
return self.spec_id == other.spec_id and self.fields == other.fields

expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))
assert tbl.schema() == expected_schema
assert tbl.spec() == expected_spec
assert tbl.sort_order() == expected_sort_order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SortOrder doesn't seem to have a __eq__ function defined.

class SortOrder(IcebergBaseModel):
"""Describes how the data is sorted within the table.
Users can sort their data within partitions by columns to gain performance.
The order of the sort fields within the list defines the order in which the sort is applied to the data.
Args:
fields (List[SortField]): The fields how the table is sorted.
Keyword Args:
order_id (int): An unique id of the sort-order of a table.
"""
order_id: int = Field(alias="order-id", default=INITIAL_SORT_ORDER_ID)
fields: List[SortField] = Field(default_factory=list)
def __init__(self, *fields: SortField, **data: Any):
if fields:
data["fields"] = fields
super().__init__(**data)
@property
def is_unsorted(self) -> bool:
return len(self.fields) == 0
def __str__(self) -> str:
"""Return the string representation of the SortOrder class."""
result_str = "["
if self.fields:
result_str += "\n " + "\n ".join([str(field) for field in self.fields]) + "\n"
result_str += "]"
return result_str
def __repr__(self) -> str:
"""Return the string representation of the SortOrder class."""
fields = f"{', '.join(repr(column) for column in self.fields)}, " if self.fields else ""
return f"SortOrder({fields}order_id={self.order_id})"

What is the behavior here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it would invoke the eq of IcebergBaseModel -> BaseModel, which checks the equality of all of the fields of the BaseModel

@sungwy
Copy link
Collaborator Author

sungwy commented Feb 7, 2024

some comments on equality check.

Also, it looks like we are reassigning ids from 1. I'm not familiar with this process so bare with me. Is there a reason why we can't create the new table with the original schema, starting at id 4 & 5?

I think we inherit this logic from the java code. When a new table is created, simply a new table is created - so we just assign IDs in pre-order traversal order

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @syun64 for working on this and @kevinjqliu for reviewing.

Also, it looks like we are reassigning ids from 1. I'm not familiar with this process so bare with me. Is there a reason why we can't create the new table with the original schema, starting at id 4 & 5?

Adding to @syun64's explanation, doing this in RestCatalog Client is safer option in case the server implementation does not do this by default: #327 (review)

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @syun64 thanks for fixing this 👍

@Fokko Fokko merged commit ef33b9d into apache:main Feb 8, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants