Bug Fix: Rest Catalog update partition spec and sort order when fresh schema is created #392

sungwy · 2024-02-07T20:57:54Z

Right now, if the fresh IDs assigned to the Iceberg table schema don't map exactly to the IDs in the schema that was originally passed in, the partition spec and sort order may point to a different field ID.

When fresh schema IDs are assigned, the partition spec and sort order must also be regenerated against the fresh schema before the CreateTableRequest is sent to the REST catalog.
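
A minimal sketch of the problem and the fix, using only the standard library. `ToyField`, `ToyPartitionField`, `assign_fresh_ids` and `remap_partition_fields` are hypothetical stand-ins and not PyIceberg APIs; the column `col_fixed` and the original IDs 5 and 4 are assumptions chosen for illustration. The sketch handles flat schemas only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToyField:
    field_id: int
    name: str


@dataclass(frozen=True)
class ToyPartitionField:
    source_id: int  # ID of the schema field this partition field is derived from
    name: str


def assign_fresh_ids(fields: List[ToyField]) -> List[ToyField]:
    """Renumber fields 1..n in declaration order (flat schemas only)."""
    return [ToyField(field_id=i, name=f.name) for i, f in enumerate(fields, start=1)]


def remap_partition_fields(
    partition_fields: List[ToyPartitionField],
    old_fields: List[ToyField],
    fresh_fields: List[ToyField],
) -> List[ToyPartitionField]:
    """Point each partition field at the fresh ID of the column it referenced."""
    old_id_to_name = {f.field_id: f.name for f in old_fields}
    name_to_fresh_id = {f.name: f.field_id for f in fresh_fields}
    return [
        ToyPartitionField(source_id=name_to_fresh_id[old_id_to_name[p.source_id]], name=p.name)
        for p in partition_fields
    ]


# Assumed input: the caller passed a schema whose IDs do not start at 1.
original = [ToyField(5, "col_uuid"), ToyField(4, "col_fixed")]
spec = [ToyPartitionField(source_id=5, name="col_uuid")]

fresh = assign_fresh_ids(original)
fresh_spec = remap_partition_fields(spec, original, fresh)
```

Without the remapping step, the spec would still carry `source_id=5`, which no longer refers to any field in the fresh schema. With it, `col_uuid` is tracked by name and the spec ends up with `source_id=1`.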

Thank you @jqin61 for identifying the issue!

kevinjqliu

do you know if this is captured by the REST integration test? would be a good testcase if not

sungwy · 2024-02-07T21:14:39Z

do you know if this is captured by the REST integration test? would be a good testcase if not

Yes, I'm in the process of writing it up now :)

Thank you for the suggestion @kevinjqliu

kevinjqliu · 2024-02-07T21:20:39Z

btw, I have a toy REST catalog here. It's easier than spinning up the integration docker image every time.

~/.pyiceberg.yaml

catalog:
  default:
    uri: https://iceberg-rest-image.fly.dev

Points to this
https://iceberg-rest-image.fly.dev/v1/namespaces/default/tables/taxi_dataset

the only thing in the catalog now is from the "getting started" guide

kevinjqliu

some comments on equality check.

Also, it looks like we are reassigning ids from 1. I'm not familiar with this process so bear with me. Is there a reason why we can't create the new table with the original schema, starting at id 4 & 5?

kevinjqliu · 2024-02-07T21:30:31Z

tests/integration/test_rest_schema.py

+        PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name="col_uuid"), spec_id=0
+    )
+    expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))
+    assert tbl.schema() == expected_schema


nit, should we also check tbl.schema().schema_id since Schema's __eq__ doesn't check for that

iceberg-python/pyiceberg/schema.py

Lines 104 to 118 in cec051f

    def __eq__(self, other: Any) -> bool:
        """Return the equality of two instances of the Schema class."""
        if not other:
            return False

        if not isinstance(other, Schema):
            return False

        if len(self.columns) != len(other.columns):
            return False

        identifier_field_ids_is_equal = self.identifier_field_ids == other.identifier_field_ids
        schema_is_equal = all(lhs == rhs for lhs, rhs in zip(self.columns, other.columns))

        return identifier_field_ids_is_equal and schema_is_equal

Sure. What do you think about just asserting that the schema_id is 0? Since it will always be 0 on creation?
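
To make the point above concrete, here is a small sketch of an `__eq__` that, like the snippet quoted from schema.py, compares columns but not the schema ID. `ToySchema` is a hypothetical stand-in and not the PyIceberg `Schema` class; it is only meant to show why a separate assertion on `schema_id` adds coverage.

```python
from typing import Any, Tuple


class ToySchema:
    def __init__(self, *columns: str, schema_id: int = 0) -> None:
        self.columns: Tuple[str, ...] = columns
        self.schema_id = schema_id

    def __eq__(self, other: Any) -> bool:
        # Mirrors the quoted behavior: schema_id takes no part in equality.
        if not isinstance(other, ToySchema):
            return False
        return self.columns == other.columns


a = ToySchema("col_uuid", "col_fixed", schema_id=0)
b = ToySchema("col_uuid", "col_fixed", schema_id=7)

schemas_equal = a == b                    # columns match, so this is True
ids_equal = a.schema_id == b.schema_id    # the IDs differ, so this is False
```

Under these assumptions, `assert tbl.schema() == expected_schema` alone would pass even if the schema ID were wrong, which is why an explicit check such as asserting that the schema ID is 0 is a reasonable addition.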

kevinjqliu · 2024-02-07T21:31:31Z

tests/integration/test_rest_schema.py

+    )
+    expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))
+    assert tbl.schema() == expected_schema
+    assert tbl.spec() == expected_spec


PartitionSpec's __eq__ checks for the spec_id

iceberg-python/pyiceberg/partitioning.py

Lines 110 to 119 in cec051f

    def __eq__(self, other: Any) -> bool:
        """
        Produce a boolean to return True if two objects are considered equal.

        Note:
            Equality of PartitionSpec is determined by spec_id and partition fields only.
        """
        if not isinstance(other, PartitionSpec):
            return False
        return self.spec_id == other.spec_id and self.fields == other.fields

kevinjqliu · 2024-02-07T21:32:24Z

tests/integration/test_rest_schema.py

+    expected_sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))
+    assert tbl.schema() == expected_schema
+    assert tbl.spec() == expected_spec
+    assert tbl.sort_order() == expected_sort_order


SortOrder doesn't seem to have a __eq__ function defined.

iceberg-python/pyiceberg/table/sorting.py

Lines 127 to 164 in cec051f

class SortOrder(IcebergBaseModel):
    """Describes how the data is sorted within the table.

    Users can sort their data within partitions by columns to gain performance.

    The order of the sort fields within the list defines the order in which the sort is applied to the data.

    Args:
      fields (List[SortField]): The fields how the table is sorted.

    Keyword Args:
      order_id (int): An unique id of the sort-order of a table.
    """

    order_id: int = Field(alias="order-id", default=INITIAL_SORT_ORDER_ID)
    fields: List[SortField] = Field(default_factory=list)

    def __init__(self, *fields: SortField, **data: Any):
        if fields:
            data["fields"] = fields
        super().__init__(**data)

    @property
    def is_unsorted(self) -> bool:
        return len(self.fields) == 0

    def __str__(self) -> str:
        """Return the string representation of the SortOrder class."""
        result_str = "["
        if self.fields:
            result_str += "\n  " + "\n  ".join([str(field) for field in self.fields]) + "\n"
        result_str += "]"
        return result_str

    def __repr__(self) -> str:
        """Return the string representation of the SortOrder class."""
        fields = f"{', '.join(repr(column) for column in self.fields)}, " if self.fields else ""
        return f"SortOrder({fields}order_id={self.order_id})"

What is the behavior here?

I believe it would invoke the eq of IcebergBaseModel -> BaseModel, which checks the equality of all of the fields of the BaseModel
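
A rough illustration of that fallback, assuming (as the reply suggests) that the comparison ends up in pydantic's `BaseModel.__eq__`. `ToySortField` and `ToySortOrder` are hypothetical stand-ins and not the PyIceberg classes; the field name `sort_fields` is likewise an assumption made for this sketch.

```python
from typing import List

from pydantic import BaseModel, Field


class ToySortField(BaseModel):
    source_id: int


class ToySortOrder(BaseModel):
    order_id: int = 1
    sort_fields: List[ToySortField] = Field(default_factory=list)


expected = ToySortOrder(sort_fields=[ToySortField(source_id=2)])
same = ToySortOrder(sort_fields=[ToySortField(source_id=2)])
stale = ToySortOrder(sort_fields=[ToySortField(source_id=4)])
other_order = ToySortOrder(order_id=2, sort_fields=[ToySortField(source_id=2)])

# No __eq__ is defined on the toy classes, so pydantic's field-wise
# comparison decides the result of each check.
same_matches = expected == same
stale_matches = expected == stale
other_order_matches = expected == other_order
```

In this sketch only `same` compares equal to `expected`: a stale `source_id` or a different `order_id` is enough to make the comparison fail, which is the behavior the test assertion on `tbl.sort_order()` relies on.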

sungwy · 2024-02-07T21:46:47Z

some comments on equality check.

Also, it looks like we are reassigning ids from 1. I'm not familiar with this process so bear with me. Is there a reason why we can't create the new table with the original schema, starting at id 4 & 5?

I think we inherit this logic from the Java code. When a new table is created, it is treated as a brand-new table, so we just assign IDs in pre-order traversal order.

HonahX

LGTM! Thanks @syun64 for working on this and @kevinjqliu for reviewing.

Also, it looks like we are reassigning ids from 1. I'm not familiar with this process so bear with me. Is there a reason why we can't create the new table with the original schema, starting at id 4 & 5?

Adding to @syun64's explanation, doing this in the RestCatalog client is the safer option in case the server implementation does not do this by default: #327 (review)

Fokko

Good catch @syun64 thanks for fixing this 👍

also update partition spec and sort order when fresh schema is created

4e13a5d

sungwy requested a review from Fokko February 7, 2024 20:57

fresh-schema

70958ce

kevinjqliu reviewed Feb 7, 2024

sungwy added 2 commits February 7, 2024 21:22

create table integrity test

cce8a7b

undo test change

4e4d2b6

sungwy requested a review from kevinjqliu February 7, 2024 21:24

kevinjqliu reviewed Feb 7, 2024

HonahX approved these changes Feb 8, 2024

Fokko approved these changes Feb 8, 2024

Fokko merged commit ef33b9d into apache:main Feb 8, 2024
6 checks passed
