-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deserialize NestedField initial-default and write-default Attributes #1432
Conversation
Ensures that these attributes are correctly applied to the NestedField when reading an Iceberg schema json file.
@@ -149,6 +149,35 @@ def table_schema_simple() -> Schema: | |||
) | |||
|
|||
|
|||
@pytest.fixture(scope="session") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wydt about adding an NestedField to table_schema_simple instead of creating a new Schema altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can make that change and see what happens. I was hesitant because table_schema_simple
seems to be used by lots of tests so wasn't sure of the implications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 if it becomes too tedious, the current test is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were ~25 test failures when changing table_schema_simple
, with some getting into the pyarrow schema conversions which were a bit more than a copy/paste change.
I can probably figure it out, but might take a bit longer until I find time.
@@ -328,8 +328,8 @@ def __init__( | |||
data["type"] = data["type"] if "type" in data else field_type | |||
data["required"] = required | |||
data["doc"] = doc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious if we should do the same here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had that, but the tests seem to pass without it so I just kept it as is. Happy to change if you'd like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird, im trying to figure out when this bug occurs and see if its present in other places in the codebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know what you find, I tried digging into Pydantic to understand why it happens, but ran out of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went down a rabbit hole, here's what I learned.
- Using
__init__
with a pydantic model is kind of an anti-pattern; the pydantic model typically handles all the initialization/validation. __init__
here is to provide the ability to initializeNestedField
with positional args, i.e.NestedField(1, "blah")
For backward compatibility, we can't change NestedField
to not take positional args.
But we can make this __init__
(and other __init__
s) more resilient to bugs with something like
def __init__(self, *args, **kwargs):
# implements `__init__` to support initialization with positional arguments
if args:
field_names = list(self.model_fields.keys()) # Gets all field names defined on the model
if len(args) > len(field_names):
raise TypeError(f"Too many positional arguments. Expected at most {len(field_names)}")
kwargs.update(dict(zip(field_names, args))) # Maps positional args to field names
# Let Pydantic handle aliases and validation
super().__init__(**kwargs)
This doesn't check for when using both positional and keyword args, i.e. NestedField(1, field_id=10)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could work, my only worry would be that ordinarily minor changes like changing field order in the class definition would break consumers in unexpected ways.
For example, this instantiation works fine:
from pyiceberg.types import NestedField, StringType
>>> NestedField(1, 'test', StringType(), True)
NestedField(field_id=1, name='test', field_type=StringType(), required=True)
If someone (for some reason) changed field order or adds a new field in the middle, for example:
diff --git a/pyiceberg/types.py b/pyiceberg/types.py
index bd0eb7a..f17a3d4 100644
--- a/pyiceberg/types.py
+++ b/pyiceberg/types.py
@@ -304,33 +304,22 @@ class NestedField(IcebergType):
field_id: int = Field(alias="id")
name: str = Field()
- field_type: SerializeAsAny[IcebergType] = Field(alias="type")
required: bool = Field(default=False)
+ field_type: SerializeAsAny[IcebergType] = Field(alias="type")
Then the instantiation breaks:
>>> NestedField(1, 'test', StringType(), True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/pcichons/dev/code/cisco-sbgidm/iceberg-python/pyiceberg/types.py", line 322, in __init__
super().__init__(**kwargs)
File "/Users/pcichons/dev/code/cisco-sbgidm/iceberg-python/.venv/lib/python3.12/site-packages/pydantic/main.py", line 214, in __init__
validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for NestedField
required
Input should be a valid boolean [type=bool_type, input_value=StringType(), input_type=StringType]
For further information visit https://errors.pydantic.dev/2.10/v/bool_type
field_type
Input should be a valid dictionary or instance of IcebergType [type=model_type, input_value=True, input_type=bool]
For further information visit https://errors.pydantic.dev/2.10/v/model_type
Tests could catch that, but it might get a bit painful to deal with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a bit messy, but it enables positional arguments. Otherwise, you would always need to specify the keyword-arguments which is pretty verbose :) If we decide to change this, that's fine, but probably in a separate PR :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks for the great catch @paulcichonski |
Ensures that these attributes are correctly applied to the NestedField when reading an Iceberg schema json file.
Ensures that these attributes are correctly applied to the NestedField when reading an Iceberg schema json file.
Fixes #1431