Deserialize NestedField initial-default and write-default Attributes #1432

paulcichonski · 2024-12-15T19:50:48Z

Ensures that these attributes are correctly applied to the NestedField when reading an Iceberg schema json file.
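For context, a minimal stdlib-only sketch of the behaviour this PR is after: the hyphenated JSON keys `initial-default` and `write-default` should end up on the deserialized field. `FieldSketch`, `ALIASES`, and `field_from_json` are hypothetical stand-ins for illustration, not pyiceberg APIs; the real `NestedField` is a Pydantic model that handles this through field aliases.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


# Hypothetical stand-in for NestedField; not the pyiceberg class.
@dataclass
class FieldSketch:
    field_id: int
    name: str
    field_type: str
    required: bool = False
    initial_default: Optional[Any] = None
    write_default: Optional[Any] = None


# JSON key (as written in a schema file) -> Python attribute name.
ALIASES = {
    "id": "field_id",
    "type": "field_type",
    "initial-default": "initial_default",
    "write-default": "write_default",
}


def field_from_json(raw: str) -> FieldSketch:
    """Parse one schema field, renaming aliased keys to attribute names."""
    data = json.loads(raw)
    # Keys without an alias (name, required) pass through unchanged.
    kwargs = {ALIASES.get(key, key): value for key, value in data.items()}
    return FieldSketch(**kwargs)


raw = (
    '{"id": 1, "name": "foo", "type": "string", "required": false, '
    '"initial-default": "a", "write-default": "b"}'
)
field = field_from_json(raw)
```

If the two default keys were dropped while reading, `field.initial_default` and `field.write_default` would silently come back as `None`, which is the symptom reported in #1431.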

kevinjqliu · 2024-12-16T15:06:18Z

tests/conftest.py

@@ -149,6 +149,35 @@ def table_schema_simple() -> Schema:
    )


+@pytest.fixture(scope="session")


nit: wdyt about adding a NestedField to table_schema_simple instead of creating a new Schema altogether?

Sure, I can make that change and see what happens. I was hesitant because table_schema_simple seems to be used by lots of tests, so I wasn't sure of the implications.

+1 if it becomes too tedious, the current test is fine

There were ~25 test failures when changing table_schema_simple, with some getting into the pyarrow schema conversions which were a bit more than a copy/paste change.

I can probably figure it out, but might take a bit longer until I find time.

kevinjqliu · 2024-12-16T15:08:22Z

pyiceberg/types.py

@@ -328,8 +328,8 @@ def __init__(
        data["type"] = data["type"] if "type" in data else field_type
        data["required"] = required
        data["doc"] = doc


curious if we should do the same here too

I originally had that, but the tests seem to pass without it so I just kept it as is. Happy to change if you'd like.

Weird. I'm trying to figure out when this bug occurs and whether it's present in other places in the codebase.

Let me know what you find, I tried digging into Pydantic to understand why it happens, but ran out of time.

I went down a rabbit hole, here's what I learned.

Using __init__ with a pydantic model is kind of an anti-pattern; the pydantic model typically handles all the initialization/validation.

The __init__ here exists so that NestedField can be initialized with positional args, i.e. NestedField(1, "blah")

For backward compatibility, we can't change NestedField to not take positional args.
But we can make this __init__ (and other __init__s) more resilient to bugs with something like

    def __init__(self, *args, **kwargs):
        # implements `__init__` to support initialization with positional arguments
        if args:
            field_names = list(self.model_fields.keys())  # Gets all field names defined on the model
            if len(args) > len(field_names):
                raise TypeError(f"Too many positional arguments. Expected at most {len(field_names)}")
            kwargs.update(dict(zip(field_names, args)))  # Maps positional args to field names
        # Let Pydantic handle aliases and validation
        super().__init__(**kwargs)

This doesn't check for the case where the same field is passed both positionally and by keyword, i.e. NestedField(1, field_id=10)
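One way that missing check could look, as a standalone sketch rather than a proposed patch. `merge_positional`, `FIELD_NAMES`, and `FIELD_ALIASES` are hypothetical names; in the real class the field names would come from the model itself, and the alias map here is only an assumption based on the `Field(alias=...)` declarations.

```python
from typing import Any, Dict, Mapping, Optional, Sequence, Tuple


def merge_positional(
    field_names: Sequence[str],
    args: Tuple[Any, ...],
    kwargs: Mapping[str, Any],
    aliases: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Map positional args onto field names, rejecting duplicates."""
    aliases = aliases or {}
    if len(args) > len(field_names):
        raise TypeError(f"Too many positional arguments. Expected at most {len(field_names)}")
    merged = dict(kwargs)
    for name, value in zip(field_names, args):
        # The same field may also arrive by keyword under its name or its alias.
        if name in kwargs or aliases.get(name) in kwargs:
            raise TypeError(f"Got multiple values for argument '{name}'")
        merged[name] = value
    return merged


# Hypothetical field order and aliases, for illustration only.
FIELD_NAMES = ["field_id", "name", "field_type", "required"]
FIELD_ALIASES = {"field_id": "id", "field_type": "type"}

merged = merge_positional(FIELD_NAMES, (1, "blah"), {"required": True}, FIELD_ALIASES)
```

With this in place, a call shaped like NestedField(1, field_id=10) would raise a TypeError instead of letting the positional value silently overwrite the keyword one.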

That could work, my only worry would be that ordinarily minor changes like changing field order in the class definition would break consumers in unexpected ways.

For example, this instantiation works fine:

    >>> from pyiceberg.types import NestedField, StringType
    >>> NestedField(1, 'test', StringType(), True)
    NestedField(field_id=1, name='test', field_type=StringType(), required=True)

If someone (for some reason) changed the field order or added a new field in the middle, for example:

    diff --git a/pyiceberg/types.py b/pyiceberg/types.py
    index bd0eb7a..f17a3d4 100644
    --- a/pyiceberg/types.py
    +++ b/pyiceberg/types.py
    @@ -304,33 +304,22 @@ class NestedField(IcebergType):
         field_id: int = Field(alias="id")
         name: str = Field()
    -    field_type: SerializeAsAny[IcebergType] = Field(alias="type")
         required: bool = Field(default=False)
    +    field_type: SerializeAsAny[IcebergType] = Field(alias="type")

Then the instantiation breaks:

    >>> NestedField(1, 'test', StringType(), True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/pcichons/dev/code/cisco-sbgidm/iceberg-python/pyiceberg/types.py", line 322, in __init__
        super().__init__(**kwargs)
      File "/Users/pcichons/dev/code/cisco-sbgidm/iceberg-python/.venv/lib/python3.12/site-packages/pydantic/main.py", line 214, in __init__
        validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    pydantic_core._pydantic_core.ValidationError: 2 validation errors for NestedField
    required
      Input should be a valid boolean [type=bool_type, input_value=StringType(), input_type=StringType]
        For further information visit https://errors.pydantic.dev/2.10/v/bool_type
    field_type
      Input should be a valid dictionary or instance of IcebergType [type=model_type, input_value=True, input_type=bool]
        For further information visit https://errors.pydantic.dev/2.10/v/model_type

Tests could catch that, but it might get a bit painful to deal with.
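A rough idea of what such a guard test might look like, using a plain dataclass as a hypothetical stand-in (`FieldSketch`, `ReorderedSketch`, and `EXPECTED_POSITIONAL_ORDER` are illustrative names only). For the real Pydantic model, the names would presumably be read from `model_fields`, as in the `__init__` snippet above.

```python
from dataclasses import dataclass, fields
from typing import Tuple


# Hypothetical stand-in with the field order callers rely on.
@dataclass
class FieldSketch:
    field_id: int
    name: str
    field_type: str
    required: bool = False


# Same fields, but with field_type moved after required (the change from the diff above).
@dataclass
class ReorderedSketch:
    field_id: int
    name: str
    required: bool
    field_type: str


# The order that positional callers depend on; treated as part of the public API.
EXPECTED_POSITIONAL_ORDER = ("field_id", "name", "field_type", "required")


def positional_order(cls: type) -> Tuple[str, ...]:
    """Return the declared field names in definition order."""
    return tuple(f.name for f in fields(cls))


def test_positional_order_is_stable() -> None:
    actual = positional_order(FieldSketch)
    assert actual == EXPECTED_POSITIONAL_ORDER, f"field order changed: {actual}"


test_positional_order_is_stable()
```

A test like this fails with a readable message as soon as someone reorders the fields, rather than surfacing later as a validation error at a call site.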

Yes, this is a bit messy, but it enables positional arguments. Otherwise, you would always need to specify keyword arguments, which is pretty verbose :) If we decide to change this, that's fine, but probably in a separate PR :)

kevinjqliu

LGTM!

kevinjqliu · 2024-12-17T17:18:23Z

Thanks for the great catch @paulcichonski

Deserialize initial-default and write-default

6ba2b56

paulcichonski mentioned this pull request Dec 15, 2024

Schema Deserialization Ignores Field initial-default and write-default Values #1431

Closed

kevinjqliu reviewed Dec 16, 2024

Fokko approved these changes Dec 17, 2024

kevinjqliu approved these changes Dec 17, 2024

kevinjqliu merged commit b0ea716 into apache:main Dec 17, 2024
7 checks passed

paulcichonski deleted the fix-nested-types-deser branch December 17, 2024 18:04

sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 24, 2024

Deserialize initial-default and write-default (apache#1432)

a4a1b7a
