Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deserialize NestedField initial-default and write-default Attributes #1432

Merged
merged 1 commit into from
Dec 17, 2024

Conversation

paulcichonski
Copy link
Contributor

Ensures that these attributes are correctly applied to the NestedField when reading an Iceberg schema json file.

Fixes #1431

Ensures that these attributes are correctly applied to the NestedField when reading an Iceberg schema json file.
@@ -149,6 +149,35 @@ def table_schema_simple() -> Schema:
)


@pytest.fixture(scope="session")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wydt about adding an NestedField to table_schema_simple instead of creating a new Schema altogether?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can make that change and see what happens. I was hesitant because table_schema_simple seems to be used by lots of tests so wasn't sure of the implications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 if it becomes too tedious, the current test is fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were ~25 test failures when changing table_schema_simple, with some getting into the pyarrow schema conversions which were a bit more than a copy/paste change.

I can probably figure it out, but might take a bit longer until I find time.

@@ -328,8 +328,8 @@ def __init__(
data["type"] = data["type"] if "type" in data else field_type
data["required"] = required
data["doc"] = doc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious if we should do the same here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had that, but the tests seem to pass without it so I just kept it as is. Happy to change if you'd like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird, im trying to figure out when this bug occurs and see if its present in other places in the codebase

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know what you find, I tried digging into Pydantic to understand why it happens, but ran out of time.

Copy link
Contributor

@kevinjqliu kevinjqliu Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went down a rabbit hole, here's what I learned.

  1. Using __init__ with a pydantic model is kind of an anti-pattern; the pydantic model typically handles all the initialization/validation.
  2. __init__ here is to provide the ability to initialize NestedField with positional args, i.e. NestedField(1, "blah")

For backward compatibility, we can't change NestedField to not take positional args.
But we can make this __init__ (and other __init__s) more resilient to bugs with something like

    def __init__(self, *args, **kwargs):
        # implements `__init__` to support initialization with positional arguments
        if args:
            field_names = list(self.model_fields.keys()) # Gets all field names defined on the model
            if len(args) > len(field_names):
                raise TypeError(f"Too many positional arguments. Expected at most {len(field_names)}")
            kwargs.update(dict(zip(field_names, args))) # Maps positional args to field names
        # Let Pydantic handle aliases and validation
        super().__init__(**kwargs)

This doesn't check for when using both positional and keyword args, i.e. NestedField(1, field_id=10)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work, my only worry would be that ordinarily minor changes like changing field order in the class definition would break consumers in unexpected ways.

For example, this instantiation works fine:

from pyiceberg.types import NestedField, StringType
>>> NestedField(1, 'test', StringType(), True)
NestedField(field_id=1, name='test', field_type=StringType(), required=True)

If someone (for some reason) changed field order or adds a new field in the middle, for example:

diff --git a/pyiceberg/types.py b/pyiceberg/types.py
index bd0eb7a..f17a3d4 100644
--- a/pyiceberg/types.py
+++ b/pyiceberg/types.py
@@ -304,33 +304,22 @@ class NestedField(IcebergType):

     field_id: int = Field(alias="id")
     name: str = Field()
-    field_type: SerializeAsAny[IcebergType] = Field(alias="type")
     required: bool = Field(default=False)
+    field_type: SerializeAsAny[IcebergType] = Field(alias="type")

Then the instantiation breaks:

>>> NestedField(1, 'test', StringType(), True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pcichons/dev/code/cisco-sbgidm/iceberg-python/pyiceberg/types.py", line 322, in __init__
    super().__init__(**kwargs)
  File "/Users/pcichons/dev/code/cisco-sbgidm/iceberg-python/.venv/lib/python3.12/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for NestedField
required
  Input should be a valid boolean [type=bool_type, input_value=StringType(), input_type=StringType]
    For further information visit https://errors.pydantic.dev/2.10/v/bool_type
field_type
  Input should be a valid dictionary or instance of IcebergType [type=model_type, input_value=True, input_type=bool]
    For further information visit https://errors.pydantic.dev/2.10/v/model_type

Tests could catch that, but it might get a bit painful to deal with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a bit messy, but it enables positional arguments. Otherwise, you would always need to specify the keyword-arguments which is pretty verbose :) If we decide to change this, that's fine, but probably in a separate PR :)

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kevinjqliu kevinjqliu merged commit b0ea716 into apache:main Dec 17, 2024
7 checks passed
@kevinjqliu
Copy link
Contributor

Thanks for the great catch @paulcichonski

@paulcichonski paulcichonski deleted the fix-nested-types-deser branch December 17, 2024 18:04
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 24, 2024
Ensures that these attributes are correctly applied to the NestedField when reading an Iceberg schema json file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Schema Deserialization Ignores Field initial-default and write-default Values
3 participants