Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deserialize NestedField initial-default and write-default Attributes #1432

Merged
merged 1 commit into from
Dec 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions pyiceberg/types.py
Original file line number Diff line number Diff line change
Expand Up @@ -328,8 +328,8 @@ def __init__(
data["type"] = data["type"] if "type" in data else field_type
data["required"] = required
data["doc"] = doc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious if we should do the same here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had that, but the tests seem to pass without it so I just kept it as is. Happy to change if you'd like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird, im trying to figure out when this bug occurs and see if its present in other places in the codebase

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know what you find, I tried digging into Pydantic to understand why it happens, but ran out of time.

Copy link
Contributor

@kevinjqliu kevinjqliu Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went down a rabbit hole, here's what I learned.

  1. Using __init__ with a pydantic model is kind of an anti-pattern; the pydantic model typically handles all the initialization/validation.
  2. __init__ here is to provide the ability to initialize NestedField with positional args, i.e. NestedField(1, "blah")

For backward compatibility, we can't change NestedField to not take positional args.
But we can make this __init__ (and other __init__s) more resilient to bugs with something like

    def __init__(self, *args, **kwargs):
        # implements `__init__` to support initialization with positional arguments
        if args:
            field_names = list(self.model_fields.keys()) # Gets all field names defined on the model
            if len(args) > len(field_names):
                raise TypeError(f"Too many positional arguments. Expected at most {len(field_names)}")
            kwargs.update(dict(zip(field_names, args))) # Maps positional args to field names
        # Let Pydantic handle aliases and validation
        super().__init__(**kwargs)

This doesn't check for when using both positional and keyword args, i.e. NestedField(1, field_id=10)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work, my only worry would be that ordinarily minor changes like changing field order in the class definition would break consumers in unexpected ways.

For example, this instantiation works fine:

from pyiceberg.types import NestedField, StringType
>>> NestedField(1, 'test', StringType(), True)
NestedField(field_id=1, name='test', field_type=StringType(), required=True)

If someone (for some reason) changed field order or adds a new field in the middle, for example:

diff --git a/pyiceberg/types.py b/pyiceberg/types.py
index bd0eb7a..f17a3d4 100644
--- a/pyiceberg/types.py
+++ b/pyiceberg/types.py
@@ -304,33 +304,22 @@ class NestedField(IcebergType):

     field_id: int = Field(alias="id")
     name: str = Field()
-    field_type: SerializeAsAny[IcebergType] = Field(alias="type")
     required: bool = Field(default=False)
+    field_type: SerializeAsAny[IcebergType] = Field(alias="type")

Then the instantiation breaks:

>>> NestedField(1, 'test', StringType(), True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pcichons/dev/code/cisco-sbgidm/iceberg-python/pyiceberg/types.py", line 322, in __init__
    super().__init__(**kwargs)
  File "/Users/pcichons/dev/code/cisco-sbgidm/iceberg-python/.venv/lib/python3.12/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for NestedField
required
  Input should be a valid boolean [type=bool_type, input_value=StringType(), input_type=StringType]
    For further information visit https://errors.pydantic.dev/2.10/v/bool_type
field_type
  Input should be a valid dictionary or instance of IcebergType [type=model_type, input_value=True, input_type=bool]
    For further information visit https://errors.pydantic.dev/2.10/v/model_type

Tests could catch that, but it might get a bit painful to deal with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a bit messy, but it enables positional arguments. Otherwise, you would always need to specify the keyword-arguments which is pretty verbose :) If we decide to change this, that's fine, but probably in a separate PR :)

data["initial-default"] = initial_default
data["write-default"] = write_default
data["initial-default"] = data["initial-default"] if "initial-default" in data else initial_default
data["write-default"] = data["write-default"] if "write-default" in data else write_default
super().__init__(**data)

def __str__(self) -> str:
Expand Down
29 changes: 29 additions & 0 deletions tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,35 @@ def table_schema_simple() -> Schema:
)


@pytest.fixture(scope="session")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wydt about adding an NestedField to table_schema_simple instead of creating a new Schema altogether?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can make that change and see what happens. I was hesitant because table_schema_simple seems to be used by lots of tests so wasn't sure of the implications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 if it becomes too tedious, the current test is fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were ~25 test failures when changing table_schema_simple, with some getting into the pyarrow schema conversions which were a bit more than a copy/paste change.

I can probably figure it out, but might take a bit longer until I find time.

def table_schema_with_full_nested_fields() -> Schema:
return schema.Schema(
NestedField(
field_id=1,
name="foo",
field_type=StringType(),
required=False,
doc="foo doc",
initial_default="foo initial",
write_default="foo write",
),
NestedField(
field_id=2, name="bar", field_type=IntegerType(), required=True, doc="bar doc", initial_default=42, write_default=43
),
NestedField(
field_id=3,
name="baz",
field_type=BooleanType(),
required=False,
doc="baz doc",
initial_default=True,
write_default=False,
),
schema_id=1,
identifier_field_ids=[2],
)


@pytest.fixture(scope="session")
def table_schema_nested() -> Schema:
return schema.Schema(
Expand Down
12 changes: 6 additions & 6 deletions tests/test_schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -421,17 +421,17 @@ def __getitem__(self, pos: int) -> Any:
assert inner_accessor.get(container) == "name"


def test_serialize_schema(table_schema_simple: Schema) -> None:
actual = table_schema_simple.model_dump_json()
expected = """{"type":"struct","fields":[{"id":1,"name":"foo","type":"string","required":false},{"id":2,"name":"bar","type":"int","required":true},{"id":3,"name":"baz","type":"boolean","required":false}],"schema-id":1,"identifier-field-ids":[2]}"""
def test_serialize_schema(table_schema_with_full_nested_fields: Schema) -> None:
actual = table_schema_with_full_nested_fields.model_dump_json()
expected = """{"type":"struct","fields":[{"id":1,"name":"foo","type":"string","required":false,"doc":"foo doc","initial-default":"foo initial","write-default":"foo write"},{"id":2,"name":"bar","type":"int","required":true,"doc":"bar doc","initial-default":42,"write-default":43},{"id":3,"name":"baz","type":"boolean","required":false,"doc":"baz doc","initial-default":true,"write-default":false}],"schema-id":1,"identifier-field-ids":[2]}"""
assert actual == expected


def test_deserialize_schema(table_schema_simple: Schema) -> None:
def test_deserialize_schema(table_schema_with_full_nested_fields: Schema) -> None:
actual = Schema.model_validate_json(
"""{"type": "struct", "fields": [{"id": 1, "name": "foo", "type": "string", "required": false}, {"id": 2, "name": "bar", "type": "int", "required": true}, {"id": 3, "name": "baz", "type": "boolean", "required": false}], "schema-id": 1, "identifier-field-ids": [2]}"""
"""{"type": "struct", "fields": [{"id": 1, "name": "foo", "type": "string", "required": false, "doc": "foo doc", "initial-default": "foo initial", "write-default": "foo write"}, {"id": 2, "name": "bar", "type": "int", "required": true, "doc": "bar doc", "initial-default": 42, "write-default": 43}, {"id": 3, "name": "baz", "type": "boolean", "required": false, "doc": "baz doc", "initial-default": true, "write-default": false}], "schema-id": 1, "identifier-field-ids": [2]}"""
)
expected = table_schema_simple
expected = table_schema_with_full_nested_fields
assert actual == expected


Expand Down