On write operation, cast data to Iceberg Table's pyarrow schema #523
24e7da0 to 05e7444
This looks good to me, I left one comment but we can also defer that to a later PR.
pyiceberg/table/__init__.py
Outdated
_check_schema(self.schema(), other_schema=df.schema)
_check_schema_compatible(self.schema(), other_schema=df.schema)
# the two schemas are compatible so safe to cast
df = df.cast(self.schema().as_arrow())
Should _check_schema_compatible return a bool to indicate if the cast is needed? I'm not sure how costly the cast is. If we go from string to large_string then we might rewrite the Arrow buffers.
yea, I like that idea. _check_schema_compatible returns a boolean should_cast.
- If schema is exactly the same, return False and skip cast
- If schema is "compatible", return True and cast
- If schema is not "compatible", throws an error
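The three outcomes listed above can be sketched in plain Python. This is only an illustration, not pyiceberg's implementation: the helper name check_and_should_cast, the dict-based schemas, and the COMPATIBLE table are all hypothetical stand-ins.

```python
# Hypothetical sketch: a schema is modelled as a {column_name: type_name} dict.
# (incoming type, table type) pairs treated as castable. This set is an
# assumption made for illustration only.
COMPATIBLE = {("string", "large_string"), ("binary", "large_binary")}


def check_and_should_cast(table_schema: dict, df_schema: dict) -> bool:
    """Return False if the schemas are identical, True if a cast is needed.

    Raises ValueError if the schemas are not compatible.
    """
    if table_schema == df_schema:
        return False  # exactly the same: skip the cast
    if table_schema.keys() != df_schema.keys():
        raise ValueError("Mismatch in fields")
    for name, table_type in table_schema.items():
        df_type = df_schema[name]
        if df_type != table_type and (df_type, table_type) not in COMPATIBLE:
            raise ValueError(
                f"Incompatible type for {name!r}: {df_type} vs {table_type}"
            )
    return True  # compatible but not equal: cast
```

A single function that both returns a flag and raises has two failure-ish signals for the caller to reason about, which is part of what makes this shape awkward in practice.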
It was too complicated when _check_schema_compatible both returned a boolean and threw an error.
I ended up doing an extra comparison of the Arrow schemas outside the function and casting only if necessary.
Thanks for working on this @kevinjqliu
_check_schema(self.schema(), other_schema=df.schema)
_check_schema_compatible(self.schema(), other_schema=df.schema)
# cast if the two schemas are compatible but not equal
if self.schema().as_arrow() != df.schema:
nit: It would be good to call as_arrow() just once in case we need to cast.
Backport to 0.6.1
…ma (#559)
* Cast data to Iceberg Table's pyarrow schema (#523) Backport to 0.6.1
* use schema_to_pyarrow directly for backporting
* remove print in test
Co-authored-by: Kevin Liu
This PR resolves #520 by casting the incoming pyarrow Table to the same Schema as the Iceberg Table. This is safe since we first check for schema compatibility.
- _check_schema -> _check_schema_compatible
- Schema.as_arrow()