schema_id not incremented during schema evolution #290
Comments
Hi @kevinjqliu. In Pyiceberg, the iceberg-python/pyiceberg/table/__init__.py Line 1871 in a56838d
I think what you observed in test_base.py is because the _commit_table in the in-memory catalog uses new_table_metadata instead of update_table_metadata to commit changes.
For example, this test verifies that the iceberg-python/tests/catalog/test_glue.py Lines 526 to 555 in a56838d
The I notice that you've already opened a PR for InMemoryCatalog and another for HiveCatalog. Thank you so much for the contribution! |
Thank you @HonahX In #289 I reimplemented Therefore the fix for Pulling your comment out of the InMemory Catalog PR
|
Somewhat related, I noticed that iceberg-python/pyiceberg/schema.py Lines 104 to 118 in a56838d
In Should I file this as a separate issue? |
I think this is the intended behavior. We consider two schemas equal if they share the same set of fields and identifier fields, as these factors define the table structure. In contrast, the schema-id relates more to when the schema was added to the table. Say if we want to check if two tables have the same structure, we expect Another example: in iceberg-python/pyiceberg/table/__init__.py Lines 1823 to 1831 in a56838d
You're right. For this test, we can remove the |
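To illustrate the equality semantics described in this comment, here is a minimal sketch. The classes ToyField and ToySchema and the helper same_schema_and_id are hypothetical stand-ins invented for this example; they are not PyIceberg's Schema and do not reproduce its implementation. The sketch only assumes what the comment states: equality compares fields and identifier fields, and ignores schema_id.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class ToyField:
    # Hypothetical field record: id, name, and a type name as a string.
    field_id: int
    name: str
    type_name: str


@dataclass(frozen=True)
class ToySchema:
    # Hypothetical stand-in for a schema class (not PyIceberg's Schema).
    fields: Tuple[ToyField, ...]
    identifier_field_ids: Tuple[int, ...] = ()
    # compare=False keeps schema_id out of the generated __eq__,
    # so two schemas with the same structure compare equal.
    schema_id: int = field(default=0, compare=False)


def same_schema_and_id(a: ToySchema, b: ToySchema) -> bool:
    # Stricter check for the few places (such as tests) that care
    # about both the structure and the schema_id.
    return a == b and a.schema_id == b.schema_id


v0 = ToySchema(fields=(ToyField(1, "id", "long"),), schema_id=0)
v1 = ToySchema(fields=(ToyField(1, "id", "long"),), schema_id=1)

assert v0 == v1                        # same structure, ids ignored
assert not same_schema_and_id(v0, v1)  # ids differ
```

Under these assumptions, a test that only asserts equality between two schemas would pass even when schema_id differs, so a test that cares about the id needs a separate assertion on it.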
Thanks for the explanation @HonahX The equality check ( I think it could be a common foot gun to use
Looking at this code, I'd assume it's asserting that the Couple options:
Option (1) is a big refactor and changes the assumption in a lot of places, i.e. in |
@kevinjqliu Thanks for sharing these options. I think (3) is enough here since this is just in test. There are only a few places which require checking equality of both fields and schema_id. If you think (2) will be more helpful in the future, you can add one in Do you want to include these in #289 ? |
Option (3) makes sense, I'll look for places where the I'll include the changes in a separate PR since #289 is already doing multiple things. |
Oof, I think this has been fixed already in #470 |
Awesome. Thank you @anupam-saini I have another follow-up PR that adds |
Apache Iceberg version
0.5.0 (latest release)
Please describe the bug 🐞
When updating the schema of an iceberg table (such as adding a column), the schema_id should be incremented. schema_id is incremented during schema evolution in the Java library but not in the Python library.
From the Iceberg spec
From the Java unit test TestTableMetadata.java
In particular, the newly created table schema has an id of 0 or TableMetadata.INITIAL_SCHEMA_ID (L1503)
The evolved schema after calling updateSchema updated the table schema id to 1 (L1520)
In comparison, from the Python unit test test_base.py
The original table schema id is 0, but even after calling update_schema()...commit(), the schema id remains 0 (L602 & L616)
Stacktrace:
In Java, the schema_id is incremented during schema evolution. (example1, example2)
In Python, this is done using the assign_fresh_schema_ids function (example1, example2)
However, this function does not increment the schema id. (source)
Note, the _get_and_increment function is used to increment the field id.
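The expected behavior described in this report (schema id 0 at creation, 1 after the first evolution) can be sketched as follows. This is a toy model, not the PyIceberg or Java implementation: VersionedSchema, ToyTableMetadata, and evolve_schema are hypothetical names, and the rule "new id = highest existing id + 1" is an assumption chosen to match the ids quoted from the Java test.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class VersionedSchema:
    # Hypothetical schema: just column names plus a schema id.
    field_names: Tuple[str, ...]
    schema_id: int = 0


@dataclass(frozen=True)
class ToyTableMetadata:
    # Hypothetical table metadata: all known schemas keyed by id,
    # plus the id of the schema currently in use.
    schemas: Dict[int, VersionedSchema]
    current_schema_id: int


def evolve_schema(
    metadata: ToyTableMetadata, new_field_names: Iterable[str]
) -> ToyTableMetadata:
    # Assumption: the evolved schema gets the highest existing id + 1.
    next_id = max(metadata.schemas) + 1
    new_schema = VersionedSchema(tuple(new_field_names), next_id)
    # Keep the old schemas and register the new one under its new id.
    schemas = {**metadata.schemas, next_id: new_schema}
    return ToyTableMetadata(schemas=schemas, current_schema_id=next_id)


initial = ToyTableMetadata(
    schemas={0: VersionedSchema(("id",), 0)}, current_schema_id=0
)
evolved = evolve_schema(initial, ["id", "name"])

assert initial.current_schema_id == 0  # original metadata is unchanged
assert evolved.current_schema_id == 1  # id incremented by the evolution
```

In this sketch the old schema stays in the metadata under id 0, so earlier schemas remain addressable after the evolution.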