-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate Redundant Identifier Support in TableIdentifier, and row_filter #994
Conversation
@sungwy Awesome work! Do you know if these changes could impact the multi-level namespace case? I believe it would be nice for us to have some integration tests with Polaris. I think it’s the only catalog that implements the nested namespace. |
Hi @ndrluis - thank you for bringing up this point. I'm of the opinion that having a 'representative' implementation of the REST catalog would be good enough to serve this purpose. Currently we are using a tabular rest catalog image for our tests, but I think it would be a good idea to explore other alternatives. I've created a new test to validate the behavior against the REST catalog in our integration test suite, but I believe we also have tests using multilevel namespaces in |
@@ -613,6 +613,11 @@ def update_namespace_properties( | |||
ValueError: If removals and updates have overlapping keys. | |||
""" | |||
|
|||
@deprecated( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deprecates the public function
@@ -627,6 +632,25 @@ def identifier_to_tuple_without_catalog(self, identifier: Union[str, Identifier] | |||
identifier_tuple = identifier_tuple[1:] | |||
return identifier_tuple | |||
|
|||
def _identifier_to_tuple_without_catalog(self, identifier: Union[str, Identifier]) -> Identifier: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where as this is now called by the PyIceberg functions and only prints the deprecation warning message if we are using the unsupported naming convention
I think Polaris is using the same concept of namespace as we do in PyIceberg. Here's my mental model name[0] == catalog name (catalog1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this change! I think it'll be beneficial to the overall project going forward.
Having a deprecation message for an entire minor version (0.8.0 -> 0.9.0) would give enough time to users to migrate.
|
||
|
||
@pytest.mark.integration | ||
def test_table_v1_with_null_nested_namespace(session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this test for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for testing against nested / multilevel namespaces, it was a helpful review suggestion from @ndrluis
help_message="Parsing expressions with table name is deprecated. Only provide field names in the row_filter.", | ||
) | ||
# TODO: Once this is removed, we will no longer take just the last index of parsed column result | ||
# And introduce support for parsing filter expressions with nested fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of this change. We should have done it a long time ago 👍 Thanks for doing this @sungwy
self.metadata = fresh.metadata | ||
self.io = fresh.io | ||
self.metadata_location = fresh.metadata_location | ||
return self | ||
|
||
@property | ||
def identifier(self) -> Identifier: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
identifier = "default.lower.table_v1_with_null_nested_namespace" | ||
tbl = _create_table(session_catalog, identifier, {"format-version": "1"}, [arrow_table_with_null]) | ||
assert tbl.format_version == 1, f"Expected v1, got: v{tbl.format_version}" | ||
print(session_catalog.list_tables("default")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this print before merging
Merged! 🙌 Thank you for the reviews @ndrluis @kevinjqliu and @Fokko ! |
Today in PyIceberg, we have support for identifier parsing in public APIs
belonging to two different classes:
These APIs currently have optional support for the identifier that the
instance itself belongs to.
For example, the catalog class APIs support:
But it also supports:
catalog.load_table(“animals.cats.whiskers”)
Which is redundant, because the catalog.name is already “animals”.
Similarly, row_filter in the Table scan API supports:
But we also support
table.scan(row_filter=”whiskers.n_legs == 4”)
Which is also redundant, because the table name is already “whiskers” (or
cats.whiskers)
The benefits of this change are as follows:
is redundant
leads to issues like: Rest Catalog:
catalog.name
should not be part of namespace #742Rest Catalog:
catalog.name
should not be part of namespace #742 It would be reallygreat to clean this up before as we prepare for a 1.0.0 later this year
correctness issues if there exists a name in the level below as the
instance-level identifier.
named “animals.lower” catalog.load(“animals.lower.cats”) can be construed
as table name “cats” in the namespace “animals.lower” but it will be
interpreted as table name “cats” in the namespace “lower” which is
erroneous.
Field name parsing is already complicated because we have to represent
nested fields as flat representations. So it would be great to remove one
unnecessary level of complication here
We had opened the mail list thread in order to discuss the potential impact of this change, but it didn't get much traction. Based on the very few responses we got from the Slack channel, there is support in making this change.
Note: This PR takes a much more conservative approach than the previous PR (#963) by marking the usage of these redundant identifiers with a deprecation message. In 0.8.0 this usage pattern will log DeprecationWarnings, and in 0.9.0 we will plan to remove these and clean up the impacted functions.
Mail list Discussion: https://lists.apache.org/thread/9zr19hxnbt3hg7lt55t6dfg6otv7zjz2