Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catalog fails to load table using the table's identifier #123

Closed
pdames opened this issue Nov 2, 2023 · 4 comments
Closed

Catalog fails to load table using the table's identifier #123

pdames opened this issue Nov 2, 2023 · 4 comments

Comments

@pdames
Copy link
Contributor

pdames commented Nov 2, 2023

Apache Iceberg version

main (development)

Please describe the bug 🐞

Reproduction Steps

  1. Load a catalog named test_catalog.
  2. Create a table named test_namespace.test_table table inside of it.
  3. Provide the following variations of this table's identifier to catalog.load_table, and ensure that the 3rd variation using table.identifier fails due to inclusion of the catalog's name (test_catalog) @ table.identifier[0]:
table = catalog.load_table("test_namespace.test_table")  # string identifier works...
table = catalog.load_table(("test_namespace", "test_table"))  # tuple identifier also works...
catalog.load_table(table.identifier)  # pyiceberg.exceptions.NoSuchTableError: NoSuchTableException: Table does not exist: test_catalog.test_namespace.test_table

A more interesting question that I'd like to open for discussion as part of this issue is the best way to fix it. For example, we could argue that the bug lies with catalog.load_table() for not handling the catalog name as part of the identifier. Alternatively, we could argue that table.identifier shouldn't include the catalog name in the first place.

As a workaround in my code for now, I've created a wrapper method for catalog.load_table(identifier) that scrubs the current catalog's name from the given identifier before forwarding it to catalog.load_table(identifier).

Environment Info
Iceberg REST Catalog running in the tabulario/iceberg-rest Docker container w/ a mock S3 Filesystem:

def catalog_properties() -> Dict[str, str]:
    return {
        "type": "rest",
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
    
load_catalog(name="test_catalog", **catalog_properties())
@danielcweeks
Copy link
Contributor

Other than the obvious edge case: loading a table from a catalog that supports 4-part namespaces and happens to have a namespace that matchings the first part.

I would propose that we just update load_catalog to check if the first part matches the current catalog (and has at least three parts) and remove it. The java implementation doesn't currently provide access to the identifier, so there's currently no equivalent problem. I believe the table also has reference to the catalog, so we could also check that the catalog is the correct instance of the catalog.

Thoughts @Fokko ?

@Fokko
Copy link
Contributor

Fokko commented Nov 3, 2023

Thanks for raising this @pdames

In hindsight, I think adding the catalog name to the identifier was a bad choice. We tried to mimic the behavior of Java, but I don't see any advantages of having this (since we have a reference to the catalog anyway).

If you want to refresh a table, you could also run:

table = catalog.load_table(("test_namespace", "test_table"))  # tuple identifier also works...
table.refresh()

However, I still think that your example should work as well.

Removing it will break existing behavior, so I think @danielcweeks's suggestion is best. We probably also want to fix this for the renaming/delete/etc scenario:

catalog.rename_table(table.identifier, 'database.new_table_name')
catalog.drop_table(table.identifier)
catalog.purge_table(table.identifier)

@pdames
Copy link
Contributor Author

pdames commented Nov 3, 2023

Thanks for the input @danielcweeks and @Fokko. I'll raise a PR to apply the fix recommended by @danielcweeks for review.

@pdames
Copy link
Contributor Author

pdames commented Dec 6, 2023

Resolved by #150

@pdames pdames closed this as completed Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants