Unable to load an iceberg table from aws glue catalog #515

Closed
arookieds opened this issue Mar 11, 2024 · 18 comments

Comments

@arookieds

Question

PyIceberg version: 0.6.0
Python version: 3.11.1

Comments:

  • Iceberg tables are saved in an AWS Glue catalog
  • The catalog, the list of namespaces, and the list of tables are all retrievable through the catalog API

Hi,

I am facing issues loading Iceberg tables from AWS Glue.
The code I am using is as follows:

from opensea.resources.resources import *
import pyiceberg.catalog
    
profile_name = "saml2aws_profile_name"
catalog_name = "catalog name"
table_name = "table name"
aws_region = "aws region"

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": profile_name}
)

print(catalog.list_namespaces())

table = catalog.load_table((catalog_name, table_name))

The code allows me to:

  • list namespaces
  • list tables

But load_table throws the following error:

Traceback (most recent call last):
  File "/path/to/the/project/testing.py", line 15, in <module>
    table = catalog.load_table((catalog_name, table_name))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
    return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
    metadata = FromInputFile.table_metadata(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
    with input_file.open() as input_stream:
         ^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
    input_file = self._filesystem.open_input_file(self._path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

I have checked that I have the proper access rights, but that wasn't the issue.
I have tried a few other things, but they were all unsuccessful:

  • using load_glue instead of load_catalog
  • providing access_key and secret_key directly in the load_catalog call

The table definition is as follows and was created via Trino:

create table catalog_name.table_name (
          "timestamp" timestamp,
          "type" varchar(20),
          distribution int,
          service int,
          code varchar(20),
          base_id bigint,
          counter_id bigint,
          "category" varchar(50),
          volume double)
        with (
          format = 'PARQUET',
          partitioning = ARRAY['day(timestamp)'],
          location = 's3://s3_bucket/path/to/table/folder/'
        )
@kevinjqliu
Contributor

OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

This seems to be an issue with reading the metadata file.
Specifically, this line:

metadata = FromInputFile.table_metadata(file)

What is the metadata_location of the table in the Glue catalog?
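
If it helps, one quick way to check that value outside pyiceberg is to ask Glue directly. A minimal sketch with boto3; the profile, database, and table names below are placeholders:

import boto3

# Placeholders: substitute your own profile, Glue database, and table names.
session = boto3.Session(profile_name="saml2aws_profile_name")
glue = session.client("glue")

response = glue.get_table(DatabaseName="my_database", Name="my_table")
print(response["Table"]["Parameters"].get("metadata_location"))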

@arookieds
Author

Glue points to that same file:
[screenshot from the Glue console, 2024-03-15]

I have tried reading this table using PySpark, and it worked.
Nevertheless, PySpark isn't the ideal solution for my case.

@kevinjqliu
Contributor

If it works in PySpark, it's probably not the Glue configuration but something in pyiceberg.

Can you double-check the AWS settings? Your AWS profile looks like it can access the Glue catalog and read its content. Does it have permission to read the underlying s3 file?

Secondly,

OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

That S3 path looks fishy to me, especially the prefix path/to/s3/table/location/metadata/ and the missing s3://. We can also check whether the PyArrow filesystem is parsing the metadata_location correctly.
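
One way to sanity-check that last point is to let PyArrow parse the location itself. A minimal sketch, using the (redacted) bucket and key from the error above:

from pyarrow import fs

# Redacted metadata_location from the Glue table above.
metadata_location = "s3://s3_bucket_name/path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json"

# from_uri returns the filesystem plus the path PyArrow will actually request;
# for s3:// URIs the returned path should be '<bucket>/<key>' with no scheme.
filesystem, path = fs.FileSystem.from_uri(metadata_location)
print(type(filesystem).__name__)  # expect S3FileSystem
print(path)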

@arookieds
Author

Can you double-check the AWS settings? Your AWS profile looks like it can access the Glue catalog and read its content. Does it have permission to read the underlying s3 file?

Yes, the profile I am using can access the underlying files in S3.

That S3 path looks fishy to me, especially the prefix path/to/s3/table/location/metadata/ and the missing s3://. We can also check whether the PyArrow filesystem is parsing the metadata_location correctly.

The path I am using does indeed start with s3://.

@kevinjqliu
Contributor

The load_table operation is doing a couple of different things.
Let's verify each step.

Getting the "glue table" object, using the _get_glue_table function

import pyiceberg.catalog
from pyiceberg.exceptions import NoSuchTableError

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": profile_name}
)

identifier = (catalog_name, table_name)  # the same identifier passed to load_table
identifier_tuple = catalog.identifier_to_tuple_without_catalog(identifier)
database_name, table_name = catalog.identifier_to_database_and_table(identifier_tuple, NoSuchTableError)
glue_table = catalog._get_glue_table(database_name=database_name, table_name=table_name)
print(glue_table)

Look at glue table metadata location

properties = glue_table["Parameters"]
METADATA_LOCATION = "metadata_location"
metadata_location = properties[METADATA_LOCATION]
print(metadata_location)

Load the metadata file, check the io implementation

from pyiceberg.io import load_file_io
from pyiceberg.serializers import FromInputFile

io = load_file_io(properties=catalog.properties, location=metadata_location)
print(io)
file = io.new_input(metadata_location)
print(file)
metadata = FromInputFile.table_metadata(file)
print(metadata)

@kevinjqliu
Contributor

apache/iceberg#6820

A similar-sounding issue.

@geruh
Contributor

geruh commented Apr 1, 2024

Your Glue calls look fine, but your S3 calls are the problem. I was able to reproduce the issue by having the incorrect region for my AWS profile in ~/.aws/config and then passing in a different region config when initializing the catalog.

aws_config

[test]
region = us-east-1

catalog init

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
)

Which leads to this exception

File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'test/metadata/00000-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.

It looks like, when we infer the correct FileIO, the PyArrow filesystem doesn't use the AWS profile config, so the calls might be delegated to the default profile instead.

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": self.properties.get(S3_ACCESS_KEY_ID),
            "secret_key": self.properties.get(S3_SECRET_ACCESS_KEY),
            "session_token": self.properties.get(S3_SESSION_TOKEN),
            "region": self.properties.get(S3_REGION),
        }

        if proxy_uri := self.properties.get(S3_PROXY_URI):
            client_kwargs["proxy_options"] = proxy_uri

        if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
            client_kwargs["connect_timeout"] = float(connect_timeout)

        return S3FileSystem(**client_kwargs)

We might need to feed the credentials into the session properties before inferring the FileIO in the GlueCatalog, so that we actually use the correct profile when reading from S3. For now, you should be able to work around this by ensuring the profile's region is in sync with the config passed into the catalog, or by passing the s3.region property into the catalog.
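
For instance, a minimal sketch of that second workaround (catalog and profile names are placeholders, and the s3.region value must match the bucket's actual region):

import pyiceberg.catalog

catalog = pyiceberg.catalog.load_catalog(
    "catalog_name",
    **{
        "type": "glue",
        "profile_name": "saml2aws_profile_name",
        # Pin the S3 region so PyArrow's S3FileSystem does not fall back to
        # whatever the default profile resolves.
        "s3.region": "us-west-2",
    },
)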

Edit: just saw the message above; the fix is also there.

@kevinjqliu
Contributor

@geruh thanks for the explanation! Would you say this is a bug in how pyiceberg configures S3? I'm not familiar with the AWS profile config. It seems like if a profile config is passed in, we don't want to override other S3 options, such as region in this case.

@geruh
Contributor

geruh commented Apr 2, 2024

No Problem!!

This could potentially be a bug if we assume that the catalog and the FileIO (S3) share the same AWS profile configs. On one side, having a single profile configuration is convenient for the user's boto client, as it allows initializing all AWS clients with the correct credentials. On the other hand, we could argue that this configuration should only apply at the catalog level, and that filesystems might require separate configuration. I'm inclined towards the first option. However, we are using pyarrow's S3FileSystem implementation, which has no concept of an AWS profile. Therefore, we would need to initialize these values through boto's session.get_credentials() and pass them to the filesystem.

I'll raise an issue for this
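
In the meantime, the same idea can be approximated at the caller level: resolve the profile's credentials with boto3 and hand them to the FileIO explicitly. A rough sketch with placeholder names, assuming the profile resolves to valid credentials:

import boto3
import pyiceberg.catalog

session = boto3.Session(profile_name="saml2aws_profile_name")
# Assumes the profile resolves to valid credentials.
creds = session.get_credentials().get_frozen_credentials()

catalog = pyiceberg.catalog.load_catalog(
    "catalog_name",
    **{
        "type": "glue",
        "profile_name": "saml2aws_profile_name",
        # Feed the resolved credentials to the S3 FileIO explicitly.
        "s3.access-key-id": creds.access_key,
        "s3.secret-access-key": creds.secret_key,
        "s3.session-token": creds.token,
        "s3.region": session.region_name or "us-west-2",
    },
)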

@kevinjqliu
Contributor

thank you! should we close this in favor of #570?

@arookieds
Author

I have tried both solutions, i.e.:

  • setting the env variable to the proper AWS region
  • providing it within the function call

But I am always getting the same error:
Traceback (most recent call last):
  File "/path/to/the/project/testing.py", line 15, in <module>
    table = catalog.load_table((catalog_name, table_name))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
    return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
    metadata = FromInputFile.table_metadata(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
    with input_file.open() as input_stream:
         ^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
    input_file = self._filesystem.open_input_file(self._path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

@geruh
Contributor

geruh commented Apr 15, 2024

Interesting. Can you run aws sts get-caller-identity in the terminal to ensure the right identity is being used?

You can also explicitly set the S3FileIO by passing the S3 configuration properties into the catalog:

    catalog = pyiceberg.catalog.load_catalog(catalog_name,
                                             **{
                                                 "type": "glue",
                                                 "profile_name": profile_name,
                                                 "s3.access-key-id": "access-key",
                                                 "s3.secret-access-key": "secret-access-key",
                                                 "s3.region": "us-east-1"
                                             })

@hamzaezzi

Interesting. Can you run aws sts get-caller-identity in the terminal to ensure the right identity is being used?

You can also explicitly set the S3FileIO by passing the S3 configuration properties into the catalog:

    catalog = pyiceberg.catalog.load_catalog(catalog_name,
                                             **{
                                                 "type": "glue",
                                                 "profile_name": profile_name,
                                                 "s3.access-key-id": "access-key",
                                                 "s3.secret-access-key": "secret-access-key",
                                                 "s3.region": "us-east-1"
                                             })

This worked for me when I also added the session token information for S3:

catalog = load_catalog(
    "default",
    **{
        "type": "glue",
        "aws_access_key_id": "ASAXXXXXXXXXX",
        "aws_secret_access_key": "0VLxnXXXXXXXXXXX",
        "aws_session_token": "IQJb3JpXXXXXXXXXXXXXXXXXXXXXXXX",
        "s3.access-key-id": "ASAXXXXXXXXXX",
        "s3.secret-access-key": "0VLxnXXXXXXXXXXX",
        "s3.session-token": "IQJb3JpXXXXXXXXXXXXXXXXXXXXXXXX",
        "s3.region": "eu-west-1",
        "region_name": "eu-west-1",
    },
)

@anatol-ju

We have the same problem here. My manager and I tried to get it to work in parallel, and we both ran into the same error. We assumed it was a permission issue, but even with admin credentials it didn't work. We used access tokens, tried to set the region manually, and provided the AWS profile name and, alternatively, the access keys.
No success.

My guess is that it has something to do with the s3fs package used to read the metadata file.

@impproductions

impproductions commented Jul 4, 2024

We had the same problem within our Airflow deployment. The easy fix for us would have been to set the default aws credentials through environment variables:

AWS_ACCESS_KEY_ID=<aws access key>
AWS_SECRET_ACCESS_KEY=<aws secret key>
AWS_DEFAULT_REGION=<aws region>

This, however, wasn't feasible because of deployment issues.
Long story short, we ended up with this solution:

glue_catalog_conf = {
    "s3.region": <aws region>,
    "s3.access-key-id": <aws access key>,
    "s3.secret-access-key": <aws secret key>,
    "region_name": <aws region>,
    "aws_access_key_id": <aws access_key>,
    "aws_secret_access_key": <aws secret key>,
}

catalog: GlueCatalog = load_catalog(
    "some_name",
    type="glue",
    **glue_catalog_conf
)

If you came here from a Google search, please take everything that follows with a grain of salt, because we have no previous experience with either pyiceberg or Airflow. Anyway.

We came to this conclusion (that we needed to pass both formats) because it seems that the boto client initialization expects one format (the second set in the above snippet):

class GlueCatalog(Catalog):
    def __init__(self, name: str, **properties: Any):
        super().__init__(name, **properties)

        session = boto3.Session(
            profile_name=properties.get("profile_name"),
            region_name=properties.get("region_name"),
            botocore_session=properties.get("botocore_session"),
            aws_access_key_id=properties.get("aws_access_key_id"),
            aws_secret_access_key=properties.get("aws_secret_access_key"),
            aws_session_token=properties.get("aws_session_token"),
        )
        self.glue: GlueClient = session.client("glue")

And the same set of properties is passed to the load_file_io pyiceberg function, which, to the extent of our very limited understanding, expects the other format (the s3.* properties):

io = load_file_io(properties=self.properties, location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)
return Table(
    identifier=(self.name, database_name, table_name),
    metadata=metadata,
    metadata_location=metadata_location,
    io=self._load_file_io(metadata.properties, metadata_location),
    catalog=self,
)

We might be completely off base here, of course, and what ultimately convinced us to adopt the above solution is just that it works, while passing either set of credentials without the other wouldn't work for us.

We're using:

aiobotocore==2.13.1
boto3==1.34.51
botocore==1.34.131
[...]
pyiceberg==0.6.1

We're still unclear on whether this is indeed a bug or we're just using the APIs improperly; any help would be appreciated.

Have a nice day!

@kevinjqliu
Contributor

@impproductions Thanks for the detailed explanation. Great catch!

Looking through the code, there's indeed an expectation for both AWS credential formats:

  • s3.access-key-id vs aws_access_key_id
  • s3.secret-access-key vs aws_secret_access_key

This issue exists for both the Glue and DynamoDB catalogs:
https://github.com/search?q=repo%3Aapache%2Ficeberg-python+aws_secret_access_key+path%3A.py+-path%3Atests&type=code
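
Until that is resolved, one way to keep the two formats in sync is a small helper along these lines (a hypothetical sketch, not part of the pyiceberg API):

from typing import Dict, Optional

def dual_format_aws_properties(access_key: str, secret_key: str, region: str,
                               token: Optional[str] = None) -> Dict[str, str]:
    # Duplicate one set of AWS credentials into both the boto-style keys read
    # by GlueCatalog and the s3.* keys read by the PyArrow-based FileIO.
    props = {
        "type": "glue",
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region_name": region,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
        "s3.region": region,
    }
    if token:
        props["aws_session_token"] = token
        props["s3.session-token"] = token
    return props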

@kevinjqliu
Contributor

Opened #892 to track the issue with AWS credential formats

@kevinjqliu
Contributor

Fixed in #922
