Unable to load an iceberg table from aws glue catalog #515
This seems to be an issue with reading the metadata file. See iceberg-python/pyiceberg/catalog/glue.py, line 320 in 781096e.
If it works in PySpark, it's probably not the Glue configuration but something in pyiceberg. Can you double-check the AWS settings? Your AWS profile looks like it can access the Glue catalog and read its content, but does it also have permission to read the underlying S3 file? Secondly, that S3 path looks fishy to me, especially the prefix.
Yes, the profile I am using can access the underlying files in S3.
The path I am using does indeed start with that prefix.
What I tried:
- Getting the "glue table" object
- Looking at the glue table metadata location
- Loading the metadata file and checking the io implementation
There is a similar-sounding issue.
Your Glue calls look fine, but your S3 calls are the problem. I was able to reproduce the issue by setting an incorrect region for my AWS profile in aws_config.
catalog init
Which leads to this exception.
It looks like when we infer the correct FileIO, the pyarrow filesystem doesn't utilize the AWS profile config, and might be delegating the calls to the default profile instead. iceberg-python/pyiceberg/io/pyarrow.py Lines 339 to 357 in 7fcdb8d
We might need to feed the credentials into the session properties before inferring the FileIO in the GlueCatalog, so that we actually use the correct profile when reading from S3. For now you should be able to work around this by ensuring the profile's region is in sync with the config passed into the catalog, or by passing the S3 properties into the catalog explicitly.
edit: just saw the message above, the fix is also there
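The region-sync workaround described above can be sketched as follows. This is a minimal, hedged example: the region value and catalog name are placeholders, and the actual `load_catalog` call is shown only as a comment since it requires live AWS access.

```python
# Sketch of the workaround: pass the S3 region explicitly so the
# inferred FileIO does not fall back to the default AWS profile.
# All values below are placeholders.
catalog_props = {
    "type": "glue",
    # Catalog-level (boto) region, used by the Glue client:
    "region_name": "eu-west-1",
    # FileIO-level region, used by pyarrow's S3FileSystem -- keep in sync:
    "s3.region": "eu-west-1",
}

# The two region values must agree, otherwise S3 calls can hit the
# wrong regional endpoint and fail (e.g. HTTP 400 on HeadObject).
assert catalog_props["region_name"] == catalog_props["s3.region"]

# With pyiceberg installed and credentials configured, you would then do:
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("glue_catalog", **catalog_props)
```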
@geruh thanks for the explanation! Would you say this is a bug in how pyiceberg configures S3? I'm not familiar with the AWS profile config. It seems like if a profile config is passed in, we don't want to override other S3 options.
No problem! This could potentially be a bug if we assume that the catalog and the FileIO (S3) share the same AWS profile configs. On one side, having a single profile configuration is convenient for the user's boto client, as it allows initializing all AWS clients with the correct credentials. On the other hand, we could argue that this configuration should only work at the catalog level, and that filesystems might require separate configurations. I'm inclined towards the first option. However, we are using pyarrow's S3FileSystem implementation, which has no concept of an AWS profile, so we will need to initialize these values through boto's session.get_credentials() and pass them to the filesystem. I'll raise an issue for this.
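The bridging idea described above can be sketched like this. The helper name below is ours, not pyiceberg's; it only maps resolved credential fields onto the keyword names accepted by `pyarrow.fs.S3FileSystem` (`access_key`, `secret_key`, `session_token`, `region`). The boto3/pyarrow calls are shown as comments, since they need a configured AWS environment.

```python
def s3fs_kwargs_from_credentials(access_key, secret_key, token, region):
    """Map boto-style resolved credentials onto the keyword arguments
    accepted by pyarrow.fs.S3FileSystem, which has no notion of an
    AWS profile. (Hypothetical helper, not part of pyiceberg.)"""
    kwargs = {
        "access_key": access_key,
        "secret_key": secret_key,
        "region": region,
    }
    if token:  # only present for temporary (STS) credentials
        kwargs["session_token"] = token
    return kwargs

# With boto3 and pyarrow installed, this would look roughly like:
# import boto3
# import pyarrow.fs as pafs
# creds = boto3.Session(profile_name="my-profile").get_credentials().get_frozen_credentials()
# fs = pafs.S3FileSystem(**s3fs_kwargs_from_credentials(
#     creds.access_key, creds.secret_key, creds.token, "eu-west-1"))
```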
Thank you! Should we close this in favor of #570?
I have tried both solutions, i.e.:
Traceback (most recent call last):
File "/path/to/the/project/testing.py", line 15, in <module>
table = catalog.load_table((catalog_name, table_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
metadata = FromInputFile.table_metadata(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
with input_file.open() as input_stream:
^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
input_file = self._filesystem.open_input_file(self._path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
Interesting. You can also explicitly set the S3FileIO by passing the s3 configuration properties into the catalog.
This worked for me when I also added the session token information for the S3 configuration.
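For reference, a sketch of a catalog configuration that includes the session token alongside the key pair, as the comment above suggests. All values are placeholders, and the `load_catalog` call is shown as a comment only since it requires live AWS access.

```python
# Placeholder values; "s3.session-token" carries temporary (STS)
# credentials alongside the access key and secret.
props = {
    "type": "glue",
    "s3.region": "eu-west-1",
    "s3.access-key-id": "<access key>",
    "s3.secret-access-key": "<secret key>",
    "s3.session-token": "<session token>",
}

# With pyiceberg installed and real credentials:
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("glue_catalog", **props)
```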
We have the same problem here. My manager and I tried to get it to work in parallel and both ran into the same error. We assumed it was a permission issue, but even with admin credentials it didn't work. We used an access token, tried setting the region manually, and provided an AWS profile name and, alternatively, the access keys. My guess is that it has something to do with the s3fs package used to read the metadata file.
We had the same problem within our Airflow deployment. The easy fix for us would have been to set the default AWS credentials through environment variables:
AWS_ACCESS_KEY_ID=<aws access key>
AWS_DEFAULT_REGION=<aws region>
AWS_SECRET_ACCESS_KEY=<aws secret key>
This, however, wasn't feasible because of deployment issues. What worked was passing both property formats to the catalog:
glue_catalog_conf = {
    "s3.region": <aws region>,
    "s3.access-key-id": <aws access key>,
    "s3.secret-access-key": <aws secret key>,
    "region_name": <aws region>,
    "aws_access_key_id": <aws access key>,
    "aws_secret_access_key": <aws secret key>,
}
catalog: GlueCatalog = load_catalog(
    "some_name",
    type="glue",
    **glue_catalog_conf
)
If you come from a Google search, please take everything that follows with a grain of salt, because we have no previous experience with either pyiceberg or Airflow. Anyway: we came to this conclusion (that we needed to pass both formats) because it seems that the boto client initialization expects one format (the second set in the above snippet):
class GlueCatalog(Catalog):
def __init__(self, name: str, **properties: Any):
super().__init__(name, **properties)
session = boto3.Session(
profile_name=properties.get("profile_name"),
region_name=properties.get("region_name"),
botocore_session=properties.get("botocore_session"),
aws_access_key_id=properties.get("aws_access_key_id"),
aws_secret_access_key=properties.get("aws_secret_access_key"),
aws_session_token=properties.get("aws_session_token"),
)
self.glue: GlueClient = session.client("glue")
And the same set of properties is passed to the FileIO:
io = load_file_io(properties=self.properties, location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)
return Table(
identifier=(self.name, database_name, table_name),
metadata=metadata,
metadata_location=metadata_location,
io=self._load_file_io(metadata.properties, metadata_location),
catalog=self,
) We might be completely off base here, of course, and what ultimately convinced us to adopt the above solution is just that it works, while passing either set of credentials without the other wouldn't work for us. We're using:
We're still unclear on whether it's indeed a bug or we're just using the APIs improperly, any help would be appreciated. Have a nice day! |
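One way to keep the duplicated values in the workaround above from drifting apart is to derive both property formats from a single set of inputs. This is a sketch with a helper name of our own, not a pyiceberg API:

```python
def glue_catalog_props(region, access_key, secret_key):
    """Build both property formats from one set of values: the
    's3.*' keys read by the FileIO and the boto-style keys read by
    the Glue client. (Hypothetical helper, not part of pyiceberg.)"""
    return {
        # FileIO (pyarrow S3FileSystem) format:
        "s3.region": region,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
        # boto3 Session format, used for the Glue client:
        "region_name": region,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }

conf = glue_catalog_props("eu-west-1", "<access key>", "<secret key>")
# catalog = load_catalog("some_name", type="glue", **conf)  # as above
```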
@impproductions Thanks for the detailed explanation, great catch! Looking through the code, there's indeed an expectation for both AWS credential formats.
Opened #892 to track the issue with AWS credential formats |
Fixed in #922 |
Question
PyIceberg version: 0.6.0
Python version: 3.11.1
Comments:
Hi,
I am facing issues loading iceberg tables from AWS Glue.
The code I am using is as follows:
The code allows me to:
But load_table throws the following error:
I have checked that I have the proper access rights, but that wasn't the issue.
I have tried a few other things, but they were all unsuccessful.
The table definition is as follows and was created via Trino: