Description
Apache Iceberg version
main (development)
Please describe the bug 🐞
When initializing the GlueCatalog with a specific AWS profile, catalog operations work as expected. However, we've hit an issue when working with S3 via the PyArrow S3FileSystem. We allow users to specify a profile for initiating a boto connection, but this preference doesn't carry over to the S3FileSystem. Instead of using the specified AWS profile, we check the catalog configs for the S3 settings (s3.access-key-id, s3.region, ...). If those aren't passed in, PyArrow's S3FileSystem has its own strategy for inferring credentials, such as:
- the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables.
- the default profile credentials in your ~/.aws/credentials and ~/.aws/config.
This workflow leads to inconsistencies. For example, while Glue operations use the user-specified profile, S3 operations can end up using a different set of credentials, or even a different region, taken from the environment variables or the AWS config files. This is seen in issue #515, where one region (like us-west-2) unexpectedly switches to another (like us-east-1), causing an HTTP 301 error.
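The fallback described above can be modelled in a few lines. This is a simplified sketch of the lookup order as described in this issue, not PyArrow's or PyIceberg's actual implementation. The function name `resolve_s3_region` is hypothetical, the environment variable names are assumed for illustration, and the `[profile test]` section header follows the AWS config-file convention for named profiles.

```python
import configparser

# Illustrative stand-in for ~/.aws/config.
AWS_CONFIG = """
[default]
region = us-east-1

[profile test]
region = us-west-2
"""


def resolve_s3_region(catalog_props, env, aws_config_text):
    """Simplified model of the region fallback order described above."""
    # 1. An explicit filesystem setting on the catalog wins.
    if "s3.region" in catalog_props:
        return catalog_props["s3.region"]
    # 2. Then environment variables (names assumed for illustration).
    for var in ("AWS_REGION", "AWS_DEFAULT_REGION"):
        if env.get(var):
            return env[var]
    # 3. Finally the *default* profile. Note that "profile_name" and
    #    "region_name" from the catalog properties are never consulted.
    parser = configparser.ConfigParser()
    parser.read_string(aws_config_text)
    return parser.get("default", "region", fallback=None)


props = {"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
print(resolve_s3_region(props, {}, AWS_CONFIG))  # us-east-1
```

Even though the catalog was asked for the test profile and us-west-2, the modelled filesystem lookup lands on us-east-1, which matches the mismatch behind the 301.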
For example:
- Set up ~/.aws/config so that the default profile's region differs from the region you actually want to use:
[default]
region = us-east-1
[profile test]
region = us-west-2
- Initialize the GlueCatalog with the correct region you want to use:
catalog = pyiceberg.catalog.load_catalog(
catalog_name, **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
)
- Load a table:
catalog.load_table("default.test")
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'test/metadata/00000-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
On one hand, we could argue that the profile configuration should only apply at the catalog level, and that for filesystems the user must specify the aforementioned configs such as s3.region. On the other hand, it seems reasonable that the AWS profile config should work uniformly across both the catalog and filesystem levels. This unified approach would simplify configuration management for users, and I'm leaning towards it. However, we currently use PyArrow's S3FileSystem, which doesn't inherently support AWS profiles, so we'd need to bridge that gap manually.
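One possible way to bridge the gap, as a minimal sketch rather than a proposed implementation: resolve the profile with boto3 and pass the result on as explicit S3 properties. The helper name `s3_properties_from_session` is hypothetical. The property keys other than s3.access-key-id and s3.region are assumed to be the matching PyIceberg FileIO keys. The helper accepts any object that exposes `region_name` and `get_credentials()` the way `boto3.Session` does.

```python
def s3_properties_from_session(session):
    """Translate a boto3-style session into explicit S3 filesystem properties."""
    props = {}
    # Region resolved by the session (from the profile or an explicit override).
    if session.region_name:
        props["s3.region"] = session.region_name
    # get_credentials() returns None when no credentials could be resolved.
    credentials = session.get_credentials()
    if credentials is not None:
        frozen = credentials.get_frozen_credentials()
        props["s3.access-key-id"] = frozen.access_key
        props["s3.secret-access-key"] = frozen.secret_key
        # Only temporary credentials carry a session token.
        if frozen.token:
            props["s3.session-token"] = frozen.token
    return props


# Hypothetical usage (requires boto3 and pyiceberg):
#   import boto3
#   session = boto3.Session(profile_name="test", region_name="us-west-2")
#   catalog = load_catalog(
#       catalog_name,
#       **{"type": "glue", "profile_name": "test", "region_name": "us-west-2",
#          **s3_properties_from_session(session)},
#   )
```

One caveat with this approach: the credentials are frozen at the time of the call, so temporary credentials would not be refreshed once they expire.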
cc: @HonahX @Fokko @kevinjqliu