
PyArrow S3FileSystem doesn't honor the AWS profile config #570

Closed
geruh opened this issue Apr 2, 2024 · 3 comments

Comments

@geruh
Contributor

geruh commented Apr 2, 2024

Apache Iceberg version

main (development)

Please describe the bug 🐞

When initializing the GlueCatalog with a specific AWS profile, everything works as it should for catalog operations. But we've hit an issue when it comes to working with S3 via the PyArrow S3FileSystem. We allow users to specify a profile for initiating a boto connection; however, this preference doesn't carry over to the S3FileSystem. Instead of using the specified AWS profile, we check the catalog configs for S3 properties such as s3.access-key-id, s3.region, etc. If those aren't passed in, PyArrow's S3FileSystem has its own strategy for inferring credentials:

  1. the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables.
  2. the default profile credentials in your ~/.aws/credentials and ~/.aws/config.

This workflow leads to some inconsistencies. For example, while Glue operations might be using a user-specified profile, S3 operations could end up using a different set of credentials, or even a different region, taken from the environment variables or the default AWS config files. This is seen in issue #515, where one region (like us-west-2) unexpectedly switches to another (like us-east-1), causing a 301 exception.

For example:

  1. Set up ~/.aws/config so that the default profile has the wrong region and a test profile has the correct one:
[default]
region = us-east-1

[test]
region = us-west-2
  2. Initialize the GlueCatalog with the correct region you want to use:
catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
)
  3. Load a table:
catalog.load_table("default.test")
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'test/metadata/00000-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
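A workaround in the meantime is to pass the S3 properties explicitly so they reach the PyArrow filesystem. A minimal sketch reusing the repro above (s3.region is the existing FileIO property mentioned earlier):

import pyiceberg.catalog

catalog = pyiceberg.catalog.load_catalog(
    catalog_name,
    **{
        "type": "glue",
        "profile_name": "test",    # picked up by the Glue boto client
        "region_name": "us-west-2",
        "s3.region": "us-west-2",  # forwarded to the PyArrow S3FileSystem
    },
)
catalog.load_table("default.test")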

On one hand, we could argue that this profile configuration should only apply at the catalog level, and that for filesystems the user must specify the aforementioned configs like s3.region. But on the other hand, it seems reasonable that the AWS profile config should work uniformly across both the catalog and filesystem levels. This unified approach would certainly simplify configuration management for users, and I'm leaning towards this perspective. However, we're currently utilizing PyArrow's S3FileSystem, which doesn't inherently support AWS profiles. This means we'd need to bridge that gap manually.
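For what it's worth, one way to bridge that gap manually would be to resolve the profile's credentials with boto3 and hand them to PyArrow explicitly. A rough sketch of the idea (not something pyiceberg does today):

import boto3
from pyarrow import fs

# Resolve the named profile ourselves, since PyArrow has no notion of AWS profiles.
session = boto3.Session(profile_name="test")
creds = session.get_credentials().get_frozen_credentials()

s3 = fs.S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,
    region=session.region_name,
)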

cc: @HonahX @Fokko @kevinjqliu

@HonahX
Contributor

HonahX commented Apr 3, 2024

@geruh, thanks for highlighting this issue. The confusion largely stems from the naming convention used when the profile_name, region_name, aws_access_key_id, etc., were introduced in #7781. Initially, these configurations were intended solely for GlueCatalog, but their generic names suggest they might influence both Glue and S3 operations. To address this, we can consider renaming these configurations with a glue. prefix (e.g., glue.profile_name) to clarify their scope. However, to maintain API compatibility, we may need to support both the new and old naming conventions temporarily.
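A rough sketch of what supporting both names could look like (get_glue_property is a hypothetical helper, not existing code):

import warnings

def get_glue_property(properties: dict, name: str):
    # Prefer the new "glue."-prefixed key; fall back to the legacy generic key.
    new_key = f"glue.{name}"  # e.g. "glue.profile_name"
    if new_key in properties:
        return properties[new_key]
    if name in properties:  # legacy key, e.g. "profile_name"
        warnings.warn(f"'{name}' is deprecated, use '{new_key}' instead", DeprecationWarning)
        return properties[name]
    return None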

But on the other hand it seems reasonable that the AWS profile config should work uniformly across both the catalog and filesystem levels.

+1 for unified configurations. I think it may be convenient to introduce additional unified configurations with generic names like aws-access-key-id. So the overall order of precedence would be (a rough resolution sketch follows the list):

  1. Client-specific configs: glue.access-key-id, s3.access-key-id, etc.
  2. Unified AWS configurations like aws-access-key-id
  3. Environment variables and the default AWS config
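Something along these lines (resolve_access_key is hypothetical, and aws-access-key-id is only the proposed unified name):

import os

def resolve_access_key(properties: dict, client_prefix: str):
    # Resolve one credential field in the proposed order of precedence.
    return (
        properties.get(f"{client_prefix}.access-key-id")  # 1. client-specific, e.g. "s3." or "glue."
        or properties.get("aws-access-key-id")            # 2. proposed unified config
        or os.environ.get("AWS_ACCESS_KEY_ID")            # 3. environment variables / default AWS chain
    )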

However, we're currently utilizing PyArrow's S3FileSystem, which doesn't inherently support AWS profiles. This means we'd need to bridge that gap manually.

Regarding the profile_name support for PyArrow's S3FileSystem, it seems there might not be a direct solution from the pyiceberg side. This functionality appears to be more suitably addressed through enhancements to the PyArrow library itself. WDYT?

@kevinjqliu
Contributor

I think it makes sense to have both a "catalog level" configuration and a "file level" configuration. A catalog might have a different set of permissions than those used when reading specific tables or files.

I like the idea of having specific configurations at each level and also a generic "fall back" configuration.

@kevinjqliu
Contributor

Fixed in #922
Glue and S3 can have separate configs, as well as use the same unified config.
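For anyone landing here later, that means configuration along these lines (a sketch; "my_catalog" is a placeholder, and the exact property names should be checked against the current PyIceberg docs):

import pyiceberg.catalog

# Separate configs per client:
catalog = pyiceberg.catalog.load_catalog(
    "my_catalog",
    **{"type": "glue", "glue.region": "us-west-2", "s3.region": "us-west-2"},
)

# Or one unified client config shared by Glue and S3:
catalog = pyiceberg.catalog.load_catalog(
    "my_catalog",
    **{"type": "glue", "client.region": "us-west-2"},
)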
