Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize AWS credential names #922

Merged
merged 14 commits into from
Jul 19, 2024
Merged

Conversation

HonahX
Copy link
Contributor

@HonahX HonahX commented Jul 13, 2024

There has been many discussions and concerns over the current behavior of loading AWS credential for glue/dynamo catalog: #892, #515, #570

This PR tries to standardize the property names of AWS credential configs by introducing:

  • New client-specific configs: glue.access-key-id, dynamodb.access-key-id,
  • Unified AWS configurations: client.access-key-id
  • Deprecate old names such as aws_access_key_id.

For example, in the use case of GlueCatalog + S3, the order of config that each client will look for is

Glue

  1. glue.access-key-id
  2. client.access-key-id
  3. aws_access_key_id (deprecated)

s3

  1. s3.access-key-id
  2. client.access-key-id

Note: the choice of names for Unified AWS Configurations

I chose client. for unified configurations because in java:
https://github.com/apache/iceberg/blob/c68abfc9fd3956077b43aba20441f089bb8b93d6/aws/src/main/java/org/apache/iceberg/aws/AwsClientProperties.java#L67
client.region is used for configurations that affect all the clients.

cc @kevinjqliu @jayceslesar @Fokko @syun64

@@ -46,6 +48,10 @@

logger = logging.getLogger(__name__)

AWS_REGION = "client.region"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose client. for unified configurations because in java:
https://github.com/apache/iceberg/blob/c68abfc9fd3956077b43aba20441f089bb8b93d6/aws/src/main/java/org/apache/iceberg/aws/AwsClientProperties.java#L67
client.region is used for configurations that affect all the clients.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, that's a good find 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to highlight this in the PR description!

@jayceslesar
Copy link
Contributor

jayceslesar commented Jul 13, 2024

went to generate the mkdocs and spawned #923 but I think the approach looks good. The only way to make less repeatable would be to have some factory to generate the constant names with a given prefix but that seems like overkill until there are more AWS services involved (if that happens).

I think keeping the s3 prefix also makes sense in the case a user is backing blob storage with MinIO

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HonahX - this looks great, thank you for working on this standardization effort.

It took a minute for me to get up to speed with the frustrations the users have been experiencing with the choices of AWS configurations due to my lack of experience using glue or dynamodb. Given that there are so many issues that are currently open regarding this family of issues, I think this will be a very welcome change for that user base.

I left a few nits, but the approach looks safe and clean enough for us to push for it to be merged into 0.7.0 in conjunction with some tests.

pyiceberg/catalog/__init__.py Outdated Show resolved Hide resolved
pyiceberg/io/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I think this will solve a lot of issues currently around AWS credentials.
I like the current approach with the client-specific config and fallback to the general config.

We should add a few test cases around the issues that have been reported.

I wonder if we can treat the old property names (i.e. aws_access_key_id) as an alias for the fallback config (client.access-key-id) versus fully deprecating it in 0.8.0

@HonahX
Copy link
Contributor Author

HonahX commented Jul 14, 2024

I wonder if we can treat the old property names (i.e. aws_access_key_id) as an alias for the fallback config (client.access-key-id) versus fully deprecating it in 0.8.0

@kevinjqliu Thanks for raising this! I chose to deprecate aws_access_key_id because I would like to take this chance to formalize the property names in pyiceberg (iceberg), to use <service name>. prefix and hyphens to connect words. Another reason is that I think it is better to ultimately having one property name for a specific functionality to avoid confusion.

Please let me know if these reasons for the deprecation make sense. We could also consider postponing the deprecation to a later release if more discussion is needed or extending the deprecation notice period to 0.9.0 or later to give users more time to adjust.

For example, `PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID`, sets `s3.access-key-id` on the `default` catalog.

# Tables
## Tables
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the extra # so that the Table of contents sidebar can be rendered correctly:
Screenshot 2024-07-14 at 02 15 54

aws_session_token=properties.get("aws_session_token"),
profile_name=PropertyUtil.get_first_property_value(properties, DYNAMODB_PROFILE_NAME, DEPRECATED_PROFILE_NAME),
region_name=PropertyUtil.get_first_property_value(properties, DYNAMODB_REGION, AWS_REGION, DEPRECATED_REGION),
botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be more reasonable to stop exposing the botocore_session configuration:

  1. Unlike other configs which takes string, the botocore_session takes a botocore.Session instance, which is not aligned with the convention of catalog properties to be a Dict[str, str]
  2. I think we should hide this level of details from users.

if property_value := properties.get(property_name):
return property_value
return None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be in a follow-up PR: I am considering whether it is worth moving the PropertyUtil class to a module in utils. Currently, PropertyUtil is used in the table, catalog, and io modules. Given its wide usage and the fact that it is a standalone utility class, I believe it can be extracted from the table module.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. We can probably reorganize many of our functions to avoid circular dependencies 🙂

@HonahX HonahX marked this pull request as ready for review July 14, 2024 09:48
@HonahX HonahX linked an issue Jul 14, 2024 that may be closed by this pull request
@@ -117,6 +124,12 @@
ICEBERG_FIELD_OPTIONAL = "iceberg.field.optional"
ICEBERG_FIELD_CURRENT = "iceberg.field.current"

GLUE_PROFILE_NAME = "glue.profile-name"
GLUE_REGION = "glue.region"
GLUE_ACCESS_KEY_ID = "glue.access-key-id"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How common is it to have a separate access-key-id for glue and the table itself? The S3FileIO would not pick up the glue.access-key-id:

"access_key": self.properties.get(S3_ACCESS_KEY_ID),

This way you would need to set both glue.access-key-id (which is used for authenticating against Glue), and then you need to set s3.access-key-id to ensure that it can access the metadata on the bucket. Maybe good to highlight this on the docs, WDYT?

I'm not an AWS expert, but my gut feeling is that normally people rely on AWS_ACCESS_KEY_ID to be picked up for both Glue and S3, but I prefer to be able to put this in the ~/.pyiceberg.yaml as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions! I've updated the doc to explicitly indicating that glue.* properties are for Glue Catalog only. I've also added the s3.* properties to the example to make this clear.

but my gut feeling is that normally people rely on AWS_ACCESS_KEY_ID to be picked up for both Glue and S3, but I prefer to be able to put this in the ~/.pyiceberg.yaml as well.

The client.* properties will cover this case. When user set client.* properties, both catalog and FileIO can pick up the credentials. However, due to the limitation of pyarrow's S3FileSystem as noted in #570, we cannot yet support a unified profile-name property.

I added a separate section Unified AWS Credentials for the client.* properties. Please let me know WDYT.

Here are some preview of the updated page:
Screenshot 2024-07-15 at 00 04 10
Screenshot 2024-07-15 at 00 05 18

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I'm good with that now. At some point we should have a bigger conversation across languages to unify this.

@Fokko Fokko added this to the PyIceberg 0.7.0 release milestone Jul 14, 2024
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments, overall LGTM!
Also thanks for making the docs better!!

@@ -46,6 +48,10 @@

logger = logging.getLogger(__name__)

AWS_REGION = "client.region"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to highlight this in the PR description!

"client.secret-access-key": "client.secret-access-key",
"client.region": "client.region",
"client.session-token": "client.session-token",
"aws_access_key_id": "aws_access_key_id",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can make these reuseable. i noticed this is used frequently

        "aws_access_key_id": "aws_access_key_id",
        "aws_secret_access_key": "aws_secret_access_key",
        "aws_session_token": "aws_session_token",
        "region_name": "region_name",
        "profile_name": "profile_name",

Perhaps

deprecated_properties = {
        "aws_access_key_id": "aws_access_key_id",
        "aws_secret_access_key": "aws_secret_access_key",
        "aws_session_token": "aws_session_token",
        "region_name": "region_name",
        "profile_name": "profile_name",
}

session_properties = {
        ...,
        **deprecated_properties
} 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I've extracted deprecated properties and unified configurations (client.*) to conftest.py`

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for working on this in time for the release!! 👍

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks @HonahX for fixing this 🙌

@Fokko Fokko merged commit d0bfb4a into apache:main Jul 19, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Load the proper AWS credential for glue/dynamodb catalog
5 participants