-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Standardize AWS credential names #922
Conversation
@@ -46,6 +48,10 @@ | |||
|
|||
logger = logging.getLogger(__name__) | |||
|
|||
AWS_REGION = "client.region" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose client.
for unified configurations because in java:
https://github.com/apache/iceberg/blob/c68abfc9fd3956077b43aba20441f089bb8b93d6/aws/src/main/java/org/apache/iceberg/aws/AwsClientProperties.java#L67
client.region
is used for configurations that affect all the clients.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, that's a good find 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to highlight this in the PR description!
went to generate the mkdocs and spawned #923 but I think the approach looks good. The only way to make less repeatable would be to have some factory to generate the constant names with a given prefix but that seems like overkill until there are more AWS services involved (if that happens). I think keeping the s3 prefix also makes sense in the case a user is backing blob storage with MinIO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @HonahX - this looks great, thank you for working on this standardization effort.
It took a minute for me to get up to speed with the frustrations the users have been experiencing with the choices of AWS configurations due to my lack of experience using glue or dynamodb. Given that there are so many issues that are currently open regarding this family of issues, I think this will be a very welcome change for that user base.
I left a few nits, but the approach looks safe and clean enough for us to push for it to be merged into 0.7.0 in conjunction with some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! I think this will solve a lot of issues currently around AWS credentials.
I like the current approach with the client-specific config and fallback to the general config.
We should add a few test cases around the issues that have been reported.
I wonder if we can treat the old property names (i.e. aws_access_key_id
) as an alias for the fallback config (client.access-key-id
) versus fully deprecating it in 0.8.0
@kevinjqliu Thanks for raising this! I chose to deprecate Please let me know if these reasons for the deprecation make sense. We could also consider postponing the deprecation to a later release if more discussion is needed or extending the deprecation notice period to 0.9.0 or later to give users more time to adjust. |
For example, `PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID`, sets `s3.access-key-id` on the `default` catalog. | ||
|
||
# Tables | ||
## Tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aws_session_token=properties.get("aws_session_token"), | ||
profile_name=PropertyUtil.get_first_property_value(properties, DYNAMODB_PROFILE_NAME, DEPRECATED_PROFILE_NAME), | ||
region_name=PropertyUtil.get_first_property_value(properties, DYNAMODB_REGION, AWS_REGION, DEPRECATED_REGION), | ||
botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be more reasonable to stop exposing the botocore_session
configuration:
- Unlike other configs which takes string, the botocore_session takes a botocore.Session instance, which is not aligned with the convention of catalog properties to be a
Dict[str, str]
- I think we should hide this level of details from users.
if property_value := properties.get(property_name): | ||
return property_value | ||
return None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be in a follow-up PR: I am considering whether it is worth moving the PropertyUtil
class to a module in utils. Currently, PropertyUtil is used in the table, catalog, and io modules. Given its wide usage and the fact that it is a standalone utility class, I believe it can be extracted from the table module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. We can probably reorganize many of our functions to avoid circular dependencies 🙂
@@ -117,6 +124,12 @@ | |||
ICEBERG_FIELD_OPTIONAL = "iceberg.field.optional" | |||
ICEBERG_FIELD_CURRENT = "iceberg.field.current" | |||
|
|||
GLUE_PROFILE_NAME = "glue.profile-name" | |||
GLUE_REGION = "glue.region" | |||
GLUE_ACCESS_KEY_ID = "glue.access-key-id" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How common is it to have a separate access-key-id
for glue and the table itself? The S3FileIO would not pick up the glue.access-key-id
:
iceberg-python/pyiceberg/io/pyarrow.py
Line 349 in 3f44dfe
"access_key": self.properties.get(S3_ACCESS_KEY_ID), |
This way you would need to set both glue.access-key-id
(which is used for authenticating against Glue), and then you need to set s3.access-key-id
to ensure that it can access the metadata on the bucket. Maybe good to highlight this on the docs, WDYT?
I'm not an AWS expert, but my gut feeling is that normally people rely on AWS_ACCESS_KEY_ID
to be picked up for both Glue and S3, but I prefer to be able to put this in the ~/.pyiceberg.yaml
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions! I've updated the doc to explicitly indicating that glue.*
properties are for Glue Catalog only. I've also added the s3.*
properties to the example to make this clear.
but my gut feeling is that normally people rely on AWS_ACCESS_KEY_ID to be picked up for both Glue and S3, but I prefer to be able to put this in the ~/.pyiceberg.yaml as well.
The client.*
properties will cover this case. When user set client.*
properties, both catalog and FileIO can pick up the credentials. However, due to the limitation of pyarrow's S3FileSystem as noted in #570, we cannot yet support a unified profile-name
property.
I added a separate section Unified AWS Credentials
for the client.*
properties. Please let me know WDYT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, I'm good with that now. At some point we should have a bigger conversation across languages to unify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments, overall LGTM!
Also thanks for making the docs better!!
@@ -46,6 +48,10 @@ | |||
|
|||
logger = logging.getLogger(__name__) | |||
|
|||
AWS_REGION = "client.region" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to highlight this in the PR description!
tests/catalog/test_dynamodb.py
Outdated
"client.secret-access-key": "client.secret-access-key", | ||
"client.region": "client.region", | ||
"client.session-token": "client.session-token", | ||
"aws_access_key_id": "aws_access_key_id", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can make these reuseable. i noticed this is used frequently
"aws_access_key_id": "aws_access_key_id",
"aws_secret_access_key": "aws_secret_access_key",
"aws_session_token": "aws_session_token",
"region_name": "region_name",
"profile_name": "profile_name",
Perhaps
deprecated_properties = {
"aws_access_key_id": "aws_access_key_id",
"aws_secret_access_key": "aws_secret_access_key",
"aws_session_token": "aws_session_token",
"region_name": "region_name",
"profile_name": "profile_name",
}
session_properties = {
...,
**deprecated_properties
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I've extracted deprecated properties and unified configurations (client.*) to
conftest.py`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for working on this in time for the release!! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks @HonahX for fixing this 🙌
There has been many discussions and concerns over the current behavior of loading AWS credential for glue/dynamo catalog: #892, #515, #570
This PR tries to standardize the property names of AWS credential configs by introducing:
client.access-key-id
aws_access_key_id
.For example, in the use case of GlueCatalog + S3, the order of config that each client will look for is
Glue
s3
Note: the choice of names for Unified AWS Configurations
I chose
client.
for unified configurations because in java:https://github.com/apache/iceberg/blob/c68abfc9fd3956077b43aba20441f089bb8b93d6/aws/src/main/java/org/apache/iceberg/aws/AwsClientProperties.java#L67
client.region
is used for configurations that affect all the clients.cc @kevinjqliu @jayceslesar @Fokko @syun64