-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support S3 Table Buckets with S3TablesCatalog #1429
base: main
Are you sure you want to change the base?
Conversation
I was able to work around the issue above by using |
Thanks for working on this @felixscherz Feel free to tag me when its ready for review :) |
I think you can now review this PR if you have time @kevinjqliu :) I currently run tests by setting the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, i added a few comments to clarify the catalog behaviors
I'm a little hesitant to merge this in given that we have to run tests against a production S3 endpoint. Maybe we can mock the endpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'd want to add this to the docs https://py.iceberg.apache.org/configuration/#catalogs
I ran the tests locally
And these 3 testa failed, everything else is ✅
|
Thank you for the review! I removed tests related to boto3 and set the AWS region explicitly for the test run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few more comments.
I was able to run the test locally
AWS_REGION=us-east-2 ARN=... poetry run pytest tests/catalog/test_s3tables.py
after making a few local changes
- poetry update boto3
- add
aws_region
fixture - pass aws_region to catalog
Could you update the PR description so others can test this PR out?
try: | ||
self.s3tables = session.client("s3tables", endpoint_url=properties.get(S3TABLES_ENDPOINT)) | ||
except boto3.session.UnknownServiceError as e: | ||
raise S3TablesError("'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
running poetry update boto3
will bump boto3 version to 1.35.88
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this is already merged, can you rebase?
#1476
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebased my branch. I think we need to keep this check however since as per pyproject.toml
the requirement for boto3
is >=1.24.59
. We could bump that to >=1.35.74
but since that is only required for S3 Tables API I'm not sure about forcing everyone to upgrade their boto3 version just to support S3 Tables. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My two cents: I like this check. I do not think we should enforce dependency version upgrade for adding "optional" new service like s3tables, as it may reduce compatibility and cause additional version conflict.
398e2d7
to
05e4dfd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixscherz Thanks for the great contribution! Looking forward to adding this to PyIceberg! I left some comments. Please let me know what you think.
tests/catalog/test_s3tables.py
Outdated
@@ -0,0 +1,227 @@ | |||
import pytest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we rename this to integration_test_s3tables.py
? We use this naming convention for tests involved real endpoints, like integration_test_glue.py
. I think it will be great to keep a version of testing against real endpoints even after we have moto s3tables available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats a good point, we have integration tests marked for gcs
iceberg-python/tests/io/test_fsspec.py
Line 479 in 551f524
@pytest.mark.gcs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it to integration_test_s3tables.py
and also changed the S3 table bucket arn environment variable to AWS_TEST_S3_TABLE_BUCKET_ARN
to be more explicit, similar to glue: https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2090-L2092
Tests are now run with
AWS_REGION=us-east-2 AWS_TEST_S3_TABLE_BUCKET_ARN=... poetry run pytest tests/catalog/integration_test_s3tables.py
I'll adjust the PR description. Let me know what you think:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool! i'd also @pytest.mark
for now since we dont want this test to run with the other pytests. for example, the current tests will run with make test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can merge the code with just the integration tests marked, but im on the fence for this one and would like to hear what others think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this the file's name starts with integration_*
instead of test_*
, it won't be collected by make test
, so I think we're good for now even without any marks : ).
I am in favor of having both unit tests and integration tests for service integration like s3tables and glue. Although we would need to manually trigger these integration tests every time, they are still very helpful in case when bugs are not caught by mocked tests and when verifying the release.
try: | ||
self.s3tables = session.client("s3tables", endpoint_url=properties.get(S3TABLES_ENDPOINT)) | ||
except boto3.session.UnknownServiceError as e: | ||
raise S3TablesError("'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My two cents: I like this check. I do not think we should enforce dependency version upgrade for adding "optional" new service like s3tables, as it may reduce compatibility and cause additional version conflict.
def commit_table( | ||
self, table: Table, requirements: Tuple[TableRequirement, ...], updates: Tuple[TableUpdate, ...] | ||
) -> CommitTableResponse: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not find the logic for cases when table not exist, which means create_table_transaction
will not be supported in the current version.
iceberg-python/pyiceberg/catalog/__init__.py
Lines 754 to 765 in e41c428
def create_table_transaction( | |
self, | |
identifier: Union[str, Identifier], | |
schema: Union[Schema, "pa.Schema"], | |
location: Optional[str] = None, | |
partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC, | |
sort_order: SortOrder = UNSORTED_SORT_ORDER, | |
properties: Properties = EMPTY_DICT, | |
) -> CreateTableTransaction: | |
return CreateTableTransaction( | |
self._create_staged_table(identifier, schema, location, partition_spec, sort_order, properties) | |
) |
We do not have to support everything in the initial PR. But it will be good to override create_table_transaction
as "Not Implemented" for the s3tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added exceptions for this case for now along with a test. I will have a look at how to implement this properly
pyiceberg/catalog/s3tables.py
Outdated
try: | ||
self.s3tables.create_table( | ||
tableBucketARN=self.table_bucket_arn, namespace=namespace, name=table_name, format="ICEBERG" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anything goes wrong after this point, I think we should clean up the created s3 table by s3tables' delete_table
endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a try/except to delete the s3 table in case something goes wrong with writing the initial metadata.
S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id" | ||
S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key" | ||
S3TABLES_SESSION_TOKEN = "s3tables.session-token" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any case that the s3tables catalog will need a different set of credentials than the underlying FileIO for S3 buckets? It seems we could just re-use the s3.
prefix here to avoid adding a new set of configuration names. I feel after all s3tables still belong to s3 so it is still intuitive. WDYT?
cc @kevinjqliu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel after all s3tables still belong to s3 so it is still intuitive.
I was thinking about this too. i think its a good idea to keep s3 (FileIO) separate from s3tables (Catalog).
its likely that the credentials are the same, but keeping the logical construct separate is important.
maybe we can just support fallback to s3.
and client.
https://github.com/apache/iceberg-python/pull/1429/files#diff-c4084a5be80f826e488a804050557a54d6ecb0307af07ff57028482fb75b7d76R58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using client.
properties should be supported through get_first_property_value
. If we want to support fallbacks to both s3.
and client.
we would need to determine an order. Maybe it's more straightforward to stick with the fallback to client.
like the dynamodb and glue catalog implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's more straightforward to stick with the fallback to client. like the dynamodb and glue catalog implementations?
I agree esp this separates fileio properties from catalog properties. we should treat s3 tables as S3TablesCatalog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s3 (FileIO) separate from s3tables (Catalog).
Good point. I agree that we need to reserve the s3tables
for S3TablesCatalog. My concern is that when configuring credentials, users will probably always use client.*
prefix since technically s3tables catalog and s3 file io never use different set of credentials. In this case, do we really need another set of keys to configure credentials for s3tables only? Would love to hear what you think.
Maybe it's more straightforward to stick with the fallback to client. like the dynamodb and glue catalog implementations?
I also agree. Thinking back, bringing s3.
here seems to create more confusion/complexities than value. It is more clear that we stick to use client.*
to represent shared properties across clients (glue, s3, s3tables).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
users will probably always use client.* prefix since technically s3tables catalog and s3 file io never use different set of credentials
yea the only feature of having s3tables.
is the ability to assign a different credential from s3 file io. i think we should allow this use case but can recommend users to just set client.
to take care of both file io and catalog configs. similar to the glue "Client-specific Properties" section https://py.iceberg.apache.org/configuration/#glue-catalog
can you run |
Co-authored-by: Kevin Liu <[email protected]>
Co-authored-by: Honah J. <[email protected]>
894cbc9
to
2e1c383
Compare
Hi, this is in regards to #1404.
I created a first draft of an
S3TablesCatalog
that uses the S3 Table Bucket API for catalog operations.How to run tests
Since moto does not support mocking the S3 Tables API yet (WIP: getmoto/moto#8470) we have to run tests against a live AWS account. To do that, create an S3 Tables Bucket in one of the supported regions and then set the table bucket ARN and AWS Region as environment variables