Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support S3 Table Buckets with S3TablesCatalog #1429

Open
wants to merge 57 commits into
base: main
Choose a base branch
from

Conversation

felixscherz
Copy link
Contributor

@felixscherz felixscherz commented Dec 14, 2024

Hi, this is in regards to #1404.

I created a first draft of an S3TablesCatalog that uses the S3 Table Bucket API for catalog operations.

How to run tests

Since moto does not support mocking the S3 Tables API yet (WIP: getmoto/moto#8470) we have to run tests against a live AWS account. To do that, create an S3 Tables Bucket in one of the supported regions and then set the table bucket ARN and AWS Region as environment variables

AWS_REGION=us-east-2 AWS_TEST_S3_BUCKET_ARN=... poetry run pytest tests/catalog/integration_test_s3tables.py

@felixscherz
Copy link
Contributor Author

I was able to work around the issue above by using FsspecFileIO instead of the default PyarrowFileIO. Using FsspecFileIO the catalog is now able to create new tables.

@kevinjqliu
Copy link
Contributor

Thanks for working on this @felixscherz Feel free to tag me when its ready for review :)

@felixscherz
Copy link
Contributor Author

I think you can now review this PR if you have time @kevinjqliu :)
The biggest issue for now will be that testing is only possible against AWS itself since moto does not support the s3tables API yet. I created an issue on the moto side but have not had the time to implement it myself getmoto/moto#8422.

I currently run tests by setting the ARN env variable to that of an s3 table bucket I created within my personal AWS account:

https://github.com/felixscherz/iceberg-python/blob/feat/s3tables-catalog/tests/catalog/test_s3tables.py#L24-L31

@felixscherz felixscherz marked this pull request as ready for review December 29, 2024 15:51
@felixscherz felixscherz changed the title WIP: feat: support S3 Table Buckets with S3TablesCatalog feat: support S3 Table Buckets with S3TablesCatalog Dec 29, 2024
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, i added a few comments to clarify the catalog behaviors

I'm a little hesitant to merge this in given that we have to run tests against a production S3 endpoint. Maybe we can mock the endpoint?

pyiceberg/catalog/s3tables.py Show resolved Hide resolved
tests/catalog/test_s3tables.py Outdated Show resolved Hide resolved
tests/catalog/test_s3tables.py Outdated Show resolved Hide resolved
tests/catalog/test_s3tables.py Outdated Show resolved Hide resolved
tests/catalog/test_s3tables.py Outdated Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd want to add this to the docs https://py.iceberg.apache.org/configuration/#catalogs

pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Show resolved Hide resolved
@kevinjqliu
Copy link
Contributor

I ran the tests locally ARN=arn:aws:s3tables:us-east-2:... poetry run pytest tests/catalog/test_s3tables.py
had to manually add s3tables.region to the catalog config

    properties = {"s3tables.table-bucket-arn": table_bucket_arn, "s3tables.region": "us-east-2", "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

And these 3 testa failed, everything else is ✅

FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_conflicting_version_tokens - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_preexisting_table - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_creating_catalog_validates_s3_table_bucket_exists - botocore.exceptions.NoRegionError: You must specify a region.

@felixscherz
Copy link
Contributor Author

Thank you for the review!

I removed tests related to boto3 and set the AWS region explicitly for the test run.
I agree with you that we should not merge this as long as the only option for running tests is to run them against a live AWS account. I'm currently working on supporting s3tables with moto: getmoto/moto#8470.
We can hold off on this PR until moto can support s3tables so we can run tests against a mock endpoint?

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few more comments.

I was able to run the test locally

AWS_REGION=us-east-2 ARN=... poetry run pytest tests/catalog/test_s3tables.py

after making a few local changes

  • poetry update boto3
  • add aws_region fixture
  • pass aws_region to catalog

Could you update the PR description so others can test this PR out?

pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Show resolved Hide resolved
tests/catalog/test_s3tables.py Outdated Show resolved Hide resolved
tests/catalog/test_s3tables.py Outdated Show resolved Hide resolved
tests/catalog/test_s3tables.py Outdated Show resolved Hide resolved
try:
self.s3tables = session.client("s3tables", endpoint_url=properties.get(S3TABLES_ENDPOINT))
except boto3.session.UnknownServiceError as e:
raise S3TablesError("'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

running poetry update boto3 will bump boto3 version to 1.35.88

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this is already merged, can you rebase?
#1476

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebased my branch. I think we need to keep this check however since as per pyproject.toml the requirement for boto3 is >=1.24.59. We could bump that to >=1.35.74 but since that is only required for S3 Tables API I'm not sure about forcing everyone to upgrade their boto3 version just to support S3 Tables. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My two cents: I like this check. I do not think we should enforce dependency version upgrade for adding "optional" new service like s3tables, as it may reduce compatibility and cause additional version conflict.

@felixscherz felixscherz force-pushed the feat/s3tables-catalog branch from 398e2d7 to 05e4dfd Compare January 6, 2025 16:27
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixscherz Thanks for the great contribution! Looking forward to adding this to PyIceberg! I left some comments. Please let me know what you think.

@@ -0,0 +1,227 @@
import pytest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we rename this to integration_test_s3tables.py? We use this naming convention for tests involved real endpoints, like integration_test_glue.py. I think it will be great to keep a version of testing against real endpoints even after we have moto s3tables available.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats a good point, we have integration tests marked for gcs

@pytest.mark.gcs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to integration_test_s3tables.py and also changed the S3 table bucket arn environment variable to AWS_TEST_S3_TABLE_BUCKET_ARN to be more explicit, similar to glue: https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2090-L2092

Tests are now run with

AWS_REGION=us-east-2 AWS_TEST_S3_TABLE_BUCKET_ARN=... poetry run pytest tests/catalog/integration_test_s3tables.py

I'll adjust the PR description. Let me know what you think:)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! i'd also @pytest.mark for now since we dont want this test to run with the other pytests. for example, the current tests will run with make test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can merge the code with just the integration tests marked, but im on the fence for this one and would like to hear what others think

Copy link
Contributor

@HonahX HonahX Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this the file's name starts with integration_* instead of test_*, it won't be collected by make test, so I think we're good for now even without any marks : ).

I am in favor of having both unit tests and integration tests for service integration like s3tables and glue. Although we would need to manually trigger these integration tests every time, they are still very helpful in case when bugs are not caught by mocked tests and when verifying the release.

pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
try:
self.s3tables = session.client("s3tables", endpoint_url=properties.get(S3TABLES_ENDPOINT))
except boto3.session.UnknownServiceError as e:
raise S3TablesError("'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My two cents: I like this check. I do not think we should enforce dependency version upgrade for adding "optional" new service like s3tables, as it may reduce compatibility and cause additional version conflict.

pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
Comment on lines +71 to +75
def commit_table(
self, table: Table, requirements: Tuple[TableRequirement, ...], updates: Tuple[TableUpdate, ...]
) -> CommitTableResponse:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find the logic for cases when table not exist, which means create_table_transaction will not be supported in the current version.

def create_table_transaction(
self,
identifier: Union[str, Identifier],
schema: Union[Schema, "pa.Schema"],
location: Optional[str] = None,
partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
sort_order: SortOrder = UNSORTED_SORT_ORDER,
properties: Properties = EMPTY_DICT,
) -> CreateTableTransaction:
return CreateTableTransaction(
self._create_staged_table(identifier, schema, location, partition_spec, sort_order, properties)
)

We do not have to support everything in the initial PR. But it will be good to override create_table_transaction as "Not Implemented" for the s3tables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added exceptions for this case for now along with a test. I will have a look at how to implement this properly

pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
try:
self.s3tables.create_table(
tableBucketARN=self.table_bucket_arn, namespace=namespace, name=table_name, format="ICEBERG"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything goes wrong after this point, I think we should clean up the created s3 table by s3tables' delete_table endpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a try/except to delete the s3 table in case something goes wrong with writing the initial metadata.

pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
S3TABLES_SESSION_TOKEN = "s3tables.session-token"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case that the s3tables catalog will need a different set of credentials than the underlying FileIO for S3 buckets? It seems we could just re-use the s3. prefix here to avoid adding a new set of configuration names. I feel after all s3tables still belong to s3 so it is still intuitive. WDYT?

cc @kevinjqliu

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel after all s3tables still belong to s3 so it is still intuitive.

I was thinking about this too. i think its a good idea to keep s3 (FileIO) separate from s3tables (Catalog).
its likely that the credentials are the same, but keeping the logical construct separate is important.

maybe we can just support fallback to s3. and client.
https://github.com/apache/iceberg-python/pull/1429/files#diff-c4084a5be80f826e488a804050557a54d6ecb0307af07ff57028482fb75b7d76R58

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using client. properties should be supported through get_first_property_value. If we want to support fallbacks to both s3. and client. we would need to determine an order. Maybe it's more straightforward to stick with the fallback to client. like the dynamodb and glue catalog implementations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's more straightforward to stick with the fallback to client. like the dynamodb and glue catalog implementations?

I agree esp this separates fileio properties from catalog properties. we should treat s3 tables as S3TablesCatalog

Copy link
Contributor

@HonahX HonahX Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s3 (FileIO) separate from s3tables (Catalog).

Good point. I agree that we need to reserve the s3tables for S3TablesCatalog. My concern is that when configuring credentials, users will probably always use client.* prefix since technically s3tables catalog and s3 file io never use different set of credentials. In this case, do we really need another set of keys to configure credentials for s3tables only? Would love to hear what you think.

Maybe it's more straightforward to stick with the fallback to client. like the dynamodb and glue catalog implementations?

I also agree. Thinking back, bringing s3. here seems to create more confusion/complexities than value. It is more clear that we stick to use client.* to represent shared properties across clients (glue, s3, s3tables).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

users will probably always use client.* prefix since technically s3tables catalog and s3 file io never use different set of credentials

yea the only feature of having s3tables. is the ability to assign a different credential from s3 file io. i think we should allow this use case but can recommend users to just set client. to take care of both file io and catalog configs. similar to the glue "Client-specific Properties" section https://py.iceberg.apache.org/configuration/#glue-catalog

pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
pyiceberg/catalog/s3tables.py Outdated Show resolved Hide resolved
@kevinjqliu
Copy link
Contributor

can you run poetry lock --no-update for CI?

@felixscherz felixscherz force-pushed the feat/s3tables-catalog branch from 894cbc9 to 2e1c383 Compare January 8, 2025 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants