feat: support S3 Table Buckets with S3TablesCatalog #1429

Open
wants to merge 60 commits into
base: main
60 commits
0812e33
feat: initial setup for S3TablesCatalog
felixscherz Dec 14, 2024
ba172bb
feat: support create_table using FsspecFileIO
felixscherz Dec 14, 2024
db70192
feat: implement drop_table
felixscherz Dec 14, 2024
324de35
feat: implement drop_namespace
felixscherz Dec 14, 2024
3c8fe2c
test: validate how version conflict is handled with s3tables API
felixscherz Dec 15, 2024
e1772f7
feat: implement commit_table
felixscherz Dec 15, 2024
ec53f20
feat: implement table_exists
felixscherz Dec 18, 2024
d660fce
feat: implement list_tables
felixscherz Dec 18, 2024
5530073
refactor: improve list_namespace
felixscherz Dec 18, 2024
0399a94
fix: return Identifier from list_tables
felixscherz Dec 18, 2024
89764c8
feat: implement rename table
felixscherz Dec 21, 2024
8fd1946
feat: implement load_namespace_properties
felixscherz Dec 21, 2024
a2aae10
refactor: move some methods around
felixscherz Dec 21, 2024
287d536
feat: raise NotImplementedError for views functionality
felixscherz Dec 21, 2024
954d119
feat: raise NotImplementedError for purge_table
felixscherz Dec 21, 2024
f3d2b2f
feat: raise NotImplementedError for update_namespace_properties
felixscherz Dec 21, 2024
202e98b
feat: raise NotImplementedError for register_table
felixscherz Dec 21, 2024
1731fcb
fix: don't override create_table_transaction
felixscherz Dec 21, 2024
088cf0e
chore: run formatter
felixscherz Dec 21, 2024
8bc01a2
feat: raise exceptions if boto3 doesn't support s3tables
felixscherz Dec 23, 2024
9b8f0bd
feat: make endpoint configurable
felixscherz Dec 23, 2024
4973087
feat: explicitly configure tableBucketARN
felixscherz Dec 23, 2024
ce305fe
fix: remove defaulting to FsspecIO
felixscherz Dec 23, 2024
f5bc5cd
feat: raise exceptions for invalid namespace/table name
felixscherz Dec 23, 2024
5666362
feat: improve error handling for create_table
felixscherz Dec 29, 2024
38f907f
feat: improve error handling for delete_table
felixscherz Dec 29, 2024
db4303d
chore: cleanup comments
felixscherz Dec 29, 2024
6b8bfd0
feat: catch missing metadata for load_table
felixscherz Dec 29, 2024
a8ef69f
feat: handle missing namespace and preexisting table
felixscherz Dec 29, 2024
8833fcf
feat: handle versionToken and table in an atomic operation
felixscherz Dec 29, 2024
7491b62
chore: run formatter
felixscherz Dec 29, 2024
4640492
chore: add type hints for tests
felixscherz Dec 29, 2024
1c7aeb7
fix: no longer enforce FsspecFileIO
felixscherz Jan 4, 2025
2bafb1a
test: remove tests for boto3 behavior
felixscherz Jan 4, 2025
0937f3e
test: verify column was created on commit
felixscherz Jan 4, 2025
848bc07
test: verify new data can be committed to table
felixscherz Jan 4, 2025
b18601a
docs: update documentation for create_table
felixscherz Jan 5, 2025
69c1856
test: set AWS regions explicitly
felixscherz Jan 5, 2025
6de777e
Apply suggestions from code review
felixscherz Jan 6, 2025
20f09ef
test: commit new data to table
felixscherz Jan 6, 2025
589df88
feat: clarify update_namespace_properties error
felixscherz Jan 6, 2025
5455202
feat: raise error when setting custom namespace properties
felixscherz Jan 6, 2025
121b19f
refactor: change S3TableCatalog -> S3TablesCatalog
felixscherz Jan 7, 2025
eca2186
feat: raise error on specified table location
felixscherz Jan 7, 2025
42245bf
feat: return empty list when querying a hierarchical namespace
felixscherz Jan 7, 2025
5977c23
refactor: use get_table_metadata_location instead of get_table
felixscherz Jan 7, 2025
2dbac34
refactor: extract 'ICEBERG' table format into constant
felixscherz Jan 7, 2025
ba76c15
feat: change s3tables.table-bucket-arn -> s3tables.warehouse
felixscherz Jan 7, 2025
2ff29a5
Apply suggestions from code review
felixscherz Jan 7, 2025
1afaa8c
feat: add link to naming-rules for invalid name errors
felixscherz Jan 7, 2025
94ce254
feat: delete s3 table if writing new_table_metadata is unsuccessful
felixscherz Jan 7, 2025
61d0e05
chore: run linter
felixscherz Jan 7, 2025
296b14e
test: rename test_s3tables.py -> integration_test_s3tables.py
felixscherz Jan 7, 2025
eb71a1f
fix: add license to files
felixscherz Jan 8, 2025
6eb24c9
fix: raise error when creating a table during a transaction
felixscherz Jan 9, 2025
54b8e87
test: mark create_table_transaction test with xfail
felixscherz Jan 9, 2025
e9e2cf7
feat: raise NotImplementedError for view_exists
felixscherz Feb 1, 2025
cf020b3
test: use moto server for s3tables tests
felixscherz Feb 2, 2025
5b0e622
docs: add s3tables catalog
felixscherz Feb 2, 2025
f30b7e6
chore: bump moto library
felixscherz Feb 3, 2025
98 changes: 98 additions & 0 deletions mkdocs/docs/configuration.md
@@ -524,6 +524,104 @@ catalog:

<!-- prettier-ignore-end -->

### S3Tables Catalog

The S3Tables Catalog uses the catalog functionality of the Amazon S3Tables service and requires an existing S3 Tables Bucket to operate.

To use Amazon S3Tables as your catalog, configure PyIceberg using one of the following methods. Also refer to the [AWS documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) on configuring credentials to set up your AWS account credentials locally.

If you intend to use the same credentials for both the S3Tables Catalog and S3 FileIO, you can configure the [`client.*` properties](configuration.md#unified-aws-credentials) to streamline the process.

Note that the S3Tables Catalog manages the underlying table locations internally, which makes it incompatible with S3-like storage systems such as MinIO. If you specify the `s3tables.endpoint`, ensure that the `s3.endpoint` is configured accordingly.

```yaml
catalog:
  default:
    type: s3tables
    warehouse: arn:aws:s3tables:us-east-1:012345678901:bucket/pyiceberg-catalog
```
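The `warehouse` value above is the ARN of the table bucket. To illustrate its shape, the sketch below splits such an ARN into its parts using only the standard library. The helper `parse_table_bucket_arn`, the `TableBucketArn` type, and the regular expression are assumptions made for this example and are not part of PyIceberg; the pattern is inferred from the example ARN above, not from the official AWS grammar.

```python
import re
from typing import NamedTuple


class TableBucketArn(NamedTuple):
    region: str
    account_id: str
    bucket_name: str


# Pattern inferred from the example ARN in this section (an assumption,
# not the official AWS grammar for table bucket ARNs).
_ARN_PATTERN = re.compile(
    r"^arn:aws:s3tables:(?P<region>[a-z0-9-]+):(?P<account>\d{12}):bucket/(?P<bucket>[a-z0-9-]+)$"
)


def parse_table_bucket_arn(arn: str) -> TableBucketArn:
    """Hypothetical helper: split a table bucket ARN into region, account id, and bucket name."""
    match = _ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a table bucket ARN: {arn!r}")
    return TableBucketArn(match["region"], match["account"], match["bucket"])


parsed = parse_table_bucket_arn(
    "arn:aws:s3tables:us-east-1:012345678901:bucket/pyiceberg-catalog"
)
print(parsed.region, parsed.bucket_name)
```

A check like this can catch a mistyped ARN before any request is sent, but whether the catalog itself performs such validation is not stated here.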

If you prefer to pass the credentials explicitly to the client instead of relying on environment variables, set the `s3tables.*` properties:

```yaml
catalog:
  default:
    type: s3tables
    s3tables.access-key-id: <ACCESS_KEY_ID>
    s3tables.secret-access-key: <SECRET_ACCESS_KEY>
    s3tables.session-token: <SESSION_TOKEN>
    s3tables.region: <REGION_NAME>
    s3tables.endpoint: http://localhost:9000
    s3.endpoint: http://localhost:9000
```

<!-- prettier-ignore-start -->

!!! Note "Client-specific Properties"
    `s3tables.*` properties are for S3TablesCatalog only. If you want to use the same credentials for both S3TablesCatalog and S3 FileIO, you can set the `client.*` properties. See the [Unified AWS Credentials](configuration.md#unified-aws-credentials) section for more details.

<!-- prettier-ignore-end -->
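The relationship between `s3tables.*` and `client.*` properties can be pictured as a first-match lookup. The following is a minimal sketch of that idea, not the catalog's actual code: the helper `first_property` is hypothetical, and the sketch assumes that the catalog-specific key takes precedence when both are set.

```python
from typing import Optional


def first_property(properties: dict[str, str], *keys: str) -> Optional[str]:
    """Hypothetical helper: return the value of the first key that is present and non-empty."""
    for key in keys:
        value = properties.get(key)
        if value:
            return value
    return None


props = {
    "client.region": "us-east-1",
    "client.access-key-id": "admin",
    "s3tables.region": "eu-west-1",
}

# Assumed precedence: the catalog-specific key is tried before the shared client.* key.
region = first_property(props, "s3tables.region", "client.region")
access_key = first_property(props, "s3tables.access-key-id", "client.access-key-id")
missing = first_property(props, "s3tables.session-token", "client.session-token")
print(region, access_key, missing)
```

Under this assumption, setting only `client.*` values configures both the catalog and the FileIO, while an `s3tables.*` value overrides the shared one for the catalog alone.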

<!-- markdown-link-check-disable -->

| Key | Example | Description |
| -------------------------- | ------------------- | -------------------------------------------------------------------------- |
| s3tables.profile-name | default | Configure the static profile used to access the S3Tables Catalog |
| s3tables.region | us-east-1 | Set the region of the S3Tables Catalog |
| s3tables.access-key-id | admin | Configure the static access key id used to access the S3Tables Catalog |
| s3tables.secret-access-key | password | Configure the static secret access key used to access the S3Tables Catalog |
| s3tables.session-token | AQoDYXdzEJr... | Configure the static session token used to access the S3Tables Catalog |
| s3tables.endpoint | <http://localhost>... | Configure the AWS endpoint |
| s3tables.warehouse | arn:aws:s3tables... | Set the underlying S3 Table Bucket |

<!-- markdown-link-check-enable-->

<!-- prettier-ignore-start -->

!!! warning "Removed Properties"
    The properties `profile_name`, `region_name`, `aws_access_key_id`, `aws_secret_access_key`, and `aws_session_token` were deprecated and removed in 0.8.0.

<!-- prettier-ignore-end -->

An example usage of the S3Tables Catalog is shown below:

```python
from pyiceberg.catalog.s3tables import S3TablesCatalog
import pyarrow as pa


table_bucket_arn: str = "..."
aws_region: str = "..."

properties = {"s3tables.warehouse": table_bucket_arn, "s3tables.region": aws_region}
catalog = S3TablesCatalog(name="s3tables_catalog", **properties)

database_name = "prod"

catalog.create_namespace(namespace=database_name)

pyarrow_table = pa.Table.from_arrays(
    [
        pa.array([None, "A", "B", "C"]),
        pa.array([1, 2, 3, 4]),
        pa.array([True, None, False, True]),
        pa.array([None, "A", "B", "C"]),
    ],
    schema=pa.schema(
        [
            pa.field("foo", pa.large_string(), nullable=True),
            pa.field("bar", pa.int32(), nullable=False),
            pa.field("baz", pa.bool_(), nullable=True),
            pa.field("large", pa.large_string(), nullable=True),
        ]
    ),
)

identifier = (database_name, "orders")
table = catalog.create_table(identifier=identifier, schema=pyarrow_table.schema)
table.append(pyarrow_table)
```

### Custom Catalog Implementations

If you want to load any custom catalog implementation, you can set catalog configurations like the following:
9 changes: 5 additions & 4 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

15 changes: 13 additions & 2 deletions pyiceberg/catalog/__init__.py
@@ -116,6 +116,7 @@ class CatalogType(Enum):
     GLUE = "glue"
     DYNAMODB = "dynamodb"
     SQL = "sql"
+    S3TABLES = "s3tables"


 def load_rest(name: str, conf: Properties) -> Catalog:
@@ -162,12 +163,22 @@ def load_sql(name: str, conf: Properties) -> Catalog:
         ) from exc


+def load_s3tables(name: str, conf: Properties) -> Catalog:
+    try:
+        from pyiceberg.catalog.s3tables import S3TablesCatalog
+
+        return S3TablesCatalog(name, **conf)
+    except ImportError as exc:
+        raise NotInstalledError("AWS S3Tables support not installed: pip install 'pyiceberg[s3tables]'") from exc
+
+
 AVAILABLE_CATALOGS: dict[CatalogType, Callable[[str, Properties], Catalog]] = {
     CatalogType.REST: load_rest,
     CatalogType.HIVE: load_hive,
     CatalogType.GLUE: load_glue,
     CatalogType.DYNAMODB: load_dynamodb,
     CatalogType.SQL: load_sql,
+    CatalogType.S3TABLES: load_s3tables,
 }


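The hunk above registers a loader that imports its backend lazily and turns a missing optional dependency into a `NotInstalledError`. The standalone sketch below reproduces that pattern with the standard library only. All names in it (`CatalogKind`, `LOADERS`, `load_catalog`, the local `NotInstalledError`, and the deliberately absent module `hypothetical_backend_that_is_not_installed`) are stand-ins invented for this example, not PyIceberg APIs.

```python
import importlib
from enum import Enum
from typing import Callable


class NotInstalledError(Exception):
    """Stand-in for the error raised when an optional dependency is missing."""


class CatalogKind(Enum):
    JSON = "json"
    MISSING = "missing"


def load_json(name: str) -> str:
    # The import happens only when this loader is called, not at module import time.
    try:
        module = importlib.import_module("json")
    except ImportError as exc:
        raise NotInstalledError("json support not installed") from exc
    return f"{name}:{module.__name__}"


def load_missing(name: str) -> str:
    # Hypothetical backend module that does not exist, to exercise the error path.
    try:
        importlib.import_module("hypothetical_backend_that_is_not_installed")
    except ImportError as exc:
        raise NotInstalledError("backend not installed: pip install 'example[backend]'") from exc
    return name


# Registry mapping each enum member to its loader, as AVAILABLE_CATALOGS does.
LOADERS: dict[CatalogKind, Callable[[str], str]] = {
    CatalogKind.JSON: load_json,
    CatalogKind.MISSING: load_missing,
}


def load_catalog(name: str, kind: str) -> str:
    """Look up the loader by its string type and call it."""
    return LOADERS[CatalogKind(kind)](name)


print(load_catalog("demo", "json"))
```

Keeping the import inside the loader means that users who never select a given catalog type do not need its optional dependencies installed.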
@@ -914,8 +925,8 @@ def _get_default_warehouse_location(self, database_name: str, table_name: str) -
         raise ValueError("No default path is set, please specify a location when creating a table")

     @staticmethod
-    def _write_metadata(metadata: TableMetadata, io: FileIO, metadata_path: str) -> None:
-        ToOutputFile.table_metadata(metadata, io.new_output(metadata_path))
+    def _write_metadata(metadata: TableMetadata, io: FileIO, metadata_path: str, overwrite: bool = False) -> None:
+        ToOutputFile.table_metadata(metadata, io.new_output(metadata_path), overwrite=overwrite)

     @staticmethod
     def _get_metadata_location(location: str, new_version: int = 0) -> str:
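The change to `_write_metadata` adds an `overwrite` flag that defaults to `False`. As a rough analogy using local files, the sketch below shows what such a flag typically controls: a write that refuses to replace an existing file unless the caller opts in. The helper `write_text` and the file name are hypothetical, and the sketch does not claim to reflect how any particular `FileIO` implementation behaves.

```python
import tempfile
from pathlib import Path


def write_text(path: Path, content: str, overwrite: bool = False) -> None:
    """Hypothetical helper: write content, refusing to replace an existing file by default."""
    # Mode "x" creates the file and fails if it already exists; "w" truncates it.
    mode = "w" if overwrite else "x"
    with open(path, mode, encoding="utf-8") as f:
        f.write(content)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "00000.metadata.json"
    write_text(target, "v1")

    # A second write without overwrite=True is rejected.
    try:
        write_text(target, "v2")
        refused = False
    except FileExistsError:
        refused = True

    # Opting in replaces the existing content.
    write_text(target, "v3", overwrite=True)
    final_content = target.read_text(encoding="utf-8")

print(refused, final_content)
```

Defaulting to `False` keeps existing callers on the safer behavior, so only code paths that explicitly pass `overwrite=True` can replace an existing metadata file.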