Docs: Location Provider Documentation #1537
base: main
Changes from 3 commits
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -54,15 +54,18 @@ Iceberg tables support table properties to configure table behavior. | |||||
|
||||||
### Write options | ||||||
|
||||||
| Key | Options | Default | Description | | ||||||
| -------------------------------------- | --------------------------------- | ------- | ------------------------------------------------------------------------------------------- | | ||||||
| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression codec. | | ||||||
| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg | | ||||||
| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group | | ||||||
| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk | | ||||||
| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | | ||||||
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||||||
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||||||
| Key | Options | Default | Description | | ||||||
|------------------------------------------|-----------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------| | ||||||
| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression codec. | | ||||||
| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg | | ||||||
| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group | | ||||||
| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk | | ||||||
| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | | ||||||
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||||||
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||||||
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | | ||||||
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | | ||||||
Review comment: hyperlinks are weird sometimes, can you make sure that these work as intended
Reply: I've checked all hyperlinks on the current version and they work as intended
||||||
| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | | ||||||
Review comment: Unsure about what
Reply: maybe similar to the custom catalog
Reply: Don't love how this looks. I prefer what it is now: I've changed the section linked above to have
Reply: (The above screenshot also shows how code/backticks hyperlinks look, I think they're fine. This is now relevant because of #1537 (comment).)
||||||
|
||||||
### Table behavior options | ||||||
|
||||||
|
@@ -195,6 +198,86 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya | |||||
|
||||||
<!-- markdown-link-check-enable--> | ||||||
|
||||||
## Location Providers | ||||||
|
||||||
Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg | ||||||
|
||||||
introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via | ||||||
table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider), | ||||||
which generates file paths that are optimized for object storage. | ||||||
|
||||||
### SimpleLocationProvider | ||||||
|
||||||
The SimpleLocationProvider places a table's data files underneath a `data` directory in the table's storage location. For example, | ||||||
Review comment: I realised I was wrong in #1510 (comment) about docs not needing to change when But I think this is fine because the change will be small - it'd just be "
Review comment: I think we should also call out that the "base location" is the From the spec, https://iceberg.apache.org/spec/#table-metadata-fields
|
||||||
a non-partitioned table might have a data file with location: | ||||||
|
||||||
```txt | ||||||
s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||||||
``` | ||||||
|
||||||
When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key | ||||||
Review comment: nit: maybe also when the
||||||
and value as the directory name. For example, a table partitioned over a string column `category` might have a data file | ||||||
with location: | ||||||
|
||||||
```txt | ||||||
s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||||||
``` | ||||||
|
||||||
The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table | ||||||
property to `False`. | ||||||
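
The layout described above can be sketched in a few lines. This is a minimal illustration, not PyIceberg's implementation: the function name `simple_data_location` and the plain-string `partition_path` argument are assumptions made for the example.

```python
from typing import Optional


def simple_data_location(table_location: str, data_file_name: str, partition_path: Optional[str] = None) -> str:
    """Build a data file path under the table's `data` directory (illustrative only)."""
    prefix = f"{table_location.rstrip('/')}/data"
    if partition_path:
        # Partitioned table: group files under a `key=value` subdirectory.
        return f"{prefix}/{partition_path}/{data_file_name}"
    # Non-partitioned table: the file sits directly under `data`.
    return f"{prefix}/{data_file_name}"


file_name = "0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet"
print(simple_data_location("s3://bucket/ns/table", file_name))
print(simple_data_location("s3://bucket/ns/table", file_name, "category=orders"))
```

The two printed paths correspond to the non-partitioned and partitioned examples shown in this section.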
|
||||||
### ObjectStoreLocationProvider | ||||||
Review comment: There's a lot of natural duplication between this section and https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout. I've gone less in-depth here though. I was unsure whether to link to this webpage (and if so, how to word it) because there's a lot that's not relevant to us, e.g.
and
Reply: i think we should link to that for additional context
||||||
|
||||||
When several files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3), | ||||||
resulting in slowdowns. | ||||||
|
||||||
The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories, | ||||||
into file paths, to distribute files across a larger number of object store prefixes. | ||||||
|
||||||
Paths contain partitions just before the file name and a `data` directory beneath the table's location, in a similar | ||||||
Review comment: See #1537 (comment) re
||||||
manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). For example, a table partitioned over a string | ||||||
column `category` might have a data file with location: (note the additional binary directories) | ||||||
|
||||||
```txt | ||||||
s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||||||
``` | ||||||
|
||||||
The `write.object-storage.enabled` table property determines whether the ObjectStoreLocationProvider is enabled for a | ||||||
table. It is used by default. | ||||||
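
How the table properties select a provider can be sketched as follows. This is a hedged illustration based only on the property names and defaults documented in this diff; the function `choose_location_provider`, its string return values, and the precedence of the custom implementation over the object-storage flag are assumptions, not PyIceberg's actual loader.

```python
from typing import Dict


def choose_location_provider(properties: Dict[str, str]) -> str:
    """Return the name of the provider a table would use (illustrative only)."""
    # Assumption: a custom implementation, if configured, takes precedence.
    custom_impl = properties.get("write.py-location-provider.impl")
    if custom_impl:
        return custom_impl
    # Per the table above, object storage is enabled unless explicitly set to False.
    enabled = str(properties.get("write.object-storage.enabled", "True"))
    if enabled.lower() == "true":
        return "ObjectStoreLocationProvider"
    return "SimpleLocationProvider"


print(choose_location_provider({}))
print(choose_location_provider({"write.object-storage.enabled": "False"}))
```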
|
||||||
#### Partition Exclusion | ||||||
|
||||||
When the ObjectStoreLocationProvider is used, the table property `write.object-storage.partitioned-paths`, which | ||||||
defaults to `True`, can be set to `False` as an additional optimization for object stores. This omits partition keys and | ||||||
values from data file paths *entirely* to further reduce key size. With it disabled, the same data file above would | ||||||
instead be written to: (note the absence of `category=orders`) | ||||||
|
||||||
```txt | ||||||
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||||||
``` | ||||||
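
The effect of hash injection and partition exclusion can be sketched as below. Several details are assumptions made for illustration: the 20-bit hash split into 4/4/4/8-bit directories is inferred from the example paths, SHA-256 from the standard library stands in for whatever hash function the real provider uses, and the function names and string `partition_path` argument are hypothetical. The `/` separator for a non-partitioned file with partitioned paths enabled is also an assumption.

```python
import hashlib
from typing import Optional

HASH_BITS = 20  # assumption: 4 + 4 + 4 + 8 bits, matching the example paths
DIR_BITS = 4
DIR_DEPTH = 3


def hash_directories(key: str) -> str:
    """Deterministically map a key to binary directories like 0101/0110/1001/10110010."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 24 bits of the digest and keep the top HASH_BITS of them.
    value = int.from_bytes(digest[:3], "big") >> (24 - HASH_BITS)
    bits = format(value, f"0{HASH_BITS}b")
    split = DIR_BITS * DIR_DEPTH
    parts = [bits[i:i + DIR_BITS] for i in range(0, split, DIR_BITS)]
    parts.append(bits[split:])  # remaining bits form the last component
    return "/".join(parts)


def object_store_data_location(
    table_location: str,
    data_file_name: str,
    partition_path: Optional[str] = None,
    include_partition_paths: bool = True,
) -> str:
    """Build a hashed data file path, optionally omitting partition values."""
    prefix = f"{table_location.rstrip('/')}/data"
    if include_partition_paths and partition_path:
        name = f"{partition_path}/{data_file_name}"
    else:
        name = data_file_name
    # With partition paths excluded, the hash is joined to the file name by a hyphen.
    separator = "/" if include_partition_paths else "-"
    return f"{prefix}/{hash_directories(name)}{separator}{name}"


file_name = "0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet"
print(object_store_data_location("s3://bucket/ns/table", file_name, "category=orders"))
print(object_store_data_location("s3://bucket/ns/table", file_name, "category=orders", include_partition_paths=False))
```

Because the stand-in hash differs from the real one, the binary directories printed here will not match the examples in the documentation; only the path shape is meant to correspond.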
Review comment (on lines +259 to +261): nit: what about giving an example of this set to True and another one set to False
Reply: I have the False case just above ("the same data file above" here) - or do you mean making that more explicit?
Reply: i think given the new wording above, it's clear now :) thanks!
||||||
|
||||||
### Loading a Custom LocationProvider | ||||||
||||||
|
||||||
Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base | ||||||
class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The | ||||||
Review comment: I wanted to link to this, and this works for me locally, but I get the following warning when serving docs locally:
But I get a similar warning elsewhere:
Reply: yea i think it's fine, as long as the hyperlink works when you run it locally
||||||
table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom | ||||||
LocationProvider (e.g. `module.CustomLocationProvider`). Recall that a LocationProvider is configured per table, | ||||||
permitting different location provision for different tables. | ||||||
Review comment: also mention that java uses
||||||
|
||||||
An example custom `LocationProvider` implementation is shown below. | ||||||
|
||||||
```py
import uuid

# Imports of LocationProvider, Properties, PartitionKey and Optional are omitted for conciseness


class UUIDLocationProvider(LocationProvider):
    def __init__(self, table_location: str, table_properties: Properties):
        super().__init__(table_location, table_properties)

    def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
        # Can use any custom method to generate a file path given the partitioning information and file name
        prefix = f"{self.table_location}/{uuid.uuid4()}"
        return f"{prefix}/{partition_key.to_path()}/{data_file_name}" if partition_key else f"{prefix}/{data_file_name}"
```
Review comment (on the `import uuid` line): I've only shown this import for conciseness.
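
To show how a `module.ClassName` string could be resolved to a provider, here is a self-contained sketch that runs without PyIceberg installed. Everything in it is a stand-in: the `LocationProvider` base class mimics only the constructor and method described above, `partition_path` is a plain string rather than a `PartitionKey`, `load_custom_provider` is a hypothetical helper rather than PyIceberg's loader, and `my_providers` is a made-up module name registered in `sys.modules` purely so the dotted path can be imported in a single file.

```python
import importlib
import sys
import types
import uuid
from abc import ABC, abstractmethod
from typing import Dict, Optional


class LocationProvider(ABC):
    """Hypothetical stand-in for PyIceberg's abstract LocationProvider."""

    def __init__(self, table_location: str, table_properties: Dict[str, str]):
        self.table_location = table_location
        self.table_properties = table_properties

    @abstractmethod
    def new_data_location(self, data_file_name: str, partition_path: Optional[str] = None) -> str:
        """Return the full path for a new data file."""


class UUIDLocationProvider(LocationProvider):
    def new_data_location(self, data_file_name: str, partition_path: Optional[str] = None) -> str:
        # Each call places the file under a fresh random UUID directory.
        prefix = f"{self.table_location}/{uuid.uuid4()}"
        return f"{prefix}/{partition_path}/{data_file_name}" if partition_path else f"{prefix}/{data_file_name}"


def load_custom_provider(table_location: str, table_properties: Dict[str, str]) -> LocationProvider:
    """Resolve `write.py-location-provider.impl` to a class and instantiate it."""
    impl = table_properties["write.py-location-provider.impl"]
    module_name, class_name = impl.rsplit(".", 1)
    provider_class = getattr(importlib.import_module(module_name), class_name)
    if not issubclass(provider_class, LocationProvider):
        raise TypeError(f"{impl} is not a LocationProvider subclass")
    return provider_class(table_location, table_properties)


# Register the class under a made-up module name so the dotted path is importable here.
demo_module = types.ModuleType("my_providers")
demo_module.UUIDLocationProvider = UUIDLocationProvider
sys.modules["my_providers"] = demo_module

properties = {"write.py-location-provider.impl": "my_providers.UUIDLocationProvider"}
provider = load_custom_provider("s3://bucket/ns/table", properties)
print(provider.new_data_location("00001.parquet", "category=orders"))
```

In a real project the class would live in an ordinary importable module, and the `sys.modules` registration would not be needed.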
|
||||||
## Catalogs | ||||||
|
||||||
PyIceberg currently has native catalog type support for REST, SQL, Hive, Glue and DynamoDB. | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,7 +30,12 @@ | |
|
||
|
||
class LocationProvider(ABC): | ||
"""A base class for location providers, that provide data file locations for write tasks.""" | ||
"""A base class for location providers, that provide data file locations for a table's write tasks. | ||
||
Args: | ||
table_location (str): The table's base storage location. | ||
table_properties (Properties): The table's properties. | ||
""" | ||
|
||
table_location: str | ||
table_properties: Properties | ||
|
Review comment: (Same wording, more or less, as https://iceberg.apache.org/docs/latest/configuration/#write-properties)
Review comment: maybe we should add a warning or something about how this default differs from the java implementation