Documentation for Location Providers
Sreesh Maheshwar committed Jan 18, 2025
1 parent 59a0b37 commit ef82985
Showing 2 changed files with 65 additions and 1 deletion.
59 changes: 59 additions & 0 deletions mkdocs/docs/configuration.md
@@ -54,6 +54,8 @@ Iceberg tables support table properties to configure table behavior.

### Write options

***TODO:*** Add LocationProvider-related properties here.

| Key | Options | Default | Description |
| -------------------------------------- | --------------------------------- | ------- | ------------------------------------------------------------------------------------------- |
| `write.parquet.compression-codec`      | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression codec.                                                         |
@@ -195,6 +197,63 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya

<!-- markdown-link-check-enable-->

## Location Providers

Iceberg uses a LocationProvider to determine the file paths for a table's data files. PyIceberg introduces a pluggable
LocationProvider module; the LocationProvider to use can be specified on a per-table basis via table properties.
PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider), which generates
file paths that are optimised for object storage.

### SimpleLocationProvider

The SimpleLocationProvider places a table's data files underneath a `data` directory in the table's storage location.
For example, a non-partitioned table might have a data file with location:

```txt
s3://my-bucket/my_table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

When data is partitioned, the files under a given partition are grouped into a subdirectory, with that partition key
and value as the directory name. For example, a table partitioned over a string column `category` might have a data file
with location:

```txt
s3://my-bucket/my_table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```
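
The layout described above can be sketched in a few lines of Python. This is only an illustration of the path structure: `simple_data_location` and its parameters are hypothetical names chosen for this example, not PyIceberg's API.

```python
from typing import Optional


def simple_data_location(
    table_location: str,
    data_file_name: str,
    partition_path: Optional[str] = None,
) -> str:
    """Build a data file path in the layout described above (hypothetical helper)."""
    # Every data file lives under "<table location>/data".
    prefix = f"{table_location.rstrip('/')}/data"
    if partition_path:
        # Partitioned tables add a "key=value" subdirectory before the file name.
        return f"{prefix}/{partition_path}/{data_file_name}"
    return f"{prefix}/{data_file_name}"


file_name = "0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet"
unpartitioned = simple_data_location("s3://my-bucket/my_table", file_name)
partitioned = simple_data_location("s3://my-bucket/my_table", file_name, "category=orders")
print(unpartitioned)
print(partitioned)
```

The two printed paths match the non-partitioned and partitioned examples shown above.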

The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`.

### ObjectStoreLocationProvider

When many files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3),
resulting in slowdowns.

The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories,
into file paths, to distribute files across a larger number of object store prefixes.

Partitions are included in file paths just before the file name, in a similar manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider).
A table partitioned over a string column `category` might have a data file with the following location (note the
additional binary directories):

```txt
s3://my-bucket/my_table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```
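
To give a feel for how deterministic binary directories might be derived, here is a minimal sketch. It makes several assumptions that the text above does not confirm: SHA-256 is a stand-in hash chosen only because it is deterministic and in the standard library, the 20-bit width and the 4/4/4/8 split are read off the example path, and `binary_dirs` is a hypothetical helper name.

```python
import hashlib

HASH_BITS = 20  # assumption: the example path contains 20 binary digits
DIR_LENGTH = 4  # assumption: width of each leading directory
DIR_DEPTH = 3   # assumption: number of fixed-width leading directories


def binary_dirs(name: str) -> str:
    """Derive deterministic binary directories from a name (illustrative only)."""
    # Stand-in hash: SHA-256 is used purely for determinism. It is not
    # claimed to be the hash function PyIceberg uses.
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big") & ((1 << HASH_BITS) - 1)
    bits = format(value, f"0{HASH_BITS}b")  # zero-padded binary string
    fixed = DIR_LENGTH * DIR_DEPTH
    parts = [bits[i : i + DIR_LENGTH] for i in range(0, fixed, DIR_LENGTH)]
    parts.append(bits[fixed:])  # the remaining bits form the last directory
    return "/".join(parts)


dirs = binary_dirs("category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet")
print(dirs)  # three 4-digit directories followed by one 8-digit directory
```

Because the same input always yields the same directories, a file's location can be recomputed from its name, while different names spread across many prefixes.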

The `write.object-storage.enabled` table property determines whether the ObjectStoreLocationProvider is enabled for a
table. This provider is used by default.
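
The selection rule can be summarised with a short sketch. It assumes that table properties are held as a string-to-string mapping and that an unset `write.object-storage.enabled` behaves like `true`; `provider_name` is a hypothetical helper, not part of PyIceberg.

```python
from typing import Dict

OBJECT_STORE_ENABLED = "write.object-storage.enabled"


def provider_name(table_properties: Dict[str, str]) -> str:
    """Name the provider a table would use under the rules described above."""
    # Assumption: values are strings, and a missing property means "true".
    raw = table_properties.get(OBJECT_STORE_ENABLED, "true")
    enabled = raw.strip().lower() == "true"
    return "ObjectStoreLocationProvider" if enabled else "SimpleLocationProvider"


print(provider_name({}))                               # ObjectStoreLocationProvider
print(provider_name({OBJECT_STORE_ENABLED: "false"}))  # SimpleLocationProvider
```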

When the ObjectStoreLocationProvider is used, the table property `write.object-storage.partitioned-paths`, which
defaults to `true`, can be set to `false` as an additional optimisation. This omits partition keys and values from data
file paths *entirely* to further reduce key size. With partitioned paths disabled, the data file above would instead be
written to a location such as the following (note the absence of `category=orders`):

```txt
s3://my-bucket/my_table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```
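
The effect of `write.object-storage.partitioned-paths` on the final path can be sketched as follows. The binary directories are passed in rather than computed, and `object_store_location` is a hypothetical helper written for this illustration. Joining the last binary segment to the file name with `-` is inferred from the example above, not taken from the implementation.

```python
from typing import Optional


def object_store_location(
    table_location: str,
    binary_dirs: str,
    data_file_name: str,
    partition_path: Optional[str] = None,
    partitioned_paths: bool = True,
) -> str:
    """Assemble an object-store style data file path (illustrative only)."""
    prefix = f"{table_location}/data/{binary_dirs}"
    if partitioned_paths:
        # Partition directories, if any, sit just before the file name.
        middle = f"/{partition_path}" if partition_path else ""
        return f"{prefix}{middle}/{data_file_name}"
    # Partition directories are dropped, and the last binary segment is
    # joined to the file name with "-" instead of "/".
    return f"{prefix}-{data_file_name}"


file_name = "0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet"
with_partitions = object_store_location(
    "s3://my-bucket/my_table", "0101/0110/1001/10110010", file_name, "category=orders"
)
without_partitions = object_store_location(
    "s3://my-bucket/my_table", "1101/0100/1011/00111010", file_name, "category=orders",
    partitioned_paths=False,
)
print(with_partitions)
print(without_partitions)
```

The binary directory values are copied from the examples above purely as inputs; a real provider would compute them from the file path.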

### Loading a Custom LocationProvider

***TODO***. Maybe link to code reference for LocationProvider?

## Catalogs

PyIceberg currently has native catalog type support for REST, SQL, Hive, Glue and DynamoDB.
7 changes: 6 additions & 1 deletion pyiceberg/table/locations.py
@@ -30,7 +30,12 @@


 class LocationProvider(ABC):
-    """A base class for location providers, that provide data file locations for write tasks."""
+    """A base class for location providers, that provide data file locations for a table's write tasks.
+
+    Args:
+        table_location (str): The table's base storage location.
+        table_properties (Properties): The table's properties.
+    """

     table_location: str
     table_properties: Properties
