-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Docs: Location Provider Documentation #1537
base: main
Are you sure you want to change the base?
Docs: Location Provider Documentation #1537
Conversation
mkdocs/docs/configuration.md
Outdated
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | | ||
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | | ||
| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure about what Options
to put here. Happy to take suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe similar to the custom catalog
https://py.iceberg.apache.org/configuration/#custom-catalog-implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't love how this looks. I prefer what it is now:
I've changed the section linked above to have mymodule.MyLocationProvider
in custom Catalog style. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The above screenshot also shows how code/backticks hyperlinks look, I think they're fine. This is now relevant because of #1537 (comment).
mkdocs/docs/configuration.md
Outdated
|
||
### SimpleLocationProvider | ||
|
||
The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realised I was wrong in #1510 (comment) about docs not needing to change when write.data.path
is supported. I think we do want to say that data files are under a data
directory, here and below.
But I think this is fine because the change will be small - it'd just be "write.data.path
" instead of "data
directory under the table's location" throughout. write.data.path
defaults to this.
mkdocs/docs/configuration.md
Outdated
|
||
The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`. | ||
|
||
### ObjectStoreLocationProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of natural duplication between this section and https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout. I've gone less in-depth here though.
I was unsure whether to link to this webpage (and if so, how to word it) because there's a lot that's not relevant to us, e.g.
Note, the path resolution logic for ObjectStoreLocationProvider is write.data.path then /data. However, for the older versions up to 0.12.0, the logic is as follows: - before 0.12.0, write.object-storage.path must be set. - at 0.12.0, write.object-storage.path then write.folder-storage.path then /data.
and
Previously provided base64 hash was updated to base2 in order to provide an improved auto-scaling behavior on S3 General Purpose Buckets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should link to that for additional context
mkdocs/docs/configuration.md
Outdated
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||
``` | ||
|
||
### Loading a Custom LocationProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -30,7 +30,12 @@ | |||
|
|||
|
|||
class LocationProvider(ABC): | |||
"""A base class for location providers, that provide data file locations for write tasks.""" | |||
"""A base class for location providers, that provide data file locations for a table's write tasks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mkdocs/docs/configuration.md
Outdated
### Loading a Custom LocationProvider | ||
|
||
Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base | ||
class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to link to this, and this works for me locally, but I get the following warning when serving docs locally:
INFO - Doc file 'configuration md' contains an unrecognized relative link '.. / reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider', it was left as is. Did you mean ' reference/pyiceberg/table/locations.md#pyiceberg.table.locations.LocationProvider'?
But I get a similar warning elsewhere: Doc file 'SUMMARY.md' contains an unrecognized relative link 'reference/', it was left as is.
. So I think this is fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea i think its fine, as long as the hyperlink works when you run it locally
An example, custom `LocationProvider` implementation is shown below. | ||
|
||
```py | ||
import uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only shown this import for conciseness.
mkdocs/docs/configuration.md
Outdated
| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | | ||
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Same wording, more or less, as https://iceberg.apache.org/docs/latest/configuration/#write-properties)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add a warning or something about how this default differs from the java implementation
mkdocs/docs/configuration.md
Outdated
The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories, | ||
into file paths, to distribute files across a larger number of object store prefixes. | ||
|
||
Paths contain partitions just before the file name and a `data` directory beneath the table's location, in a similar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1537 (comment) re data
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! These docs are great. I've added a few nit comments
mkdocs/docs/configuration.md
Outdated
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | | ||
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | | ||
| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe similar to the custom catalog
https://py.iceberg.apache.org/configuration/#custom-catalog-implementations
mkdocs/docs/configuration.md
Outdated
| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | | ||
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add a warning or something about how this default differs from the java implementation
mkdocs/docs/configuration.md
Outdated
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | | ||
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hyperlinks are weird sometimes, can you make sure that these work as intended
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked all hyperlinks on the current version and they work as intended
mkdocs/docs/configuration.md
Outdated
|
||
### SimpleLocationProvider | ||
|
||
The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example, | |
The `SimpleLocationProvider` places a table's data file names underneath a `data` directory in the table's storage location. For example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also call out that the "base location" is the table.metadata.location
From the spec, https://iceberg.apache.org/spec/#table-metadata-fields
The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files.
mkdocs/docs/configuration.md
Outdated
s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||
``` | ||
|
||
When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key | |
When the table is partitioned, files under a given partition are grouped into a subdirectory, with that partition key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe also when the hive-style
partition path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```txt | ||
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: what about giving an example of this set to True and another one set to False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the False case just above ("the same data file above" here) - or do you mean making that more explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think given the new wording above, its clear now :) thanks!
mkdocs/docs/configuration.md
Outdated
|
||
The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`. | ||
|
||
### ObjectStoreLocationProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should link to that for additional context
mkdocs/docs/configuration.md
Outdated
### Loading a Custom LocationProvider | ||
|
||
Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base | ||
class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea i think its fine, as long as the hyperlink works when you run it locally
mkdocs/docs/configuration.md
Outdated
class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The | ||
table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom | ||
LocationProvider (i.e. `module.CustomLocationProvider`). Recall that a LocationProvider is configured per-table, | ||
permitting different location provision for different tables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also mention that java uses write.location-provider.impl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the great docs
```txt | ||
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think given the new wording above, its clear now :) thanks!
(See below for screenshots)
Closes #1510. This is my first time writing docs here! Happy to receive style feedback - I already suspect I've written too much.
cc @kevinjqliu @Fokko