-
Notifications
You must be signed in to change notification settings - Fork 54
Design doc for refactoring python API namespaces #1179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 2 commits
8e2abfc
b51131f
db4671c
21a3ddc
40307ef
bf5429f
063c59d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,216 @@ | ||
| # Modularizing icechunk-python's namespace structure | ||
|
|
||
| ## Problem | ||
|
|
||
| The top-level python namespace `icechunk` is really polluted, which makes it harder to navigate the library. There's [~80 things](https://github.com/earth-mover/icechunk/blob/main/icechunk-python/python/icechunk/__init__.py) in it. For example, both the ubiquitous class `Repository` and the extremely niche class `ManifestSplitDimCondition` are in the top-level `icechunk` namespace. | ||
|
|
||
| ## Solution | ||
|
|
||
| It would be better practice to split this up into sub-modules. That would improve the user experience in a few ways: | ||
|
|
||
| - auto-complete would be more helpful | ||
| - the API docs would be easier to navigate (see also [#1095](https://github.com/earth-mover/icechunk/issues/1095)) | ||
| - the key classes would be the only ones in the top-level namespace (e.g. `Repo`, `Session`, `IcechunkStore`), making it more immediately obvious how icechunk is structured | ||
| - the words "store"/"storage" would become a bit less overloaded (e.g. `icechunk.Storage`, `icechunk.s3_store`, and `icechunk.IcechunkStore` are all currently in the same namespace, but mean totally different things) | ||
|
|
||
| ## Options for namespace groupings | ||
|
|
||
| 1. **Group by related functionality** | ||
|
|
||
| I.e. everything related to storage together, everything related to virtual chunks together... | ||
|
|
||
| 2. **Group by cloud provider / storage type** | ||
|
|
||
| I.e. everything for Azure together, everything for GCS together... | ||
|
|
||
| Are there any others? | ||
|
|
||
| ## User experience for icechunk beginners | ||
|
|
||
| We don't want it to be too hard for beginners to find the imports they need to get started. There is therefore a tension between strict modularization and initial ease-of-use. | ||
|
|
||
| ## Specific proposal with rationale | ||
|
|
||
| Here's a specific proposal. The rationale is that opening or creating should be top-level, but otherwise everything else should be in specific modules. | ||
|
|
||
| Some opinionated notes about the rationale used: | ||
| - This proposal is aggressive: it reduces the number of things in the top-level namespace from 78 to 5. | ||
| - This proposal chooses to group by related functionality. | ||
| - The location of classes that are only really returned to the user, and rarely directly called by the user (e.g. `Diff`), doesn't really matter. | ||
| - Similarly classes with no constructor methods (e.g. `Session`) don't need to be in the top-level namespace. | ||
| - Modules with only one member are fine if they add clarity. | ||
| - Adding an extra intermediate namespace doesn't make things much harder for users to find. In other words this: | ||
|
|
||
| ```python | ||
| import icechunk | ||
| storage = icechunk.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True) | ||
| repo = icechunk.Repository.create(storage) | ||
| ``` | ||
|
|
||
| is not significantly easier than this: | ||
|
|
||
| ```python | ||
| import icechunk | ||
| storage = icechunk.storage.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True) | ||
| repo = icechunk.Repository.create(storage) | ||
| ``` | ||
|
|
||
| Also note that this is already valid API: | ||
|
|
||
| ```python | ||
| import icechunk | ||
| storage = icechunk.Storage.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True) | ||
| repo = icechunk.Repository.create(storage) | ||
| ``` | ||
|
|
||
| ### Top-level `icechunk` module: | ||
|
|
||
| - `icechunk.Repository` (also available in `icechunk.repository.Repository`) | ||
| - `icechunk.IcechunkStore` (also available in `icechunk.store.IcechunkStore`) | ||
| - `icechunk.spec_version` | ||
| - `icechunk.print_debug_info` | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should |
||
| - `icechunk.__version__` | ||
|
|
||
| ### Sub-modules | ||
|
|
||
| Existing namespaces, with new members added: | ||
|
|
||
| - `credentials` | ||
| - `credentials.Credentials` | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As per this doc, the classes and functions from the existing sub-modules would be exported from the module itself using
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For now, we actually want both: exported from the module itself, and then re-imported from the sub-module into the top-level There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So for now only add the new namespaces to
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We currently have this: # __init__.py
from icechunk._icechunk_python import Credentials
__all__ = [
"Credentials",
]# icechunk.icechunk_python.pyi
class Credentials
...Which only allows you to import via What we want is this: # __init__.py
from icechunk.credentials import Credentials
__all__ = [
"Credentials",
]# icechunk.credentials.py
from icechunk._icechunk_python import Credentials
__all__ = [
"Credentials",
]# icechunk_python.pyi
class Credentials
...This would enable importing via |
||
| - `credentials.azure_credentials` | ||
| - `credentials.azure_from_env_credentials` | ||
| - `credentials.azure_static_credentials` | ||
| - `credentials.containers_credentials` | ||
| - `credentials.gcs_credentials` | ||
| - `credentials.gcs_from_env_credentials` | ||
| - `credentials.gcs_refreshable_credentials` | ||
| - `credentials.gcs_static_credentials` | ||
| - `credentials.s3_anonymous_credentials` | ||
| - `credentials.s3_credentials` | ||
| - `credentials.s3_from_env_credentials` | ||
| - `credentials.s3_refreshable_credentials` | ||
| - `credentials.s3_static_credentials` | ||
| - `dask` | ||
| - `dask.computing_meta` | ||
| - `dask.store_dask` | ||
| - `distributed` | ||
| - `distributed.extract_sessions` | ||
| - `distributed.merge_sessions` | ||
| - `garbage` | ||
| - `garbage.GCSummary` | ||
|
||
| - `repository` | ||
| - `repository.Repository` | ||
| - `session` | ||
| - `session.Session` | ||
| - `session.ForkSession` | ||
| - `storage` | ||
| - `storage.S3Options` | ||
| - `storage.Storage` | ||
| - `storage.StorageConcurrencySettings` | ||
| - `storage.StorageRetriesSettings` | ||
| - `storage.StorageSettings` | ||
| - `storage.AnyObjectStoreConfig` | ||
| - `storage.azure_storage` | ||
| - `storage.gcs_storage` | ||
| - `storage.gcs_store` | ||
| - `storage.http_store` | ||
| - `storage.in_memory_storage` | ||
| - `storage.local_filesystem_storage` | ||
| - `storage.local_filesystem_store` | ||
| - `storage.r2_storage` | ||
| - `storage.s3_storage` | ||
| - `storage.s3_store` | ||
| - `storage.tigris_storage` | ||
| - `store` | ||
| - `store.IcechunkStore` | ||
| - `xarray` | ||
| - `xarray.to_icechunk` | ||
TomNicholas marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| New namespaces: | ||
|
|
||
| - `config` | ||
| - `config.CachingConfig` | ||
| - `config.CompressionConfig` | ||
| - `config.CompressionAlgorithm` | ||
| - `config.ObjectStoreConfig` | ||
| - `config.RepositoryConfig` | ||
| - `conflict` | ||
| - `conflict.BasicConflictSolver` | ||
| - `conflict.Conflict` | ||
| - `conflict.ConflictDetector` | ||
| - `conflict.ConflictSolver` | ||
| - `conflict.ConflictType` | ||
| - `conflict.VersionSelection` | ||
| - `exceptions` | ||
| - `exceptions.ConflictError` | ||
| - `exceptions.IcechunkError` | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree. I expect ConflictError to be used more regularly so it should be top-level
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think either all exceptions should be top-level or none of them |
||
| - `exceptions.RebaseFailedError` | ||
| - `logging` | ||
| - `logging.initialize_logs`, | ||
| - `logging.set_logs_filter`, | ||
| - `manifests` | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All of these should probably go in the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You mean the manifest splitting classes? They also seem like they should be grouped together though... They could actually go even deeper, i.e. |
||
| - `manifests.ManifestConfig` | ||
| - `manifests.ManifestFileInfo` | ||
| - `manifests.ManifestPreloadCondition` | ||
| - `manifests.ManifestPreloadConfig` | ||
| - `manifests.ManifestSplitCondition` | ||
| - `manifests.ManifestSplitDimCondition` | ||
| - `manifests.ManifestSplittingConfig` | ||
| - `snapshots` | ||
| - `snapshots.SnapshotInfo` | ||
| - `snapshots.Diff` | ||
| - `virtual` | ||
| - `virtual.VirtualChunkContainer` | ||
| - `virtual.VirtualChunkSpec` | ||
|
|
||
| ## Outstanding questions | ||
|
|
||
| - It's generally a bit confusing that there are multiple ways to make most classes, e.g. `local_filesystem_store` and `Storage.new_local_filesystem`. Do we really need both? | ||
| - If we only kept `Storage` we could get rid of a _lot_ of functions in favour of methods... | ||
TomNicholas marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| - Alternatively should base classes that aren't really meant to be called by the user (e.g. `Storage`) be public API at all? | ||
TomNicholas marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| - What's the difference between `storage.local_filesystem_storage` and `storage.local_filesystem_store`? | ||
| - Similarly how can we better disambiguate `storage.s3_storage` and `storage.s3_store`? | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we probably need better names.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay yeah that's really confusing
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I sketched an alternative proposal in #1182 |
||
| - Is there a similar question for credentials? | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Credentials are currently a mess
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These deserve their own separate issue. The class hierarchy is complicated so it's not immediately obvious to me how to simplify the API. |
||
| - Should cloud-provider-specific classes be further grouped together within `.storage`/`.credentials`? | ||
| - Should all configs live together? | ||
| - Should `StorageSettings` really be a `StorageConfig`? Should it live with configs or in `storage`? | ||
| - Similarly should `S3Options` really be `S3Config`? | ||
| - Should summary dataclasses such as `Diff` and `GCSummary` live in the top-level? | ||
| - Is `distributed` meant to be a public module at all? Why is it's API defined in a different place? Should it be merged with `dask`? | ||
| - Union types like `AnyCredential` are re-exported as top-level API for importing, but it seems like they are only useful for internal type-hinting purposes? | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Only for typing, but not sure if they are needed publicly or only privately
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If they are only needed for typing, and you don't expect users to make great use of them for typing, I think they should be private. |
||
|
|
||
| ## Implementation steps (including deprecation cycle) | ||
|
|
||
| We need to implement this in multiple steps, including a deprecation cycle which maintains backwards-compatibility until the cycle is complete. | ||
|
|
||
| 1. **Decide on public API structure (i.e. review then merge this design doc)** | ||
|
|
||
| 2. **Move everything into desired locations whilst keeping all top-level imports** | ||
|
|
||
| We should add tests that both import paths work. | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes please, particularly we need to add tests that verify we are not breaking user code |
||
|
|
||
| 3. **Change how docs render to display using fine-grained namespaces instead of single top-level namespace** | ||
|
|
||
| We can probably configure the API docs to display the submodules first and ignore the top-level imports, even whilst the top-level imports are still available. | ||
|
|
||
| 4. **Add deprecation warning pointing users to specific fine-grained import** | ||
|
|
||
| User experience should look like this: | ||
|
|
||
| ```python | ||
| # works, but emits a warning | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should use Icechunk 2 as the moment to make the change and start warning people about deprecation. |
||
| from icechunk import ManifestSplitDimCondition | ||
| ``` | ||
| ``` | ||
| DeprecationWarning: Importing class ``ManifestSplitDimCondition`` via top-level ``icechunk`` namespace is deprecated. Instead of importing as ``from icechunk import ManifestSplitDimCondition``, please import from the ``manifests`` module from now on, e.g. ``from icechunk.manifests import ManifestSplitDimCondition``. | ||
| ``` | ||
| ```python | ||
| # user changes code to this - now doesn't emit a warning | ||
| from icechunk.manifests import ManifestSplitDimCondition | ||
| ``` | ||
|
|
||
| While doing this we should also add deprecation warnings to any other API we decide we don't need. | ||
|
|
||
| 5. **Eventually actually remove top-level import in favour of specific fine-grained import.** | ||
|
|
||
| We could potentially even capture and re-emit a `ModuleNotFoundError` with a note that points to the new location. | ||
Uh oh!
There was an error while loading. Please reload this page.