Skip to content
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
216 changes: 216 additions & 0 deletions design-docs/012-python-namespaces
Original file line number Diff line number Diff line change
@@ -0,0 +1,216 @@
# Modularizing icechunk-python's namespace structure

## Problem

The top-level python namespace `icechunk` is really polluted, which makes it harder to navigate the library. There's [~80 things](https://github.com/earth-mover/icechunk/blob/main/icechunk-python/python/icechunk/__init__.py) in it. For example, both the ubiquitous class `Repository` and the extremely niche class `ManifestSplitDimCondition` are in the top-level `icechunk` namespace.

## Solution

It would be better practice to split this up into sub-modules. That would improve the user experience in a few ways:

- auto-complete would be more helpful
- the API docs would be easier to navigate (see also [#1095](https://github.com/earth-mover/icechunk/issues/1095))
- the key classes would be the only ones in the top-level namespace (e.g. `Repo`, `Session`, `IcechunkStore`), making it more immediately obvious how icechunk is structured
- the words "store"/"storage" would become a bit less overloaded (e.g. `icechunk.Storage`, `icechunk.s3_store`, and `icechunk.IcechunkStore` are all currently in the same namespace, but mean totally different things)

## Options for namespace groupings

1. **Group by related functionality**

I.e. everything related to storage together, everything related to virtual chunks together...

2. **Group by cloud provider / storage type**

I.e. everything for Azure together, everything for GCS together...

Are there any others?

## User experience for icechunk beginners

We don't want it to be too hard for beginners to find the imports they need to get started. There is therefore a tension between strict modularization and initial ease-of-use.

## Specific proposal with rationale

Here's a specific proposal. The rationale is that opening or creating should be top-level, but otherwise everything else should be in specific modules.

Some opinionated notes about the rationale used:
- This proposal is aggressive: it reduces the number of things in the top-level namespace from 78 to 5.
- This proposal chooses to group by related functionality.
- The location of classes that are only really returned to the user, and rarely directly called by the user (e.g. `Diff`), doesn't really matter.
- Similarly classes with no constructor methods (e.g. `Session`) don't need to be in the top-level namespace.
- Modules with only one member are fine if they add clarity.
- Adding an extra intermediate namespace doesn't make things much harder for users to find. In other words this:

```python
import icechunk
storage = icechunk.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)
```

is not significantly easier than this:

```python
import icechunk
storage = icechunk.storage.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)
```

Also note that this is already valid API:

```python
import icechunk
storage = icechunk.Storage.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)
```

### Top-level `icechunk` module:

- `icechunk.Repository` (also available in `icechunk.repository.Repository`)
- `icechunk.IcechunkStore` (also available in `icechunk.store.IcechunkStore`)
- `icechunk.spec_version`
- `icechunk.print_debug_info`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_logs_filter should probably be at the top level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should initialize_logs also go in the top-level?

- `icechunk.__version__`

### Sub-modules

Existing namespaces, with new members added:

- `credentials`
- `credentials.Credentials`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per this doc, the classes and functions from the existing sub-modules would be exported from the module itself using __all__ variable instead of the top-level icechunk package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, we actually want both: exported from the module itself, and then re-imported from the sub-module into the top-level __init__.py. That way your initial PR doesn't change the user experience, but does define a clearer underlying structure of sub-modules. So your current PR is for step 2 as described at the end of this doc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for now only add the new namespaces to icechunk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently have this:

# __init__.py

from icechunk._icechunk_python import Credentials

__all__ = [
    "Credentials",
]
# icechunk.icechunk_python.pyi

class Credentials
    ...

Which only allows you to import via from icechunk import Credentials.

What we want is this:

# __init__.py

from icechunk.credentials import Credentials

__all__ = [
    "Credentials",
]
# icechunk.credentials.py

from icechunk._icechunk_python import Credentials

__all__ = [
    "Credentials",
]
# icechunk_python.pyi

class Credentials
    ...

This would enable importing via from icechunk.credentials import Credentials (i.e. the end state we want) whilst for now still allowing importing via from icechunk import Credentials (i.e. backwards compatibility), and requires no changes to the part of the python code that interacts with rust (i.e. separates API from implementation).

- `credentials.azure_credentials`
- `credentials.azure_from_env_credentials`
- `credentials.azure_static_credentials`
- `credentials.containers_credentials`
- `credentials.gcs_credentials`
- `credentials.gcs_from_env_credentials`
- `credentials.gcs_refreshable_credentials`
- `credentials.gcs_static_credentials`
- `credentials.s3_anonymous_credentials`
- `credentials.s3_credentials`
- `credentials.s3_from_env_credentials`
- `credentials.s3_refreshable_credentials`
- `credentials.s3_static_credentials`
- `dask`
- `dask.computing_meta`
- `dask.store_dask`
- `distributed`
- `distributed.extract_sessions`
- `distributed.merge_sessions`
- `garbage`
- `garbage.GCSummary`
Copy link

@Anu-Ra-g Anu-Ra-g Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no such module garbage in icechunk!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes you're right. Let's make one then. I'll edit this design doc to move garbage into another section.

- `repository`
- `repository.Repository`
- `session`
- `session.Session`
- `session.ForkSession`
- `storage`
- `storage.S3Options`
- `storage.Storage`
- `storage.StorageConcurrencySettings`
- `storage.StorageRetriesSettings`
- `storage.StorageSettings`
- `storage.AnyObjectStoreConfig`
- `storage.azure_storage`
- `storage.gcs_storage`
- `storage.gcs_store`
- `storage.http_store`
- `storage.in_memory_storage`
- `storage.local_filesystem_storage`
- `storage.local_filesystem_store`
- `storage.r2_storage`
- `storage.s3_storage`
- `storage.s3_store`
- `storage.tigris_storage`
- `store`
- `store.IcechunkStore`
- `xarray`
- `xarray.to_icechunk`

New namespaces:

- `config`
- `config.CachingConfig`
- `config.CompressionConfig`
- `config.CompressionAlgorithm`
- `config.ObjectStoreConfig`
- `config.RepositoryConfig`
- `conflict`
- `conflict.BasicConflictSolver`
- `conflict.Conflict`
- `conflict.ConflictDetector`
- `conflict.ConflictSolver`
- `conflict.ConflictType`
- `conflict.VersionSelection`
- `exceptions`
- `exceptions.ConflictError`
- `exceptions.IcechunkError`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe IcechunkError at the top level

Copy link
Contributor

@dcherian dcherian Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I expect ConflictError to be used more regularly so it should be top-level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either all exceptions should be top-level or none of them

- `exceptions.RebaseFailedError`
- `logging`
- `logging.initialize_logs`,
- `logging.set_logs_filter`,
- `manifests`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these should probably go in the config module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the manifest splitting classes? They also seem like they should be grouped together though... They could actually go even deeper, i.e. icechunk.config.manifests.ManifestConfig

- `manifests.ManifestConfig`
- `manifests.ManifestFileInfo`
- `manifests.ManifestPreloadCondition`
- `manifests.ManifestPreloadConfig`
- `manifests.ManifestSplitCondition`
- `manifests.ManifestSplitDimCondition`
- `manifests.ManifestSplittingConfig`
- `snapshots`
- `snapshots.SnapshotInfo`
- `snapshots.Diff`
- `virtual`
- `virtual.VirtualChunkContainer`
- `virtual.VirtualChunkSpec`

## Outstanding questions

- It's generally a bit confusing that there are multiple ways to make most classes, e.g. `local_filesystem_store` and `Storage.new_local_filesystem`. Do we really need both?
- If we only kept `Storage` we could get rid of a _lot_ of functions in favour of methods...
- Alternatively should base classes that aren't really meant to be called by the user (e.g. `Storage`) be public API at all?
- What's the difference between `storage.local_filesystem_storage` and `storage.local_filesystem_store`?
- Similarly how can we better disambiguate `storage.s3_storage` and `storage.s3_store`?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need better names. Storage is what you pass to create/open repos. s3_store is a way to indicate the target of a VirtualChunkContainer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay yeah that's really confusing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sketched an alternative proposal in #1182

- Is there a similar question for credentials?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Credentials are currently a mess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These deserve their own separate issue. The class hierarchy is complicated so it's not immediately obvious to me how to simplify the API.

- Should cloud-provider-specific classes be further grouped together within `.storage`/`.credentials`?
- Should all configs live together?
- Should `StorageSettings` really be a `StorageConfig`? Should it live with configs or in `storage`?
- Similarly should `S3Options` really be `S3Config`?
- Should summary dataclasses such as `Diff` and `GCSummary` live in the top-level?
- Is `distributed` meant to be a public module at all? Why is it's API defined in a different place? Should it be merged with `dask`?
- Union types like `AnyCredential` are re-exported as top-level API for importing, but it seems like they are only useful for internal type-hinting purposes?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only for typing, but not sure if they are needed publicly or only privately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are only needed for typing, and you don't expect users to make great use of them for typing, I think they should be private.


## Implementation steps (including deprecation cycle)

We need to implement this in multiple steps, including a deprecation cycle which maintains backwards-compatibility until the cycle is complete.

1. **Decide on public API structure (i.e. review then merge this design doc)**

2. **Move everything into desired locations whilst keeping all top-level imports**

We should add tests that both import paths work.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please, particularly we need to add tests that verify we are not breaking user code


3. **Change how docs render to display using fine-grained namespaces instead of single top-level namespace**

We can probably configure the API docs to display the submodules first and ignore the top-level imports, even whilst the top-level imports are still available.

4. **Add deprecation warning pointing users to specific fine-grained import**

User experience should look like this:

```python
# works, but emits a warning
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use Icechunk 2 as the moment to make the change and start warning people about deprecation.

from icechunk import ManifestSplitDimCondition
```
```
DeprecationWarning: Importing class ``ManifestSplitDimCondition`` via top-level ``icechunk`` namespace is deprecated. Instead of importing as ``from icechunk import ManifestSplitDimCondition``, please import from the ``manifests`` module from now on, e.g. ``from icechunk.manifests import ManifestSplitDimCondition``.
```
```python
# user changes code to this - now doesn't emit a warning
from icechunk.manifests import ManifestSplitDimCondition
```

While doing this we should also add deprecation warnings to any other API we decide we don't need.

5. **Eventually actually remove top-level import in favour of specific fine-grained import.**

We could potentially even capture and re-emit a `ModuleNotFoundError` with a note that points to the new location.
Loading