Proposed refactoring or deprecation
Introduce a real storage abstraction layer for Aim’s backend storage, so that the core repository/query layer is not tightly coupled to the current RocksDB-based implementation.
The goal is not to remove RocksDB or force a different default. The goal is to make the storage architecture extensible enough that the core maintainers can continue using and supporting RocksDB, while the community can implement alternative backends where needed.
Motivation
A number of existing issues suggest that the current storage architecture is creating operational and scaling pain, while also making it difficult for users to adopt alternative backends without forking or large internal changes.
Relevant issues include:
- ERROR Too many open files #3389
- IO Error: 'too many open files' when removing many corrupted runs #3224
- Does it support postgresql or other databases? #3356
There are also related requests for storage pluggability on the artifact/object side:
- Support for storing artifacts in Azure Blob Storage #3380
- Support Google Cloud Storage GCS for artifacts #3391
In my own investigation of a `Too many open files` failure, the problem did not appear to be a simple OS limit issue: a single Aim worker alone had roughly 1000 regular files open, most of them `.sst` files under `.aim/meta/chunks/*`.
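For anyone who wants to reproduce this measurement, the per-process count can be taken from `/proc` on Linux. This is a generic diagnostic sketch, not Aim code; pass the PID of the Aim worker as reported by `ps`:

```python
import os
from collections import Counter

def open_file_suffixes(pid: int) -> Counter:
    """Count open regular files for a process, grouped by extension.

    Linux-only: reads the /proc/<pid>/fd symlinks.
    """
    counts: Counter = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if os.path.isfile(target):
            counts[os.path.splitext(target)[1] or "<no ext>"] += 1
    return counts

# Inspect the current process; for an Aim worker, pass its PID instead.
print(open_file_suffixes(os.getpid()))
```

On the failing worker this kind of breakdown showed `.sst` dominating, which is what pointed the investigation at the chunk-per-run RocksDB layout rather than at the OS limit itself.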
From reading the code, this appears to be tied to the current architecture:
- there is a generic container interface in `aim/storage/container.py`, but `Repo` still directly imports and instantiates `RocksContainer`/`RocksUnionContainer`
- run storage is bound to chunk-local trees
- union read paths enumerate and open chunk DBs
- `max_open_files=-1` is set in the RocksDB container implementation
Taken together, this suggests that RocksDB is not just the default backend, but a core architectural assumption in the current implementation.
Pitch
I would like to propose introducing a proper abstraction boundary above the current low-level container API, at the repository/storage-factory level.
Concretely, this would ideally mean:
- `Repo` and higher-level query/storage paths depend on a backend interface rather than directly on Rocks-specific classes
- RocksDB remains a first-class default backend
- users are not forced into the current chunk-local RocksDB layout as the only practical architecture
- the community can implement alternative backends for their own needs without requiring the maintainers to replace the default storage engine
- the same architectural principle can be applied to object/artifact storage as well
This would let the core maintainers keep the current storage model where it works well, while making Aim more adaptable for users whose workloads would benefit from a different backend.
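To make the shape of this concrete, here is a minimal sketch of what such a boundary could look like. Every name in it (`StorageBackend`, `register_backend`, `MemoryBackend`, etc.) is a hypothetical illustration, not an existing Aim API; the point is only that `Repo` would resolve a backend through a registry instead of importing Rocks classes directly:

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Optional, Tuple, Type


class Container(ABC):
    """Minimal key-value contract, in the spirit of aim/storage/container.py."""

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...

    @abstractmethod
    def set(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def items(self) -> Iterator[Tuple[bytes, bytes]]: ...


class StorageBackend(ABC):
    """What Repo would depend on, instead of Rocks-specific classes."""

    @abstractmethod
    def open_container(self, name: str, read_only: bool = False) -> Container: ...


# Registry so community backends can plug in without patching Repo.
BACKENDS: Dict[str, Type[StorageBackend]] = {}


def register_backend(name: str):
    def deco(cls: Type[StorageBackend]) -> Type[StorageBackend]:
        BACKENDS[name] = cls
        return cls
    return deco


# A toy in-memory backend, standing in for a community implementation.
@register_backend("memory")
class MemoryBackend(StorageBackend):
    def __init__(self) -> None:
        self._containers: Dict[str, Dict[bytes, bytes]] = {}

    def open_container(self, name: str, read_only: bool = False) -> Container:
        data = self._containers.setdefault(name, {})

        class _MemContainer(Container):
            def get(self, key: bytes) -> Optional[bytes]:
                return data.get(key)

            def set(self, key: bytes, value: bytes) -> None:
                data[key] = value

            def items(self) -> Iterator[Tuple[bytes, bytes]]:
                return iter(sorted(data.items()))

        return _MemContainer()


# Repo-side resolution: look the backend up by name (e.g. from repo config).
backend = BACKENDS["memory"]()
runs = backend.open_container("meta/chunks/run-1")
runs.set(b"metric", b"loss")
print(runs.get(b"metric"))
```

Under this scheme RocksDB would simply be the registered default (the existing container classes wrapped by one adapter), so nothing changes for current users or for the maintainers' support surface.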
Additional context
Files that seem especially relevant:
- `aim/storage/container.py`
- `aim/storage/rockscontainer.pyx`
- `aim/storage/union.pyx`
- `aim/sdk/repo.py`
- `aim/sdk/base_run.py`
- `aim/sdk/index_manager.py`
Relevant code observations:
- `aim/storage/container.py` provides a generic storage interface
- `aim/sdk/repo.py` directly imports and constructs `RocksContainer`/`RocksUnionContainer`
- `aim/sdk/base_run.py` binds run data to chunk-local trees under `meta/chunks/<run>` and `seqs/chunks/<run>`
- `aim/storage/union.pyx` enumerates and opens chunk databases for read access
- `aim/storage/rockscontainer.pyx` sets `max_open_files=-1`
I think a real abstraction layer here would be valuable even if no new backend ships immediately, because it would reduce coupling and make future storage work much easier for both maintainers and the community.