InMemory Catalog #293

Closed
kevinjqliu opened this issue Jan 22, 2024 · 8 comments

Comments

@kevinjqliu
Contributor

Feature Request / Improvement

Feature Request: an InMemory Catalog implementation for PyIceberg, for testing, prototyping, and generally playing around with Iceberg.

It would be great to have an in-memory catalog implementation that can also write to the local file system instead of S3. This would lower the barrier to entry for using the Python Iceberg library.

There's currently a NoopCatalog, which is used for loading a StaticTable.

There's also an InMemoryCatalog implementation already used for testing. We can use this as the basis for the implementation.

Inspired by Trino's Memory connector.

@Fokko
Contributor

Fokko commented Jan 22, 2024

Forwarding my comment here as well: #289 (comment)

Maybe we should add this to the documentation of the SqlCatalog as well:

def catalog_memory(warehouse: Path) -> Generator[SqlCatalog, None, None]:
    props = {
        "uri": "sqlite+pysqlite:///:memory:",
        "warehouse": f"file://{warehouse}",
    }

The warehouse directory is where the files are stored.
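
For reference, a fuller sketch of how these properties might be used is below. It assumes PyIceberg is installed with its SQLite support; the helper names `in_memory_catalog_props` and `make_catalog` and the catalog name `in_memory` are made up for illustration and are not part of the library.

```python
from pathlib import Path


def in_memory_catalog_props(warehouse: Path) -> dict:
    """Build SqlCatalog properties: metadata in in-memory SQLite, files under `warehouse`."""
    return {
        "uri": "sqlite+pysqlite:///:memory:",
        "warehouse": f"file://{warehouse}",
    }


def make_catalog(warehouse: Path):
    """Create a SqlCatalog from the properties above (hypothetical helper)."""
    # Imported lazily so the property helper works without pyiceberg installed.
    from pyiceberg.catalog.sql import SqlCatalog

    catalog = SqlCatalog("in_memory", **in_memory_catalog_props(warehouse))
    # Create the catalog's bookkeeping tables in the SQLite database.
    catalog.create_tables()
    return catalog
```

Calling `make_catalog(Path("/tmp/warehouse"))` should then give a catalog whose table metadata pointers live only for the lifetime of the process, while the table files are written under the warehouse directory.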

@asheeshgarg

@Fokko
Is apache/iceberg#4518 supported as part of PyIceberg?

@kevinjqliu
Contributor Author

@Fokko
Didn't know the in-memory sqlite option was available! That's awesome.
I was able to read and write using the SqlCatalog. Metadata is kept in memory by SQLite, and data is saved on disk.

Pulling out your comment in #289

Thanks for working on this @kevinjqliu. The issue was created a long time ago, before we had the SqlCatalog with SQLite support. SQLite can also work in memory, rendering the InMemoryCatalog obsolete. Having two in-memory implementations adds complexity to the codebase. My suggestion would be to replace the MemoryCatalog with the SqlCatalog. WDYT?

I agree that we don't need 2 in-memory catalog implementations. Let me see if I can repurpose #289

@kevinjqliu
Contributor Author

kevinjqliu commented Jan 22, 2024

@Fokko I took some time to think this over. I understand your concern about having 2 in-memory catalog implementations. The 2 implementations are just 2 different ways to achieve the same outcome; using SQLite's in-memory database is an implementation detail.

I'm leaning towards having InMemoryCatalog as an option especially since there is a Java version of it already.

I think there's value in storing the pure Python objects in memory and making them easily accessible. SQLite's in-memory database adds another level of indirection, which can make it harder to reason about the internals of Iceberg.
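
To illustrate what "pure Python objects in memory" could mean, here is a toy sketch. It is not PyIceberg's InMemoryCatalog; the class `ToyInMemoryCatalog`, its methods, and the metadata path used with it are hypothetical and only show that the whole catalog state can be plain dicts and sets that are easy to inspect.

```python
from typing import Dict, List, Set, Tuple

Identifier = Tuple[str, ...]


class ToyInMemoryCatalog:
    """Hypothetical dict-backed catalog; not PyIceberg's InMemoryCatalog."""

    def __init__(self) -> None:
        # All state is ordinary Python containers, so it can be printed or debugged directly.
        self.namespaces: Set[str] = set()
        self.tables: Dict[Identifier, str] = {}

    def create_namespace(self, namespace: str) -> None:
        self.namespaces.add(namespace)

    def register_table(self, identifier: Identifier, metadata_location: str) -> None:
        # A table may only be registered under a namespace that already exists.
        if identifier[0] not in self.namespaces:
            raise KeyError(f"Namespace does not exist: {identifier[0]}")
        self.tables[identifier] = metadata_location

    def list_tables(self, namespace: str) -> List[Identifier]:
        return sorted(t for t in self.tables if t[0] == namespace)
```

With something like this, looking up where a table's metadata lives is a single dict access (`catalog.tables[("default", "events")]`) rather than a SQL query.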

@Fokko
Contributor

Fokko commented Jan 23, 2024

@kevinjqliu Alright, that's fair, I just wanted to make sure that we considered the option before making the InMemory one public 👍

@Fokko
Contributor

Fokko commented Feb 29, 2024

I must admit that I'm second-guessing the decision to add another catalog. This mostly comes from the recent discussions on the Java side where catalogs are being removed to avoid further proliferation.

@kevinjqliu
Contributor Author

@Fokko that's fair. I'll see if there are features in the PR I can pull out. We can leave the InMemory catalog under tests/ for now.
