Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the InMemory Catalog Implementation #289

Merged
merged 24 commits into from
Mar 13, 2024

Conversation

kevinjqliu
Copy link
Contributor

@kevinjqliu kevinjqliu commented Jan 20, 2024

Issue #293

Improve the InMemory Catalog implementation.

In this PR:

  • Implement In-Memory catalog’s create_table and _commit_table function
  • Added default warehouse location which can be on the local file system (defaults to /tmp/warehouse)
  • Change test_base.py test_console.py to write to a temporary file location on the local file system using tmp_path from pytest
  • Fix test_commit_table from tests/catalog/test_base.py, issue described in schema_id not incremented during schema evolution  #290

@kevinjqliu kevinjqliu force-pushed the kevinliu/test branch 3 times, most recently from 9242e69 to c09e7e9 Compare January 22, 2024 00:32
@kevinjqliu kevinjqliu changed the title Kevinliu/test InMemory Catalog Implementation Jan 22, 2024
from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
from pyiceberg.typedef import EMPTY_DICT

DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse"
Copy link
Contributor Author

@kevinjqliu kevinjqliu Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by default, write on disk to /tmp/warehouse

super().__init__(name, **properties)
self.__tables = {}
self.__namespaces = {}
self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can pass a warehouse location using properties. warehouse location can be another fs such as s3

@pytest.fixture
def catalog() -> InMemoryCatalog:
return InMemoryCatalog("test.in.memory.catalog", **{"test.key": "test.value"})
def catalog(tmp_path: PosixPath) -> InMemoryCatalog:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added ability to write to temporary files for testing, which is then automatically cleaned up

@kevinjqliu kevinjqliu force-pushed the kevinliu/test branch 2 times, most recently from 4842d91 to a56838d Compare January 22, 2024 05:06
@@ -585,8 +397,10 @@ def test_commit_table(catalog: InMemoryCatalog) -> None:

# Then
assert response.metadata.table_uuid == given_table.metadata.table_uuid
assert len(response.metadata.schemas) == 1
assert response.metadata.schemas[0] == new_schema
assert given_table.metadata.current_schema_id == 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinjqliu kevinjqliu marked this pull request as ready for review January 22, 2024 05:15
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! @kevinjqliu I left some initial comments below.

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
tests/catalog/test_base.py Outdated Show resolved Hide resolved
pyiceberg/catalog/in_memory.py Outdated Show resolved Hide resolved
@Fokko
Copy link
Contributor

Fokko commented Jan 22, 2024

Thanks for working on this @kevinjqliu. The issues was created a long time ago, before we had the SqlCatalog with sqlite support. Sqlite can also work in memory rendering the InMemoryCatalog obsolete. Having two in-memory implementations in the codebase adds additional complexity in the codebase. My suggestion would be to replace the MemoryCatalog with the SqlCatalog. WDYT?

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @kevinjqliu

Should we also add this catalog to the tests in tests/integration/test_reads.py?

AVAILABLE_CATALOGS: dict[CatalogType, Callable[[str, Properties], Catalog]] = {
CatalogType.REST: load_rest,
CatalogType.HIVE: load_hive,
CatalogType.GLUE: load_glue,
CatalogType.DYNAMODB: load_dynamodb,
CatalogType.SQL: load_sql,
CatalogType.MEMORY: load_memory,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add this one to the docs: https://py.iceberg.apache.org/configuration/ With a warning that this is just for testing purposes only.

if identifier in self.__tables:
raise TableAlreadyExistsError(f"Table already exists: {identifier}")
else:
if namespace not in self.__namespaces:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other implementations don't auto-create namespaces, however I think it is fine for the InMemory one.

if not location:
location = f'{self._warehouse_location}/{"/".join(identifier)}'

metadata_location = f'{self._warehouse_location}/{"/".join(identifier)}/metadata/metadata.json'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we don't write the metadata here, but we write it below at the _commit method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, the actual writing is done by _commit_table below, but the path of the metadata location is determined here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I'm a bit confused here. If I just want to create the table without inserting any data:

catalog.create_table(schema, ....)

I still expect a new metadata.json file to be found at the table location without any call to _commit_table. But that does not seem to be created by the InMemory catalog now. Is there a reason that we choose this behavior?

In the previous implementation no file is written. But since we have updated _commit_table to write the metadata file, I think it more reasonable to make create_table aligned with other production implementation. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, that makes sense!

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! Thanks for updating this to a formal implementation and adding the doc.
I just have one more comment about create_table

@@ -157,7 +157,7 @@ def describe_properties(self, properties: Properties) -> None:
Console().print(output_table)

def text(self, response: str) -> None:
Console().print(response)
Console(soft_wrap=True).print(response)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some test_console.py outputs are too long and end up with an extra \n in the middle of the string, causing tests to fail

@@ -355,8 +162,6 @@ def test_create_table_pyarrow_schema(catalog: InMemoryCatalog, pyarrow_schema_si
table = catalog.create_table(
identifier=TEST_TABLE_IDENTIFIER,
schema=pyarrow_schema_simple_without_ids,
location=TEST_TABLE_LOCATION,
partition_spec=TEST_TABLE_PARTITION_SPEC,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@syun64 FYI, I realized that the TEST_TABLE_PARTITION_SPEC here breaks this test.

TEST_TABLE_PARTITION_SPEC = PartitionSpec(PartitionField(name="x", transform=IdentityTransform(), source_id=1, field_id=1000))

The partition field's source_id here is 1, but in create_table the schema's field_ids are all -1 due to _convert_schema_if_needed

So assign_fresh_partition_spec_ids fails

original_column_name = old_schema.find_column_name(field.source_id)
if original_column_name is None:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kevinjqliu thank you for flagging this 😄 I think '-1' ID discrepancy is the symptom of the issue that makes the issue easy to understand, just as we decided in #305 (comment)

The root cause of the issue I think is that we are introducing a way for non-ID's schema (PyArrow Schema) to be used as an input into create_table, while not supporting the same for partition_spec and sort_order (PartitionField and SortField both require field IDs as inputs).

So I think we should update both assign_fresh_partition_spec_ids and assign_fresh_sort_order_ids to support field look up by name.

@Fokko - does that sound like a good way to resolve this issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #338 to track this issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you @syun64 that for creating tables having to look up the IDs is not ideal. Probably that API has to be extended at some point.

But for the metadata (and also how Iceberg internally tracks columns, since names can change; IDs not), we need to track it by ID. I'm in doubt if assigning -1 was the best idea because that will give you a table that you cannot work with. Thanks for creating the issue, and let's continue there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good @Fokko 👍 and thanks again for flagging this @kevinjqliu !

@kevinjqliu
Copy link
Contributor Author

hey @Fokko / @HonahX do you mind taking a look at this again?

@kevinjqliu kevinjqliu changed the title InMemory Catalog Implementation Improve the InMemory Catalog Implementation Mar 1, 2024
@kevinjqliu
Copy link
Contributor Author

@Fokko As we discussed in #293, let's not create yet another catalog.
I moved the changes back to test_base.py where the In-Memory catalog was originally. This PR improves the implementation along with a bunch of testing improvements

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

mkdocs/docs/configuration.md Outdated Show resolved Hide resolved
mkdocs/docs/configuration.md Outdated Show resolved Hide resolved
tests/catalog/test_base.py Outdated Show resolved Hide resolved
kevinjqliu and others added 3 commits March 5, 2024 10:57
@kevinjqliu
Copy link
Contributor Author

Thanks for the suggestions, @Fokko

@kevinjqliu kevinjqliu requested a review from Fokko March 9, 2024 16:50
@bitsondatadev
Copy link
Contributor

Thanks for working on this @kevinjqliu. The issues was created a long time ago, before we had the SqlCatalog with sqlite support. Sqlite can also work in memory rendering the InMemoryCatalog obsolete. Having two in-memory implementations in the codebase adds additional complexity in the codebase. My suggestion would be to replace the MemoryCatalog with the SqlCatalog. WDYT?

@kevinjqliu, this was likely answered offline and I suppose there was a reason to continue working here. I also am curious to know if this catalog still makes sense with inmem sqlite?

@kevinjqliu
Copy link
Contributor Author

@bitsondatadev

I also am curious to know if this catalog still makes sense with inmem sqlite?

We agreed to not move this implementation to production. See #289 (comment)
Instead, this PR is used to improve the InMemory catalog implementation in tests and use it to improve other tests

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @kevinjqliu, this is a great improvement 👍

@Fokko Fokko added this to the PyIceberg 0.7.0 release milestone Mar 13, 2024
@Fokko Fokko merged commit 36a505f into apache:main Mar 13, 2024
7 checks passed
@kevinjqliu kevinjqliu deleted the kevinliu/test branch March 13, 2024 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants