Improve the InMemory Catalog Implementation #289
Conversation
Force-pushed 9242e69 to c09e7e9 (compare)
Force-pushed c09e7e9 to df17ee3 (compare)
pyiceberg/catalog/in_memory.py
Outdated
from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
from pyiceberg.typedef import EMPTY_DICT

DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse"
By default, writes on disk to /tmp/warehouse.
pyiceberg/catalog/in_memory.py
Outdated
super().__init__(name, **properties)
self.__tables = {}
self.__namespaces = {}
self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION
A warehouse location can be passed in via properties; it can point to another filesystem such as S3.
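The fallback pattern from the snippet above can be sketched in plain Python (the `WAREHOUSE` constant is simplified here to the literal key `"warehouse"`; this is an illustration, not the actual pyiceberg code):

```python
DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse"

def resolve_warehouse(properties: dict) -> str:
    # An explicit "warehouse" property wins; otherwise fall back to the
    # default local location, matching the `properties.get(...) or ...` idiom.
    return properties.get("warehouse") or DEFAULT_WAREHOUSE_LOCATION

print(resolve_warehouse({}))                                   # default location
print(resolve_warehouse({"warehouse": "s3://my-bucket/wh"}))   # explicit, e.g. S3
```

Note that `or` (rather than a plain `get` default) also replaces an empty string with the default.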
@pytest.fixture
def catalog() -> InMemoryCatalog:
    return InMemoryCatalog("test.in.memory.catalog", **{"test.key": "test.value"})
def catalog(tmp_path: PosixPath) -> InMemoryCatalog:
Added the ability to write to temporary files for testing, which are then automatically cleaned up.
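The `tmp_path` idea can be sketched without pytest using only the standard library (illustrative only; in the real fixture, pytest injects `tmp_path` and handles cleanup):

```python
import pathlib
import tempfile

def warehouse_properties(tmp_path: pathlib.Path) -> dict:
    # Point the catalog's warehouse at a per-test temporary directory,
    # so everything written there is cleaned up automatically.
    return {"warehouse": tmp_path.as_uri()}

with tempfile.TemporaryDirectory() as d:
    props = warehouse_properties(pathlib.Path(d))
    print(props["warehouse"].startswith("file://"))  # a file:// URI on local disk
```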
Force-pushed 4842d91 to a56838d (compare)
tests/catalog/test_base.py
Outdated
@@ -585,8 +397,10 @@ def test_commit_table(catalog: InMemoryCatalog) -> None:

    # Then
    assert response.metadata.table_uuid == given_table.metadata.table_uuid
    assert len(response.metadata.schemas) == 1
    assert response.metadata.schemas[0] == new_schema
    assert given_table.metadata.current_schema_id == 1
Thanks for your contribution! @kevinjqliu I left some initial comments below.
Thanks for working on this @kevinjqliu. The issue was created a long time ago, before we had the SqlCatalog with SQLite support. SQLite can also work in memory, rendering the
Force-pushed a56838d to 9944896 (compare)
Great work @kevinjqliu
Should we also add this catalog to the tests in tests/integration/test_reads.py?
pyiceberg/catalog/__init__.py
Outdated
AVAILABLE_CATALOGS: dict[CatalogType, Callable[[str, Properties], Catalog]] = {
    CatalogType.REST: load_rest,
    CatalogType.HIVE: load_hive,
    CatalogType.GLUE: load_glue,
    CatalogType.DYNAMODB: load_dynamodb,
    CatalogType.SQL: load_sql,
    CatalogType.MEMORY: load_memory,
Can you also add this one to the docs: https://py.iceberg.apache.org/configuration/, with a warning that it is for testing purposes only.
pyiceberg/catalog/in_memory.py
Outdated
if identifier in self.__tables:
    raise TableAlreadyExistsError(f"Table already exists: {identifier}")
else:
    if namespace not in self.__namespaces:
Other implementations don't auto-create namespaces; however, I think it is fine for the InMemory one.
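A minimal stand-in for the auto-create behavior being discussed (plain dicts rather than the real catalog classes; names are illustrative):

```python
# Simplified model of the in-memory catalog's table/namespace bookkeeping.
tables: dict = {}
namespaces: dict = {}

def create_table(identifier: tuple) -> None:
    namespace = identifier[:-1]
    if identifier in tables:
        raise ValueError(f"Table already exists: {identifier}")
    if namespace not in namespaces:
        # Auto-create the missing namespace instead of raising the
        # NoSuchNamespaceError that other implementations would raise.
        namespaces[namespace] = {}
    tables[identifier] = object()

create_table(("fresh_ns", "t1"))
print(("fresh_ns",) in namespaces)  # namespace was created on the fly
```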
pyiceberg/catalog/in_memory.py
Outdated
if not location:
    location = f'{self._warehouse_location}/{"/".join(identifier)}'

metadata_location = f'{self._warehouse_location}/{"/".join(identifier)}/metadata/metadata.json'
It looks like we don't write the metadata here, but we write it below in the _commit method.
Yep, the actual writing is done by _commit_table below, but the path of the metadata location is determined here.
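That split can be sketched as follows (a hypothetical helper mirroring the f-string in the diff; only the path string is computed at create time, while the file itself is written later by _commit_table):

```python
def metadata_location_for(warehouse_location: str, identifier: tuple) -> str:
    # create_table only derives the metadata path; _commit_table is what
    # actually writes the metadata.json file to this location.
    return f'{warehouse_location}/{"/".join(identifier)}/metadata/metadata.json'

print(metadata_location_for("file:///tmp/warehouse", ("ns", "tbl")))
# → file:///tmp/warehouse/ns/tbl/metadata/metadata.json
```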
Sorry, but I'm a bit confused here. If I just want to create the table without inserting any data:
catalog.create_table(schema, ....)
I still expect a new metadata.json file to be found at the table location, without any call to _commit_table. But that does not seem to be created by the InMemory catalog now. Is there a reason we chose this behavior?
In the previous implementation no file is written. But since we have updated _commit_table to write the metadata file, I think it is more reasonable to make create_table aligned with the other production implementations. WDYT?
gotcha, that makes sense!
Force-pushed 9944896 to be3eb1c (compare)
Overall LGTM! Thanks for updating this to a formal implementation and adding the doc. I just have one more comment about create_table.
@@ -157,7 +157,7 @@ def describe_properties(self, properties: Properties) -> None:
    Console().print(output_table)

def text(self, response: str) -> None:
    Console().print(response)
    Console(soft_wrap=True).print(response)
Some test_console.py outputs are too long and end up with an extra \n in the middle of the string, causing tests to fail.
@@ -355,8 +162,6 @@ def test_create_table_pyarrow_schema(catalog: InMemoryCatalog, pyarrow_schema_si
    table = catalog.create_table(
        identifier=TEST_TABLE_IDENTIFIER,
        schema=pyarrow_schema_simple_without_ids,
        location=TEST_TABLE_LOCATION,
        partition_spec=TEST_TABLE_PARTITION_SPEC,
@syun64 FYI, I realized that the TEST_TABLE_PARTITION_SPEC here breaks this test:
TEST_TABLE_PARTITION_SPEC = PartitionSpec(PartitionField(name="x", transform=IdentityTransform(), source_id=1, field_id=1000))
The partition field's source_id here is 1, but in create_table the schema's field_ids are all -1 due to _convert_schema_if_needed, so assign_fresh_partition_spec_ids fails:
iceberg-python/pyiceberg/partitioning.py, lines 203 to 204 in 102e043:
original_column_name = old_schema.find_column_name(field.source_id)
if original_column_name is None:
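A stand-alone illustration of why the lookup fails (a simplified stand-in for the real pyiceberg Schema.find_column_name, modeling fields as (name, field_id) pairs):

```python
def find_column_name(fields, source_id):
    # Return the name of the field whose ID matches source_id,
    # or None when no field carries that ID.
    return next((name for name, fid in fields if fid == source_id), None)

# After _convert_schema_if_needed, a schema built from a PyArrow schema
# has every field ID set to -1, so the spec's source_id=1 matches nothing.
converted_fields = [("x", -1), ("y", -1)]
print(find_column_name(converted_fields, 1))  # → None, so the assignment fails
```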
Hey @kevinjqliu, thank you for flagging this 😄 I think the '-1' ID discrepancy is a symptom that makes the issue easy to understand, just as we decided in #305 (comment).
The root cause, I think, is that we are introducing a way for schemas without IDs (PyArrow Schema) to be used as input to create_table, while not supporting the same for partition_spec and sort_order (PartitionField and SortField both require field IDs as inputs).
So I think we should update both assign_fresh_partition_spec_ids and assign_fresh_sort_order_ids to support field lookup by name.
@Fokko - does that sound like a good way to resolve this issue?
Created #338 to track this issue
I agree with you @syun64 that having to look up the IDs when creating tables is not ideal. Probably that API has to be extended at some point.
But for the metadata (and also for how Iceberg internally tracks columns, since names can change but IDs cannot), we need to track by ID. I doubt that assigning -1 was the best idea, because that gives you a table you cannot work with. Thanks for creating the issue, and let's continue there.
Sounds good @Fokko 👍 and thanks again for flagging this @kevinjqliu !
Looks good 👍
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Thanks for the suggestions, @Fokko
@kevinjqliu, this was likely answered offline and I suppose there was a reason to continue working here. I'm also curious to know whether this catalog still makes sense with in-memory SQLite?
We agreed not to move this implementation to production. See #289 (comment)
Thanks for working on this @kevinjqliu, this is a great improvement 👍
Issue #293
Improve the InMemory Catalog implementation.
In this PR:
- Implement writes in the create_table and _commit_table functions (by default, to /tmp/warehouse)
- Change test_base.py and test_console.py to write to a temporary file location on the local file system using tmp_path from pytest
- Fix test_commit_table from tests/catalog/test_base.py, issue described in "schema_id not incremented during schema evolution" #290