Adding add_files_overwrite method #810
It's very cool to see that others are finding the new add_files API useful 😄
I left a comment - there's an open PR that I think we should base this implementation on.
pyiceberg/table/__init__.py
"""
if self._table.name_mapping() is None:
    self.set_properties(**{TableProperties.DEFAULT_NAME_MAPPING: self._table.schema().name_mapping.model_dump_json()})
with self.update_snapshot(snapshot_properties=snapshot_properties).overwrite() as update_snapshot:
There's an open PR to implement partial deletes that I think we should leverage for this API.
Similar to the proposed implementation of overwrite, instead of calling overwrite() I think we'd want to invoke self.delete() with the overwrite_filter to delete all or part of the data, rewriting files where needed, and then add the data files.
I see, thanks @syun64. I'll rebase my branch then and use self.delete(). However, we are also interested in keeping the previous data, so that the history stays unchanged even when overwriting.
@syun64 I've rebased on the Fokko:fd-add-ability-to-delete-full-data-files branch and used the delete function in add_files_overwrite. The tests I added are passing, but the rebase makes this PR hard to review, so it probably makes sense to put it on pause until Fokko's PR is merged. I'll still be pushing more updates to the tests.
As for add_files_overwrite, at this point I'm interested only in overwriting everything, so I'm not adding the overwrite_filter argument that the overwrite method has. But I can add an overwrite_filter argument if you think it makes sense.
I think you will still be able to achieve that by setting the filter to ALWAYS_TRUE.
My opinion on this is that when we introduce a new API, we should try to make it serve a more generic use case (overwriting based on a filter) rather than a very specific one (overwriting all data), especially given that there's already a very similar API pattern in overwrite. I think mimicking the inputs there will make the API feel more consistent for other users as well.
Another option we can take to support your specific use case, if the community decides that we don't want to add a new API for add_files_overwrite, is to just build your own transaction like below:
with tbl.transaction() as txn:
    txn.delete(delete_filter=ALWAYS_TRUE)
    txn.add_files(file_paths=file_paths)
This will achieve your desired outcome in a single atomic transaction on the Iceberg table, without needing to add a new API to the repository.
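For anyone who wants to reuse the pattern above, it could be wrapped in a small helper on the caller's side. This is only a sketch: the name overwrite_with_files is hypothetical and not part of pyiceberg, and it assumes tbl is a table object whose transaction() context manager exposes the delete(delete_filter=...) and add_files(file_paths=...) calls used in the snippet above. Taking the filter as an argument keeps the helper usable for partial overwrites as well as full ones.

```python
from typing import Any, List


def overwrite_with_files(tbl: Any, file_paths: List[str], delete_filter: Any) -> None:
    """Delete rows matching delete_filter, then register file_paths.

    Hypothetical helper (not a pyiceberg API). `tbl` is assumed to expose
    transaction(), yielding an object with delete() and add_files().
    """
    # Both operations are staged on the same transaction object, so they
    # are committed together when the `with` block exits without error.
    with tbl.transaction() as txn:
        txn.delete(delete_filter=delete_filter)
        txn.add_files(file_paths=file_paths)


# Assumed usage against a real table (the path is a placeholder):
#   from pyiceberg.expressions import AlwaysTrue
#   overwrite_with_files(tbl, ["s3://bucket/path/file.parquet"], AlwaysTrue())
```

Passing an always-true filter reproduces the "overwrite everything" case discussed in this thread; a narrower expression would limit the delete to matching rows.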
Hi @enkidulan - thank you very much for putting in the time to write up this PR. I really appreciate the work and your interest in the new API. I gave this a bit more thought: with any new public API we introduce to the repository, we should carefully consider whether the value it adds outweighs the cost of maintaining it. A new public API means a new public function that we need to support for our community and keep backward compatible. While we could argue that the feature is simple enough, since it's an extension of two already public-facing APIs, I believe that also leads to the counterargument: if the function is so simple that it can be achieved by combining existing public APIs with no difference in functionality, why would we want to take on the work of maintaining another public API?
What do you think? Are you able to achieve your desired functional outcome without the proposed API?
Thanks for getting back to me @sungwy. I agree with your point of view and respect the decision. I started working on this feature before #569 was introduced, and back then
Issue: #809
Blockers: #569 (this is what this PR is based on)