Fix overwrite when filtering all the data #1023
Just wanted to confirm that I tried this out with the table that caused my issue #1020 and it works as expected.
Hi @ndrluis - thanks for testing and fixing this tricky issue.
Interesting catch! This is when we cannot delete the file through Iceberg metadata, but we drop all the content of the Parquet file anyway. Thanks for fixing this @ndrluis 🙌
1d1f987 to ac7b4db
if len(filtered_df) == 0:
    replaced_files.append((original_file.file, []))
elif len(df) != len(filtered_df):
nit: is it more readable if inlined?
if filtered_df and len(df) != len(filtered_df):
To be honest, I don't see much of a difference.
tbl = _create_table(session_catalog, identifier, data=[data], schema=schema)
tbl.overwrite(data, In("id", ["1", "2", "3"]))

assert len(tbl.scan().to_arrow()) == 3
nit: since all data match the filter, the `overwrite` operation is a no-op, right? If so, can we assert that in the test? Maybe show that the files are the same.
It's not a no-op, it's deleting the whole file. The change is in the delete method, not in the overwrite method.
I believe that testing the behavior is enough.
Ah, I see. The change is to make `delete` a no-op.
Sequence of operations:
- pass in an `overwrite_filter` which matches the entire table
- in `delete`, the `overwrite_filter` is inverted to get `preserve_row_filter`
- use `preserve_row_filter` on data files
- if the result is empty, then we don't include this data file in deletion
Previously, we ended up trying to write an empty data file.
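The sequence above can be sketched in plain Python. This is only an illustration of the branching in the diff, not the PyIceberg implementation: lists of dicts stand in for the Arrow tables and Parquet files, and `plan_replacements`, `Row`, and `Predicate` are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a "row" is a dict, a "data file" is a list of rows.
Row = Dict[str, str]
Predicate = Callable[[Row], bool]


def plan_replacements(
    data_files: Dict[str, List[Row]],
    delete_filter: Predicate,
) -> List[Tuple[str, List[Row]]]:
    """Return (file_name, rows_to_rewrite) pairs for files touched by the delete.

    An empty rows_to_rewrite list means the file is dropped with no replacement.
    """

    # Invert the delete filter to get the rows that must be preserved.
    def preserve_row_filter(row: Row) -> bool:
        return not delete_filter(row)

    replaced_files: List[Tuple[str, List[Row]]] = []
    for file_name, rows in data_files.items():
        filtered_rows = [row for row in rows if preserve_row_filter(row)]
        if len(filtered_rows) == 0:
            # Every row matched the delete filter: drop the file, write nothing.
            replaced_files.append((file_name, []))
        elif len(rows) != len(filtered_rows):
            # Some rows survive: the file is replaced by a rewrite of the survivors.
            replaced_files.append((file_name, filtered_rows))
        # Otherwise no row matched and the file is left untouched.
    return replaced_files


files = {
    "a.parquet": [{"id": "1"}, {"id": "2"}, {"id": "3"}],
    "b.parquet": [{"id": "3"}, {"id": "4"}],
    "c.parquet": [{"id": "5"}],
}
plan = plan_replacements(files, lambda row: row["id"] in {"1", "2", "3"})
```

In this sketch the fully matched file is recorded with an empty replacement list, the partially matched file is rewritten with its surviving rows, and the unmatched file does not appear in the plan at all.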
Yes, exactly. One thing to note is that it would be even more correct to add this to a `DELETE` snapshot, since the file is not replaced, but just dropped. Please note that most engines just use `OVERWRITE`.
ac7b4db to 486dd61
LGTM!
Fixes #1020