Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix overwrite when filtering all the data #1023

Merged
merged 1 commit into from
Aug 9, 2024

Conversation

ndrluis
Copy link
Collaborator

@ndrluis ndrluis commented Aug 8, 2024

Fixes #1020

@ndrluis ndrluis added this to the PyIceberg 0.7.1 release milestone Aug 8, 2024
@ndrluis ndrluis requested review from Fokko, kevinjqliu and HonahX August 8, 2024 13:23
@Minfante377
Copy link
Contributor

Just wanted to confirm that I tried this out with the table that caused my issue #1020 and it works as expected

@sungwy
Copy link
Collaborator

sungwy commented Aug 8, 2024

Hi @ndrluis - thanks for testing and fixing this tricky issue.

@ndrluis ndrluis requested a review from sungwy August 8, 2024 14:35
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting catch! This is when we cannot delete the file through Iceberg metadata, but we drop all the content of the Parquet file anyway. Thanks for fixing this @ndrluis 🙌

@ndrluis ndrluis force-pushed the fix-overwrite-with-filter branch 2 times, most recently from 1d1f987 to ac7b4db Compare August 8, 2024 15:23
Comment on lines +596 to +598
if len(filtered_df) == 0:
replaced_files.append((original_file.file, []))
elif len(df) != len(filtered_df):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it more readable if inlined?

if filtered_df and len(df) != len(filtered_df):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't see much of a difference.

tbl = _create_table(session_catalog, identifier, data=[data], schema=schema)
tbl.overwrite(data, In("id", ["1", "2", "3"]))

assert len(tbl.scan().to_arrow()) == 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since all data match the filter, the overwrite operation is a no-op, right? if so, can we assert that in the test? maybe show that the files are the same

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a no-op, it's deleting the whole file. The change is in the delete method, not in the overwrite method.

I believe that testing the behavior is enough.

Copy link
Contributor

@kevinjqliu kevinjqliu Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see. The change is to make delete a no-op.
Sequence of operation

  • pass in `overwrite_filter which matches the entire table
  • in delete, the overwrite_filter is inversed, preserve_row_filter
  • use preserve_row_filter on data files.
  • if the result is empty, then we don't include this data file in deletion

Previously, we end up trying to write an empty data file.

Copy link
Contributor

@Fokko Fokko Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly. One thing to note is that it would be even more correct to add this to a DELETE snapshot, it is not replaced, but just dropped. Please note that most engines just use OVERWRITE.

@ndrluis ndrluis force-pushed the fix-overwrite-with-filter branch from ac7b4db to 486dd61 Compare August 8, 2024 15:50
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Fokko Fokko merged commit 8f2e787 into apache:main Aug 9, 2024
7 checks passed
sungwy pushed a commit that referenced this pull request Aug 9, 2024
sungwy pushed a commit that referenced this pull request Aug 9, 2024
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Overwrite with filter division by zero error
5 participants