Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Time Travel in InspectTable.entries #599

Merged
merged 5 commits into from
Apr 13, 2024

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented Apr 12, 2024

Time travellng on Metadata Tables allows for comparisons between two snapshots of Iceberg tables in many different and meaningful ways (files, partitions, etc)

Spark Iceberg API supports timetravel on all metadata tables except for snapshots

https://iceberg.apache.org/docs/nightly/spark-queries/?h=inspecting#time-travel-with-metadata-tables

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great improvement! i didnt know about snapshot_by_id, thanks!

if snapshot := self.tbl.metadata.snapshot_by_id(snapshot_id):
return snapshot
else:
raise ValueError(f"Cannot find snapshot with ID {snapshot_id}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if the return value is Optional[Snapshot], maybe the function should not raise ValueError

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kevinjqliu thanks for the review.

I thought about this, and I stand by this behavior / type annotation. This is my rationale:

  1. If the user passes in snapshot_id - that's an indication that the user wants to look up a specific snapshot_id (instead of using the current one). Then we should look up snapshot_by_id and see if we can find the corresponding Snapshot and if we can't find the Snapshot, we should raise.
  2. The output arg type is still Optional[Snapshot] even if we raise in the above behavior, because tbl.metadata.current_snapshot() returns Optional[Snapshot].
    def current_snapshot(self) -> Optional[Snapshot]:

Copy link
Collaborator Author

@sungwy sungwy Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I believe the above is because a newly created table isn't required to have a snapshot

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 make sense, thanks for the explanation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, and thank you again for the review!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should raise an error when the snapshot cannot be found. What do you tink of updating the signature to Snapshot, and also raise an exception when there is no current snapshot?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko Sure - I thought it would be more correct to return an empty metadata table (entries, partitions, etc) if there's no snapshot in the table than raising an Exception, but this way I think we avoid extra if statements in each of the metadata table generating methods.

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
@@ -3253,6 +3253,15 @@ def __init__(self, tbl: Table) -> None:
except ModuleNotFoundError as e:
raise ModuleNotFoundError("For metadata operations PyArrow needs to be installed") from e

def _snapshot(self, snapshot_id: Optional[int] = None) -> Optional[Snapshot]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we think of a better name that describes the function's behavior? :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't think of a better one... is this a bit better than _snapshot? ^^;

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @syun64.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for adding this @syun64

@Fokko Fokko merged commit 4b91105 into apache:main Apr 13, 2024
7 checks passed
@sungwy sungwy deleted the inspect-snapshot branch April 13, 2024 19:11
@geruh geruh mentioned this pull request May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants