-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Time Travel in InspectTable.entries #599
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great improvement! i didnt know about snapshot_by_id
, thanks!
if snapshot := self.tbl.metadata.snapshot_by_id(snapshot_id): | ||
return snapshot | ||
else: | ||
raise ValueError(f"Cannot find snapshot with ID {snapshot_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if the return value is Optional[Snapshot]
, maybe the function should not raise ValueError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kevinjqliu thanks for the review.
I thought about this, and I stand by this behavior / type annotation. This is my rationale:
- If the user passes in
snapshot_id
- that's an indication that the user wants to look up a specific snapshot_id (instead of using the current one). Then we should look up snapshot_by_id and see if we can find the corresponding Snapshot and if we can't find the Snapshot, we should raise. - The output arg type is still
Optional[Snapshot]
even if we raise in the above behavior, because tbl.metadata.current_snapshot() returnsOptional[Snapshot]
.iceberg-python/pyiceberg/table/metadata.py
Line 295 in 5039b5d
def current_snapshot(self) -> Optional[Snapshot]:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I believe the above is because a newly created table isn't required to have a snapshot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 make sense, thanks for the explanation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, and thank you again for the review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should raise an error when the snapshot cannot be found. What do you tink of updating the signature to Snapshot
, and also raise an exception when there is no current snapshot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko Sure - I thought it would be more correct to return an empty metadata table (entries, partitions, etc) if there's no snapshot in the table than raising an Exception, but this way I think we avoid extra if
statements in each of the metadata table generating methods.
pyiceberg/table/__init__.py
Outdated
@@ -3253,6 +3253,15 @@ def __init__(self, tbl: Table) -> None: | |||
except ModuleNotFoundError as e: | |||
raise ModuleNotFoundError("For metadata operations PyArrow needs to be installed") from e | |||
|
|||
def _snapshot(self, snapshot_id: Optional[int] = None) -> Optional[Snapshot]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we think of a better name that describes the function's behavior? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't think of a better one... is this a bit better than _snapshot? ^^;
Co-authored-by: Fokko Driesprong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @syun64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for adding this @syun64
Time travellng on Metadata Tables allows for comparisons between two snapshots of Iceberg tables in many different and meaningful ways (files, partitions, etc)
Spark Iceberg API supports timetravel on all metadata tables except for snapshots
https://iceberg.apache.org/docs/nightly/spark-queries/?h=inspecting#time-travel-with-metadata-tables