-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incremental Append Scan #533
base: main
Are you sure you want to change the base?
Conversation
pyiceberg/table/__init__.py
Outdated
@@ -1578,6 +1595,120 @@ def to_ray(self) -> ray.data.dataset.Dataset: | |||
return ray.data.from_arrow(self.to_arrow()) | |||
|
|||
|
|||
class IncrementalAppendScan(DataScan): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused here. Should I modify the inheritance structure? The current inheritance structure is TableScan -> DataScan -> IncrementalAppendScan. Should I create a class, like BaseScan, and make both DataScan and IncrementalAppendScan inherit from it, and move the snapshot_id
snapshot()
from TableScan into DataScan? Like:
TableScan
>BaseScan
>DataScan
>IncrementalAppendScan
In the latest code commit, I tinkered with the class inheritance by introducing a new base class, BaseIncrementalScan, which inherits from TableScan. I also pushed the snapshot_id down to DataScan and shuffled a few methods around (which might cause some backward compatibility issues 💔 ). How do you think I can improve it? @Fokko |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @hililiwei sorry for the late reply, for some reason this fell of my radar. This looks like a great start. Excited to see this coming to PyIceberg as well 👍
to_snapshot_id: Optional[int] = None, | ||
from_snapshot_id_exclusive: Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worthwhile to add a docstring there to describe these parameters to help the end-user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments, but I'm not sure if they are well-written. Please review them, thx.
57f6551
to
d1207f8
Compare
Sorry for the late correction. I've adjusted the code based on the latest comments. Could you please take a look? |
@hililiwei I'm sorry, this also fell off my radar. |
Returns: | ||
this for method chaining | ||
""" | ||
return self.update(to_snapshot_id=to_snapshot_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank you for the contribution!
I think in this implementation, to_snapshot_id
doesn't have a default value if to_snapshot_id
is not provided. Can you add the default value as mentioned in the docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the parameter "to_snapshot_id" is not set, we assume that the user intends to get the latest data in the table, so I will fetch the latest snapshot_id of the table.
if self.to_snapshot_id is None:
current_snapshot = self.table_metadata.current_snapshot()
if current_snapshot is None:
raise ValueError("End snapshot is not set and table has no current snapshot")
self.to_snapshot_id = current_snapshot.snapshot_id
Currently, this is being done in the plan_files()
, but we can also move it forward to __init__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think keeping it in plan_files() should be fine! As to_snapshot_id
and from_snapshot_id_exclusive
are Optional[int], do we want to handle the case when the ids are None?
For ex:
def to_snapshot(self: S, to_snapshot_id: Optional[int]) -> self:
if to_snapshot_id is None:
return self
return self.update(to_snapshot_id=to_snapshot_id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Returns: | ||
this for method chaining | ||
""" | ||
return self.update(from_snapshot_id_exclusive=from_snapshot_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here also, from_snapshot_id
also doesn't have a default value, as mentioned in the docstring. Should we add it here or is it being set somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the tool method ancestors_between
, it will handle the situation where from_snapshot_id
is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks
pyiceberg/table/__init__.py
Outdated
@abstractmethod | ||
def _do_plan_files(self, from_snapshot_id_exclusive: int, to_snapshot_id: int) -> Iterable[FileScanTask]: ... | ||
|
||
def plan_files(self) -> Iterable[FileScanTask]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for plan_files()
to return Iterable[ScanTask]
?
As this class is the base for incremental scans, this change will allow returning iterable of child classes of ScanTask including FileScanTask.
For example, ChangelogScanTask(ScanTask), which will be introduced in incremental changelog scan PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
support incremental append scan support incremental append scan
I managed to get a poor mans append-scan with this #240 (comment) Looking at this PR wouldn't it be simpler to implement append-scan in the api by adding a In my case there was no need for touching |
Hi,@Fokko, long time no see. 😄
I have written some preliminary code for incremental reading, but it still needs a lot of work. However, I would like to discuss it with you at an early stage as it will help me stay on the right track. Could you please take a look at it when you have a chance? Thank you.