Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental Append Scan #533

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

hililiwei
Copy link
Contributor

@hililiwei hililiwei commented Mar 19, 2024

Hi,@Fokko, long time no see. 😄
I have written some preliminary code for incremental reading, but it still needs a lot of work. However, I would like to discuss it with you at an early stage as it will help me stay on the right track. Could you please take a look at it when you have a chance? Thank you.

@hililiwei hililiwei changed the title Incremental Append Scan WIP: Incremental Append Scan Mar 19, 2024
@ndrluis ndrluis mentioned this pull request Mar 19, 2024
8 tasks
@@ -1578,6 +1595,120 @@ def to_ray(self) -> ray.data.dataset.Dataset:
return ray.data.from_arrow(self.to_arrow())


class IncrementalAppendScan(DataScan):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused here. Should I modify the inheritance structure? The current inheritance structure is TableScan -> DataScan -> IncrementalAppendScan. Should I create a class, like BaseScan, and make both DataScan and IncrementalAppendScan inherit from it, and move the snapshot_id snapshot() from TableScan into DataScan? Like:

TableScan  
    >BaseScan 
         >DataScan 
         >IncrementalAppendScan

@hililiwei
Copy link
Contributor Author

In the latest code commit, I tinkered with the class inheritance by introducing a new base class, BaseIncrementalScan, which inherits from TableScan. I also pushed the snapshot_id down to DataScan and shuffled a few methods around (which might cause some backward compatibility issues 💔 ). How do you think I can improve it? @Fokko

@hililiwei hililiwei changed the title WIP: Incremental Append Scan Incremental Append Scan Apr 2, 2024
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @hililiwei sorry for the late reply, for some reason this fell of my radar. This looks like a great start. Excited to see this coming to PyIceberg as well 👍

pyiceberg/manifest.py Outdated Show resolved Hide resolved
Comment on lines +1632 to +1980
to_snapshot_id: Optional[int] = None,
from_snapshot_id_exclusive: Optional[int] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worthwhile to add a docstring there to describe these parameters to help the end-user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments, but I'm not sure if they are well-written. Please review them, thx.

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
@hililiwei hililiwei force-pushed the incremental branch 4 times, most recently from 57f6551 to d1207f8 Compare April 30, 2024 10:47
@hililiwei
Copy link
Contributor Author

Sorry for the late correction. I've adjusted the code based on the latest comments. Could you please take a look?

@Fokko
Copy link
Contributor

Fokko commented May 23, 2024

@hililiwei I'm sorry, this also fell off my radar.

@Fokko Fokko mentioned this pull request May 23, 2024
39 tasks
Returns:
this for method chaining
"""
return self.update(to_snapshot_id=to_snapshot_id)
Copy link
Contributor

@chinmay-bhat chinmay-bhat May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for the contribution!
I think in this implementation, to_snapshot_id doesn't have a default value if to_snapshot_id is not provided. Can you add the default value as mentioned in the docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the parameter "to_snapshot_id" is not set, we assume that the user intends to get the latest data in the table, so I will fetch the latest snapshot_id of the table.

        if self.to_snapshot_id is None:
            current_snapshot = self.table_metadata.current_snapshot()
            if current_snapshot is None:
                raise ValueError("End snapshot is not set and table has no current snapshot")
            self.to_snapshot_id = current_snapshot.snapshot_id

Currently, this is being done in the plan_files(), but we can also move it forward to __init__

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think keeping it in plan_files() should be fine! As to_snapshot_id and from_snapshot_id_exclusive are Optional[int], do we want to handle the case when the ids are None?

For ex:

def to_snapshot(self: S, to_snapshot_id: Optional[int]) -> self:
    if to_snapshot_id is None:
        return self
    return self.update(to_snapshot_id=to_snapshot_id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Returns:
this for method chaining
"""
return self.update(from_snapshot_id_exclusive=from_snapshot_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also, from_snapshot_id also doesn't have a default value, as mentioned in the docstring. Should we add it here or is it being set somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the tool method ancestors_between, it will handle the situation where from_snapshot_id is empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks

@chinmay-bhat chinmay-bhat mentioned this pull request May 30, 2024
1 task
@abstractmethod
def _do_plan_files(self, from_snapshot_id_exclusive: int, to_snapshot_id: int) -> Iterable[FileScanTask]: ...

def plan_files(self) -> Iterable[FileScanTask]:
Copy link
Contributor

@chinmay-bhat chinmay-bhat Jun 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for plan_files() to return Iterable[ScanTask]?
As this class is the base for incremental scans, this change will allow returning iterable of child classes of ScanTask including FileScanTask.
For example, ChangelogScanTask(ScanTask), which will be introduced in incremental changelog scan PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@glesperance
Copy link

I managed to get a poor mans append-scan with this #240 (comment)

Looking at this PR wouldn't it be simpler to implement append-scan in the api by adding a append_scan method to Table, then refactoring plan_files to take an optional snapshot_id, and providing a lightweight AppendScan class that makes 2 calls to plan_files and then compares?

In my case there was no need for touching __eq__ or __hash__

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants