Support Snapshot Expiration Operation #516

Open
sungwy opened this issue Mar 11, 2024 · 12 comments
@sungwy
Collaborator

sungwy commented Mar 11, 2024

Feature Request / Improvement

Support Maintenance operations on PyIceberg: https://iceberg.apache.org/docs/1.4.0/maintenance

All operations except data file compaction are metadata-only or file system operations, so supporting them in PyIceberg may be a small lift.

@Fokko
Contributor

Fokko commented Mar 12, 2024

Metadata rewrites are already discussed in #270

I think snapshot expiration is one that we should also focus on. I think #511 is a prerequisite to ensure we have a structured way of listing all the metadata so we know which files we can remove. This would then just be operations on a PyArrow Table 🎉
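As a rough illustration of what "operations on a table of metadata" could look like, here is a minimal sketch in plain Python. The row layout (`snapshot_id`, `committed_at`), the `protected_ids` set, and the helper name `select_expired` are assumptions made for illustration and are not PyIceberg API; a real implementation would read these rows from the table metadata and would also honor retention settings, which this sketch ignores.

```python
from datetime import datetime, timedelta, timezone


def select_expired(snapshots, protected_ids, older_than):
    """Return IDs of snapshots committed before `older_than` that no ref protects."""
    return [
        s["snapshot_id"]
        for s in snapshots
        if s["committed_at"] < older_than and s["snapshot_id"] not in protected_ids
    ]


# Hypothetical snapshot rows; in practice these would come from table metadata.
now = datetime(2024, 3, 12, tzinfo=timezone.utc)
snapshots = [
    {"snapshot_id": 1, "committed_at": now - timedelta(days=10)},
    {"snapshot_id": 2, "committed_at": now - timedelta(days=6)},
    {"snapshot_id": 3, "committed_at": now - timedelta(hours=1)},
]

# Assume snapshot 3 is the head of the main branch and snapshot 2 is tagged.
protected_ids = {2, 3}
cutoff = now - timedelta(days=5)

expired = select_expired(snapshots, protected_ids, cutoff)
```

Only the snapshots selected this way would be candidates for removal; deciding which files become unreachable afterwards is a separate step.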

@sungwy
Collaborator Author

sungwy commented Mar 12, 2024

Noted. Adjusting the Issue title as suggested 👍

@sungwy sungwy changed the title Support Maintenance Operations on PyIceberg Support Snapshot Expiration Operation Mar 12, 2024
@Gowthami03B Gowthami03B mentioned this issue Mar 14, 2024
8 tasks
@salexln

salexln commented May 7, 2024

@Fokko
Just to make sure: is there currently no support in PyIceberg for snapshot expiration?
If so, do you have any suggestions for how to remove old data? (I'm writing data using PyIceberg to AWS Glue.)

@ndrluis
Collaborator

ndrluis commented Aug 15, 2024

I started a discussion on the mailing list about deleting orphan files, and in the meantime I'm studying how snapshot expiration works.

@sungwy
Collaborator Author

sungwy commented Aug 15, 2024

Thanks @ndrluis - would you like me to assign this issue to you?

@ndrluis ndrluis self-assigned this Aug 15, 2024
@ndrluis
Collaborator

ndrluis commented Sep 10, 2024

Hello everyone, I need some help with this.

During the implementation process, I noticed that we lack some features that exist in the Java implementation. The first one I want to discuss is the TableOperations class. In the RemoveSnapshots class, the io, current, refresh, and commit methods are used.

Currently, I can use the io from our Table implementation, and I can get the same behavior through Table to get and refresh the TableMetadata. However, I haven't found an implementation to commit TableMetadata.

The problem is that we require an implementation for each catalog. Currently, the Java implementation uses the TableOperations interface for the REST catalog, but the other catalogs (Snowflake / JDBC / Glue) use a different interface with the same methods.

My question is: how should we handle the commit process? I don't have a solid opinion on how we can solve this.

Just to clarify, this commit method is only for updating the table metadata.

cc/ @Fokko @HonahX @kevinjqliu @sungwy

@ndrluis
Collaborator

ndrluis commented Sep 10, 2024

After reading my question again, I'm not sure if I was clear enough. I want to discuss the design of the implementation in more detail.

I'm currently creating a class (ExpireSnapshot) to handle the expiration process (similar to RemoveSnapshots in Java) and also classes that will manage file deletion based on the updated metadata (ReachableFileCleanup and IncrementalFileCleanup). I believe this design makes sense, but my question is about where the TableMetadata commit should live. Does it make more sense to create a TableOperations class (with an implementation for each catalog), or should we add a method to each catalog that handles the commit process?
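To make the cleanup question concrete, the reachability-based approach can be sketched as a set difference: files referenced by the metadata before expiration, minus files still referenced afterwards. This is only a toy model in plain Python. The `Snapshot` dataclass and the `reachable_files` and `files_to_delete` helpers are hypothetical names, and real snapshots reference manifest lists and manifests rather than flat sets of file paths.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Toy snapshot: an ID plus the set of file paths it references."""
    snapshot_id: int
    files: frozenset = field(default_factory=frozenset)


def reachable_files(snapshots):
    """Union of all files referenced by any of the given snapshots."""
    reachable = set()
    for snapshot in snapshots:
        reachable |= snapshot.files
    return reachable


def files_to_delete(before, after):
    """Files reachable before expiration that are no longer reachable after it."""
    return reachable_files(before) - reachable_files(after)


s1 = Snapshot(1, frozenset({"a.parquet", "b.parquet"}))
s2 = Snapshot(2, frozenset({"b.parquet", "c.parquet"}))

# Expiring snapshot 1 leaves only snapshot 2 in the updated metadata.
deletable = files_to_delete(before=[s1, s2], after=[s2])
```

Here `b.parquet` survives because snapshot 2 still references it, which is why the deletion has to be computed from the committed, updated metadata rather than from the expired snapshot alone.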

@kevinjqliu
Contributor

> how should we handle the commit process?

The current way to commit is through the public commit_table method, which is implemented for each catalog (e.g., the SQL catalog). It takes in requirements and updates, similar to the RESTTableOperations::commit function.
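For reference, a requirements-plus-updates commit for snapshot removal could be shaped roughly like the dictionary below. The `remove-snapshots` update and the `assert-table-uuid` and `assert-ref-snapshot-id` requirements follow the Iceberg REST catalog specification as I understand it; the helper name `build_expire_commit` and the sample values are made up for illustration, and this is not the signature of `commit_table`.

```python
import json


def build_expire_commit(table_uuid, main_snapshot_id, expired_ids):
    """Build a commit body: requirements guard the commit, updates describe the change."""
    return {
        "requirements": [
            # Fail the commit if the table was replaced by a different one.
            {"type": "assert-table-uuid", "uuid": table_uuid},
            # Fail the commit if the main branch moved since we planned the expiration.
            {
                "type": "assert-ref-snapshot-id",
                "ref": "main",
                "snapshot-id": main_snapshot_id,
            },
        ],
        "updates": [
            {"action": "remove-snapshots", "snapshot-ids": sorted(expired_ids)},
        ],
    }


# Sample values are placeholders, not taken from a real table.
payload = build_expire_commit(
    table_uuid="00000000-0000-0000-0000-000000000000",
    main_snapshot_id=3,
    expired_ids={2, 1},
)
print(json.dumps(payload, indent=2))
```

The requirements act as optimistic-concurrency checks, so a concurrent writer that advances the branch causes the expiration commit to be rejected instead of silently overwriting newer metadata.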

> I want to discuss the design of the implementation in more detail.

Do you mind starting a doc on what you've found so far? It would be helpful to figure out what is needed for snapshot expiration before diving into how to implement it.

@ndrluis
Collaborator

ndrluis commented Sep 12, 2024

@kevinjqliu Thank you, I will start writing a document next week detailing what I found and explaining the differences in the commit operation and what we don't have in Python.

@ndrluis
Collaborator

ndrluis commented Sep 19, 2024

@kevinjqliu I believe that I now understand the differences in how we perform TableMetadata updates in Python versus how it's done in Java. I think that the set of classes describing the changes, along with the use of single dispatch, makes it a bit harder to understand compared to the Java builder strategy. I will continue working on the document to discuss other points.
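For readers unfamiliar with the pattern, the "update classes plus single dispatch" style can be sketched with `functools.singledispatch` from the standard library. Everything below is a simplified stand-in: `Metadata`, `RemoveSnapshots`, `SetProperty`, `apply_update`, and `apply_all` are hypothetical names chosen for this example and do not mirror PyIceberg's actual classes.

```python
from dataclasses import dataclass, replace
from functools import singledispatch


@dataclass(frozen=True)
class Metadata:
    """Toy immutable table metadata."""
    snapshot_ids: tuple
    properties: tuple = ()


@dataclass(frozen=True)
class RemoveSnapshots:
    snapshot_ids: tuple


@dataclass(frozen=True)
class SetProperty:
    key: str
    value: str


@singledispatch
def apply_update(update, metadata):
    """Fallback for update types that have no registered handler."""
    raise NotImplementedError(f"Unsupported update: {type(update).__name__}")


@apply_update.register
def _(update: RemoveSnapshots, metadata: Metadata) -> Metadata:
    # Dispatch is on the type of the first argument (the update).
    kept = tuple(i for i in metadata.snapshot_ids if i not in update.snapshot_ids)
    return replace(metadata, snapshot_ids=kept)


@apply_update.register
def _(update: SetProperty, metadata: Metadata) -> Metadata:
    return replace(metadata, properties=metadata.properties + ((update.key, update.value),))


def apply_all(metadata, updates):
    """Fold a sequence of updates over the metadata, returning the new version."""
    for update in updates:
        metadata = apply_update(update, metadata)
    return metadata


result = apply_all(
    Metadata(snapshot_ids=(1, 2, 3)),
    [RemoveSnapshots((1,)), SetProperty("owner", "example")],
)
```

Compared with a builder, the change is described as data first and applied later, which makes the flow less direct to follow but keeps each update serializable and independent of any one catalog.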

@pp-gborodin

Hey @ndrluis, what's the status here? Do you have the document ready to be shared?

@ndrluis
Collaborator

ndrluis commented Jan 22, 2025

Hello @pp-gborodin. I have started the development by adding the necessary components before implementing the snapshot expiration operation (like #1285). This week, I will develop the partition statistics, and then I will begin implementing the snapshot expiration.
