Support Snapshot Expiration Operation #516
Comments
Metadata rewrites are already discussed in #270. I think snapshot expiration is one that we should also focus on. I think #511 is a prerequisite to ensure we have a structured way of listing all the metadata, so we know which files we can remove. This would then just be operations on a PyArrow Table 🎉
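As a rough illustration of the idea above: once the metadata can be listed in a structured way, finding removable files comes down to a set difference between the files reachable from expired snapshots and the files still reachable from retained ones. The sketch below uses plain Python dictionaries and sets instead of a PyArrow Table so that it stays dependency-free. The snapshot IDs, file paths, and the `files_safe_to_delete` helper are all hypothetical and not part of PyIceberg.

```python
from typing import Dict, Set


def files_safe_to_delete(
    files_by_snapshot: Dict[int, Set[str]],
    expired_ids: Set[int],
) -> Set[str]:
    """Return files that are referenced only by expired snapshots."""
    expired_files: Set[str] = set()
    retained_files: Set[str] = set()
    for snapshot_id, files in files_by_snapshot.items():
        if snapshot_id in expired_ids:
            expired_files |= files
        else:
            retained_files |= files
    # A file is removable only if no retained snapshot still reaches it.
    return expired_files - retained_files


# Hypothetical listing: snapshot ID -> files reachable from that snapshot.
listing = {
    1: {"data/a.parquet", "metadata/m1.avro"},
    2: {"data/a.parquet", "data/b.parquet", "metadata/m2.avro"},
    3: {"data/b.parquet", "data/c.parquet", "metadata/m3.avro"},
}

removable = files_safe_to_delete(listing, expired_ids={1, 2})
```

Here `data/b.parquet` is kept even though an expired snapshot references it, because snapshot 3 still reaches it. The same filter could presumably be expressed as column operations on a table of (snapshot, file) rows.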
Noted. Adjusting the issue title as suggested 👍
@Fokko
I started a discussion on the mailing list about deleting orphan files, and meanwhile I'm studying snapshot expiration.
Thanks @ndrluis - would you like me to assign this issue to you?
Hello everyone, I need some help with this. During the implementation process, I noticed that we lack some features that exist in the Java implementation. The first one I want to discuss is the TableOperations class. In the RemoveSnapshots class, the [...] Currently, I can use the [...] The problem is that we require an implementation for each catalog. Currently, the Java implementation uses the [...] My question is: how should we handle the commit process? I don't have a solid opinion on how we can solve this. Just to clarify, this commit method is only for updating the table metadata.
After reading my question again, I'm not sure if I was clear enough. I want to discuss the design of the implementation in more detail. I'm currently creating a class (ExpireSnapshot) to handle the expiration process (similar to RemoveSnapshots in Java) and also classes that will manage file deletion based on the updated metadata (ReachableFileCleanup and IncrementalFileCleanup). I believe this design makes sense, but my question is about where the TableMetadata commit should live. Does it make more sense to create a TableOperations class (with an implementation for each catalog), or should we add a method to each catalog that handles the commit process?
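To make the first of the two options concrete, here is a minimal sketch in which the expiration class depends only on a small commit interface, and each catalog would supply its own implementation of it. Everything here is a toy model: `TableMetadata` is reduced to a tuple of snapshot IDs, and the `TableOperations` protocol, `InMemoryTableOperations`, and the `ExpireSnapshot` methods are hypothetical names chosen for illustration, not PyIceberg's actual API.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Set, Tuple


@dataclass(frozen=True)
class TableMetadata:
    """Toy stand-in for table metadata: only the snapshot IDs."""
    snapshot_ids: Tuple[int, ...]


class TableOperations(Protocol):
    """Hypothetical interface: read current metadata and swap in new metadata."""

    def current(self) -> TableMetadata: ...

    def commit(self, base: TableMetadata, new: TableMetadata) -> None: ...


class InMemoryTableOperations:
    """Hypothetical per-catalog implementation backed by a plain dict."""

    def __init__(self, store: Dict[str, TableMetadata], name: str) -> None:
        self._store = store
        self._name = name

    def current(self) -> TableMetadata:
        return self._store[self._name]

    def commit(self, base: TableMetadata, new: TableMetadata) -> None:
        # Optimistic check: refuse to commit if the metadata changed underneath us.
        if self._store[self._name] != base:
            raise RuntimeError("table metadata changed concurrently")
        self._store[self._name] = new


class ExpireSnapshot:
    """Collects snapshot IDs to expire, then commits the updated metadata."""

    def __init__(self, ops: TableOperations) -> None:
        self._ops = ops
        self._to_expire: Set[int] = set()

    def expire_snapshot_id(self, snapshot_id: int) -> "ExpireSnapshot":
        self._to_expire.add(snapshot_id)
        return self

    def commit(self) -> TableMetadata:
        base = self._ops.current()
        kept = tuple(s for s in base.snapshot_ids if s not in self._to_expire)
        new = TableMetadata(snapshot_ids=kept)
        self._ops.commit(base, new)
        return new


store = {"db.events": TableMetadata(snapshot_ids=(1, 2, 3))}
ops = InMemoryTableOperations(store, "db.events")
result = ExpireSnapshot(ops).expire_snapshot_id(1).expire_snapshot_id(2).commit()
```

The second option would move `current` and `commit` onto each catalog class directly. `ExpireSnapshot` would look much the same in either case, since it only needs something that can read the base metadata and commit the new one.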
The current way to commit involves the public
Do you mind starting a doc on what you've found so far? It would be helpful to figure out what is needed for snapshot expiration before diving into how to implement it.
@kevinjqliu Thank you, I will start writing a document next week detailing what I found and explaining the differences in the commit operation and what we don't have in Python.
@kevinjqliu I believe that I now understand the differences in how we perform TableMetadata updates in Python versus how it's done in Java. I think that the set of classes describing the changes, along with the use of single dispatch, makes it a bit harder to understand compared to the Java builder strategy. I will continue working on the document to discuss other points.
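For readers unfamiliar with the two styles being compared, the following toy sketch contrasts them. In the single-dispatch style, each change is a small update class and a function registered per class applies it to the metadata. In the builder style, the changes are methods on one mutable builder. The class and function names (`Metadata`, `AddSnapshot`, `RemoveSnapshots`, `apply_update`, `MetadataBuilder`) are assumptions made up for this example and do not mirror the real PyIceberg or Java classes.

```python
from dataclasses import dataclass, replace
from functools import singledispatch
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Metadata:
    snapshot_ids: Tuple[int, ...] = ()


# --- Style 1: update classes applied through single dispatch ---

@dataclass(frozen=True)
class AddSnapshot:
    snapshot_id: int


@dataclass(frozen=True)
class RemoveSnapshots:
    snapshot_ids: Tuple[int, ...]


@singledispatch
def apply_update(update: object, metadata: Metadata) -> Metadata:
    # Fallback for update types without a registered handler.
    raise NotImplementedError(f"Unsupported update: {update!r}")


@apply_update.register
def _(update: AddSnapshot, metadata: Metadata) -> Metadata:
    return replace(metadata, snapshot_ids=metadata.snapshot_ids + (update.snapshot_id,))


@apply_update.register
def _(update: RemoveSnapshots, metadata: Metadata) -> Metadata:
    kept = tuple(s for s in metadata.snapshot_ids if s not in update.snapshot_ids)
    return replace(metadata, snapshot_ids=kept)


def apply_all(metadata: Metadata, updates: Iterable[object]) -> Metadata:
    for update in updates:
        metadata = apply_update(update, metadata)
    return metadata


# --- Style 2: a builder with one method per kind of change ---

class MetadataBuilder:
    def __init__(self, base: Metadata) -> None:
        self._ids: List[int] = list(base.snapshot_ids)

    def add_snapshot(self, snapshot_id: int) -> "MetadataBuilder":
        self._ids.append(snapshot_id)
        return self

    def remove_snapshots(self, snapshot_ids: Iterable[int]) -> "MetadataBuilder":
        to_remove = set(snapshot_ids)
        self._ids = [s for s in self._ids if s not in to_remove]
        return self

    def build(self) -> Metadata:
        return Metadata(snapshot_ids=tuple(self._ids))


updates = [AddSnapshot(1), AddSnapshot(2), AddSnapshot(3), RemoveSnapshots((1, 2))]
via_dispatch = apply_all(Metadata(), updates)
via_builder = (
    MetadataBuilder(Metadata())
    .add_snapshot(1)
    .add_snapshot(2)
    .add_snapshot(3)
    .remove_snapshots([1, 2])
    .build()
)
```

Both paths end with the same metadata. The difference is where the logic lives: the dispatch style spreads it across one handler per update class, which can be harder to follow at first, while the builder keeps it in one place.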
Hey @ndrluis, what's the status here? Do you have the document ready to be shared?
Hello @pp-gborodin. I have started the development by adding the necessary components before implementing the snapshot expiration operation (like #1285). This week, I will develop the partition statistics, and then I will begin implementing the snapshot expiration.
Feature Request / Improvement
Support maintenance operations in PyIceberg: https://iceberg.apache.org/docs/1.4.0/maintenance
All operations except data file compaction are metadata-only or file system operations, so supporting them in PyIceberg may be a small lift.