-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Fix RewriteTablePath Incremental Replication #12172
Core: Fix RewriteTablePath Incremental Replication #12172
Conversation
…ewrite_table_path_incremental_fix
…ewrite_table_path_incremental_fix
This approach assumes that all versions before |
@szehon-ho - Would love to get your feedback on these changes. |
I did some look on this, I think right now the only strict delta manifest-list/manifests/data files are rewritten if 1st snapshot: 4355148708777346214
So incremental copy after 1st snapshot, read of target table ends up with only the delta for all iceberg metadata related to snapshot |
So if I understand the problem correctly, today we filter the manifest based on snapshot when rewrites manifest list for incremental rewrite. snapshot based filtering shall only apply to copy plan but not change the content when rewrite the manifest list with the new path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change LGTM, some comments on unit tests
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
FYI @flyrain @amogh-jahagirdar I think we might want take this fix as part of 1.8 release if possible. |
…ewrite_table_path_incremental_fix
This looks good, it looks like a similar issue as #12006 but for metadata files. |
Thanks a lot @barronfuentes and also @dramaticlly for evaluating and reviewing it! |
This contribution attempts to resolve a bug with the RewriteTablePath action in Apache Iceberg. While the current implementation of RewriteTablePath supports the ability to replicate a portion of an Iceberg tables history, it does not support the ability to do so repeatedly while maintaining history of previous replicated portions in the target table.
For example, assume an Iceberg table has two versions: A, B. Immediately after the first version (A) was created in the table, the Rewrite Table Path action was used to replicate the table for disaster recovery. After the second version (B) was added to the source table, Rewrite Table Path was used once again to replicate (append) only the second version to the target table. In the current state, the target table will have all of the metadata and data files associated with both versions A & B, but only the records contained within the most recent replicated version (B) will be available to queries. This pull request aims to address this issue and provide the ability to incrementally replicate Iceberg tables in a manner which results in a target table with contents that are consistent with the source table.
The cause of the issue exists in how new Manifest List files are created. A filter which prevents rewriting Manifest Files out of scope for replication is also applied to the content of rewritten Manifest Lists. The exclusion of these historical references is the reason why queries can only access records contained in the latest replication and not the previous replications.