-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refine: refine interface of ManifestWriter #738
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this pr!
crates/iceberg/src/spec/manifest.rs
Outdated
/// Write a manifest. | ||
pub async fn write(mut self, manifest: Manifest) -> Result<ManifestFile> { | ||
/// Add a new manifest entry. | ||
pub fn add(&mut self, mut entry: ManifestEntry) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of weird of manipulating arguments, how about make the arguments DataFile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applies to other apis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason here use ManifesEntry is that in some case we will add entry from other Manifest. In this case, there are some info we need from original ManifsetEntry. E.g. when we add the delete manifest entry, we change the snapshot id and keep the original sequence number.
/// Add a delete manifest entry.
pub fn delete(&mut self, mut entry: ManifestEntry) -> Result<()> {
entry.status = ManifestStatus::Deleted;
entry.snapshot_id = Some(self.snapshot_id);
self.add_entry(entry)?;
Ok(())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced. If we ask user to provide ManifestEntry
, it would be confusing to user which part will be used and which part not. I think the style in java would be more clear from a user's view. If we to use ManifestEntry
approach, we must have clear documentation about the behavior of each part, e.g. which is ignored, which is reserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we to use ManifestEntry approach, we must have clear documentation about the behavior of each part, e.g. which is ignored, which is reserved.
I agree with this. Then I think these functions can be pub(crate)
to ensure public users will not use it. I think for now there is no demand that user need to use this API.🤔
crates/iceberg/src/spec/manifest.rs
Outdated
} | ||
|
||
/// Write manifest file and return it. | ||
pub async fn to_manifest_file(mut self, metadata: ManifestMetadata) -> Result<ManifestFile> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have concerns with this api, since it's error prone. According to iceberg's spec, each manifest file should contains one type of data file: data or deletes. It's quite possible that the user pass different kinds entries in previouse method, then the metadata is different. My suggestion is to follow java/python's approach:
- A factory method like
pub fn new_v1_writer(...) {}
pub fn new_v2_writer(...) {}
pub fn new_v2_delete_writer(...) {}
- We could use things like trait or enum to abstract out common parts of different writers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use things like trait or enum to abstract out common parts of different writers.
Difference between v1, v2, delete is:
- the metadata of avro file
- avro schema
- content type
- check in
add_entry
to make sureentry.content_type == writer.content_type
- serialize the ManifestEntry
I think both differences except for serializing the ManifestEntry can be implemented by storing different data in the writer when we create the writer using the factory method. So do we really need to abstract out common parts of different writers now?🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine without trait/enum, the focus is factory methods to ensure api safety.
76ae53d
to
491d60f
Compare
491d60f
to
b39d7db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this pr, generally LGTM! I still have concerns with the add/eixsting/delete api, and prefer the approach used in java api: org.apache.iceberg.ManifestWriter#add(F), which provides better api safety. For what I mentioned in comments, it's possible to add some check, but it's not a good api for user which throws error at runtime.
crates/iceberg/src/spec/manifest.rs
Outdated
/// Create a new builder. | ||
pub fn new( | ||
output: OutputFile, | ||
snapshot_id: i64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which value we should assign in https://github.com/apache/iceberg-rust/blob/b39d7db8e30400e9bd77a82ecc85a497327f47b8/crates/iceberg/src/spec/manifest.rs#L484C13-L484C26 if it's none. It's a required field. https://iceberg.apache.org/spec/#manifest-lists:~:text=503%20added_snapshot_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use some special value like -1 here. The reason it's optional is that when we append a data file here, the snapshot id is unknown. Actual snapshot id is determined when do commit, which may fail and retry.
crates/iceberg/src/spec/manifest.rs
Outdated
|
||
/// Add an existing manifest entry. This method will update following status of the entry: | ||
/// - Update the entry status to `Existing` | ||
pub fn existing(&mut self, mut entry: ManifestEntry) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This incorrect, an existing entry requires user to provide snapshot id, data sequence number, which are all optional in ManifestEntry.
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems iceberg-java also has ian nterface like here: https://github.com/apache/iceberg/blob/d96901b843395fe669f6bd4f618f8e5e46c0eed4/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L157. And looks like it also support the case the existing entry without snapshot id.🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are different cases, one is public api, another is package private. They have different callers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I got you. I think the interface here is package private. I forget to mark them as pub(crate)
to avoid confusion.
crates/iceberg/src/spec/manifest.rs
Outdated
/// Add a delete manifest entry. This method will update following status of the entry: | ||
/// - Update the entry status to `Deleted` | ||
/// - Set the snapshot id to the current snapshot id | ||
pub fn delete(&mut self, mut entry: ManifestEntry) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also incorrect. The sequence number must be provided.
Hi @liurenjie1024, seems iceberg-java also provide the interface for entry in https://github.com/apache/iceberg/blob/d96901b843395fe669f6bd4f618f8e5e46c0eed4/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L157, and use them in ManifsetMergeManager: https://github.com/apache/iceberg/blob/d96901b843395fe669f6bd4f618f8e5e46c0eed4/core/src/main/java/org/apache/iceberg/ManifestMergeManager.java#L188. And it also don't check some case like existing entry with null data sequence number. Is the case that iceberg-java missed or it's acceptable. |
d4a569c
to
b701bd9
Compare
1. adopt factory method to build different type manifest writer 2. provide add, exist, delete method
2. mark add entry method as crate private
b701bd9
to
97c0369
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this great pr! Just left some minor suggestions, and we are close !
crates/iceberg/src/spec/manifest.rs
Outdated
data_file: DataFile, | ||
snapshot_id: i64, | ||
sequence_number: i64, | ||
file_sequence_number: i64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be optional, file_sequence_number
could be inherited from snapshot.
crates/iceberg/src/spec/manifest.rs
Outdated
&mut self, | ||
data_file: DataFile, | ||
sequence_number: i64, | ||
file_sequence_number: i64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be optional, file_sequence_number could be inherited from snapshot.
crates/iceberg/src/spec/manifest.rs
Outdated
@@ -41,6 +41,9 @@ use crate::io::OutputFile; | |||
use crate::spec::PartitionField; | |||
use crate::{Error, ErrorKind}; | |||
|
|||
/// Placeholder for snapshot ID. The field with this value must be replaced with the actual snapshot ID before it is committed. | |||
pub const UNASSIGNED_SNAPSHOT_ID: i64 = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move this to snapshot
module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ZENOTME for working on this. The design mostly look good to me. Only have some question about the API naming.
crates/iceberg/src/spec/manifest.rs
Outdated
/// # TODO | ||
/// Remove this allow later | ||
#[allow(dead_code)] | ||
pub(crate) fn existing(&mut self, mut entry: ManifestEntry) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, the API naming seems a bit unclear. If we are "adding an existing manifest entry," how about naming this API add_existing_entry
?
crates/iceberg/src/spec/manifest.rs
Outdated
/// # TODO | ||
/// Remove this allow later | ||
#[allow(dead_code)] | ||
pub(crate) fn delete(&mut self, mut entry: ManifestEntry) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe delete_entry
?
crates/iceberg/src/spec/manifest.rs
Outdated
/// - Set the snapshot id to the current snapshot id | ||
/// - Set the sequence number to `None` if it is invalid(smaller than 0) | ||
/// - Set the file sequence number to `None` | ||
pub(crate) fn add(&mut self, mut entry: ManifestEntry) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add_entry
?
crates/iceberg/src/spec/manifest.rs
Outdated
|
||
/// Add an existing manifest entry. The original data and file sequence numbers, snapshot ID, | ||
/// which were assigned at commit, must be preserved when adding an existing entry. | ||
pub fn existing_file( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about add_existing_file
?
crates/iceberg/src/spec/manifest.rs
Outdated
} | ||
|
||
/// Write manifest file and return it. | ||
pub async fn to_manifest_file(mut self) -> Result<ManifestFile> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to_manifest_file
sounds more like a conversion function, but it actually involves heavy I/O operations. How about using the name suggested in the comments: write_manifest_file
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this pr, LGTM!
Let's wait for a moment to see if @Xuanwo has other comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ZENOTME for working on this, let move!
This PR refine the write interface of ManifestWriter according to ManifestWriter from pyiceberg. It add 3 interface
add
,delete
,existing
which will rewrite some metadata of manifest entry, e.g. snapshot id, sequence number, file sequence number.These refined interfaces are benefit for MergeAppend.( I' m working on it now.