feat: add RollingManifestWriter
#650
I believe in the Java implementation we have a concept of a PositionOutputStream which is used to keep track of bytes written to each file with a position/counter. What we can do here is extend the current For instance, the ManifestWriter can use the
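To make the position/counter idea concrete, here is a minimal Python sketch. It is only an illustration: `PositionTrackingStream` is a hypothetical name, not a class in pyiceberg or in the Java implementation, and it assumes the wrapped object exposes a standard binary `write()` method.

```python
import io
from typing import BinaryIO


class PositionTrackingStream:
    """Hypothetical wrapper (not a pyiceberg class) that counts bytes written."""

    def __init__(self, inner: BinaryIO) -> None:
        self._inner = inner
        self._position = 0  # total bytes passed through write() so far

    def write(self, data: bytes) -> int:
        written = self._inner.write(data)
        # Some stream implementations return None; fall back to len(data).
        count = len(data) if written is None else written
        self._position += count
        return count

    def tell(self) -> int:
        """Return the number of bytes written through this wrapper."""
        return self._position


# Usage: wrap any write-only binary stream and ask the wrapper for the size.
buffer = io.BytesIO()
stream = PositionTrackingStream(buffer)
stream.write(b"abc")
stream.write(b"de")
```

Because the wrapper keeps its own counter, the caller never has to read from or seek in the underlying stream, which is what makes this approach usable with write-only outputs.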
Sounds good, I will have a look at the implementation and make a suggestion. Thank you!
Hi, I finally had some time to continue working on this. Based on your suggestions @geruh I added a I initially tried to extend If we wanted to go with What do you think?
pyiceberg/manifest.py
Outdated
        self._current_file_rows = 0

    def to_manifest_files(self) -> list[ManifestFile]:
        self._close_current_writer()
I like the same pattern as in Java, where the to_manifest_files call expects the writer to be closed.
Changed it to raise a RuntimeError if the writer is not closed, similar to how trying to add an entry to a closed writer raises a RuntimeError.
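For reference, the closed-state checks described above could look roughly like this. This is a simplified sketch, not the code from this PR: `ClosableWriter`, `add_entry`, and the error messages are hypothetical stand-ins, and entries are plain strings instead of manifest entries.

```python
from typing import List


class ClosableWriter:
    """Hypothetical stand-in for a writer that enforces open/closed state."""

    def __init__(self) -> None:
        self.closed = False
        self._entries: List[str] = []

    def add_entry(self, entry: str) -> None:
        # Writing after close is a programming error.
        if self.closed:
            raise RuntimeError("Cannot add entry to closed writer")
        self._entries.append(entry)

    def close(self) -> None:
        self.closed = True

    def to_manifest_files(self) -> List[str]:
        # Mirror image of add_entry: results are only available once closed.
        if not self.closed:
            raise RuntimeError("Cannot get results from a writer that is still open")
        return list(self._entries)
```

Both checks raise the same exception type, so callers only need to handle one kind of misuse error.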
pyiceberg/manifest.py
Outdated
        traceback: Optional[TracebackType],
    ) -> None:
        self.closed = True
        if self._current_writer:
Why not re-use _close_current_writer here?
Good point! I changed it to use _close_current_writer.
@felixscherz Sorry for the late reply here. It looks like the formatting is a bit off, could you check that one?
@Fokko Thanks for taking a look! Sorry about the formatting, it should be fixed now :)
…th `__len__` method
Hi, this relates to #596 and is still WIP.
The RollingManifestWriter implementation closely follows the Java implementation. It takes in a generator that produces ManifestWriter objects and rolls over to a new one once either the number of rows appended or the file size in bytes exceeds the target value.
It's not finished yet; I am still trying to find a good way to access the current file size from the underlying writer. I tried to obtain that information from the ManifestWriter._writer.output_stream object, but that is write-only.
Any pointers on how to access the current file size of the manifest writer would help me a lot :)
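The rollover behaviour described above can be sketched as follows. This is a simplified illustration rather than the code in this PR: `CountingWriter`, `RollingWriter`, and `writer_supplier` are hypothetical names, records are raw bytes, and the sketch assumes a rollover happens as soon as a target is reached (`>=`) and is checked before each new record is added.

```python
from typing import Iterator, List, Optional


class CountingWriter:
    """Hypothetical stand-in for a manifest writer; tracks rows and bytes."""

    def __init__(self) -> None:
        self.rows = 0
        self.size_in_bytes = 0

    def add(self, record: bytes) -> None:
        self.rows += 1
        self.size_in_bytes += len(record)


class RollingWriter:
    """Sketch: pulls a fresh writer from a generator once a target is reached."""

    def __init__(self, supplier: Iterator[CountingWriter], target_rows: int, target_bytes: int) -> None:
        self._supplier = supplier
        self._target_rows = target_rows
        self._target_bytes = target_bytes
        self._current: Optional[CountingWriter] = None
        self.finished: List[CountingWriter] = []

    def _should_roll(self) -> bool:
        w = self._current
        return w is not None and (
            w.rows >= self._target_rows or w.size_in_bytes >= self._target_bytes
        )

    def add(self, record: bytes) -> None:
        if self._should_roll():
            self.close_current()
        if self._current is None:
            # Lazily create the next writer only when there is data for it.
            self._current = next(self._supplier)
        self._current.add(record)

    def close_current(self) -> None:
        if self._current is not None:
            self.finished.append(self._current)
            self._current = None


def writer_supplier() -> Iterator[CountingWriter]:
    """Infinite generator of fresh writers (assumption for this sketch)."""
    while True:
        yield CountingWriter()


rolling = RollingWriter(writer_supplier(), target_rows=2, target_bytes=1000)
for _ in range(5):
    rolling.add(b"x")
rolling.close_current()
rows_per_writer = [w.rows for w in rolling.finished]  # [2, 2, 1]
```

The byte-size check here relies on the stand-in writer counting its own bytes; with a real writer, that number would have to come from something like a position-tracking output stream.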