data integrity failure on truncated stream #106
Comments
One way to prevent this situation from happening would be immediately testing the file using
There's no record of the decompressed size anywhere; in fact, there's no way to even know it beforehand if the input comes from a pipe or /proc.
Moved to the tracking issue #145.
Funnily enough, it took me a while to understand this issue, since I haven't been a long-time Linux user, at least not in my programming era. Being sleepy may have contributed. At first I was briefly confused, because just the other day I had been extensively testing manually created, malformed bz3 files (before fuzzing). But this is about truncated valid data, not invalid data. Anyway, for anyone else curious, here's the issue explained in simple terms.
However, a crash (power loss, machine crash, process crash) could lead to less than the full file being written out. Because BZip3 writes files out block by block, an incomplete compression operation would still produce a valid file, albeit truncated. The issue proposes adding a special value to signify the end of the file. In any case, I've not worked with pipes before, but I believe you generally determine EOF by checking

For the purpose of consistency, would it not be better to unify the frame format and the file format in a future version of the bzip3 format? In the spec we could define a block count of 0 as invalid. The CLI process could write a block count of 0 initially, count the blocks as they're processed, then insert the correct count once EOF has been reached.
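A minimal C sketch of that "write a placeholder count, patch it at the end" idea, assuming a hypothetical header layout with a 32-bit little-endian block count right after the magic (not the actual bzip3 on-disk format). Note that the final fixup only works when the output is seekable:

```c
/* Sketch of the proposed "placeholder block count, patched at EOF" scheme.
 * The header layout (magic + 32-bit LE block count at offset 5) is
 * hypothetical, not the real bzip3 format. */
#include <stdint.h>
#include <stdio.h>

#define COUNT_OFFSET 5  /* hypothetical: right after a 5-byte magic */

static void write_u32_le(FILE *f, uint32_t v) {
    uint8_t b[4] = { v & 0xff, (v >> 8) & 0xff, (v >> 16) & 0xff, (v >> 24) & 0xff };
    fwrite(b, 1, 4, f);
}

int write_archive(FILE *out /*, ... input source ... */) {
    fwrite("BZ3v1", 1, 5, out);
    write_u32_le(out, 0);            /* placeholder: 0 = "invalid / unfinished" */

    uint32_t blocks = 0;
    /* ... compress and write each block here, incrementing `blocks` ... */

    /* Patch the real count in once all input has been consumed.
     * This requires a seekable output stream. */
    if (fseek(out, COUNT_OFFSET, SEEK_SET) != 0)
        return -1;                   /* e.g. the output is a pipe */
    write_u32_le(out, blocks);
    return fflush(out) == 0 ? 0 : -1;
}
```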
Doesn't work if the output is a pipe.
Ah, bummer. Yeah, that does make sense since that'd be a streamed, non-seekable output. I forgot about that for a sec. In this case, a terminator makes sense. That could maybe make the format more interoperable. Someone could make a file using the CLI, and another person could read it using the frame API (using bz3 as a library). And vice versa.
Yeah.
Unlike other Unix compressors, bzip3 fails to notice data truncation if the compressed stream ends at a block boundary. There's no way to distinguish such a truncation from a normal end of stream, leading to silent data loss.
Furthermore, while compressed block boundaries are "random" by length, the timing pattern makes it very likely that such truncations happen naturally, with no malice involved. The library writes a series of blocks, takes a long while processing a new series, and only then resumes output. Thus, any mishap (a crash, power loss, a network failure, a pendrive being ejected, a backup snapshot, OOM, a timeout, etc.) will very likely make the file appear to be correctly terminated. This is compounded by the tool forcing a flush at every block boundary; that's normally beneficial due to cache locality, but here a block tail left sitting in stdio buffers would at least have made the error noisy.
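A deliberately simplified reader loop (hypothetical per-block header of two 32-bit sizes, not the actual bzip3 code) shows why a truncation that lands exactly on a block boundary looks like a legitimate end of stream:

```c
/* Simplified, hypothetical reader loop -- not the real bzip3 decoder. Each
 * block is assumed to be prefixed by two 32-bit sizes (compressed, original). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int read_blocks(FILE *in) {
    uint32_t sizes[2];
    for (;;) {
        size_t n = fread(sizes, sizeof(uint32_t), 2, in);
        if (n == 0 && feof(in))
            return 0;   /* "clean" EOF -- but a file truncated exactly here
                           takes this same path, so the loss goes unnoticed */
        if (n != 2)
            return -1;  /* truncation inside the header: detected */

        uint8_t *buf = malloc(sizes[0]);
        if (!buf || fread(buf, 1, sizes[0], in) != sizes[0]) {
            free(buf);
            return -1;  /* truncation inside the payload: detected */
        }
        /* ... decompress and CRC-check the block here ... */
        free(buf);
    }
}
```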
Alas, while it'd be easy to add such a marker (a block header with length=0 or a magic value >511MB), any such change would break bytestream compatibility, and thus compatibility with the current version of the library.
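Purely as a sketch of that marker idea (not a proposal for the actual on-disk format), a "compressed size" value that no real block can have, since blocks are capped at 511 MiB, could act as an explicit terminator. This reuses the hypothetical block-header layout from the reader sketch above:

```c
/* Hypothetical end-of-stream marker: a block header whose "compressed size"
 * field holds a value larger than any legal block, so it can never be
 * confused with real data. Not the actual bzip3 format. */
#include <stdint.h>
#include <stdio.h>

#define EOS_MARKER 0xFFFFFFFFu   /* far above the 511 MiB block-size limit */

/* Writer side: emit the marker once, after the last real block. */
static void write_end_marker(FILE *out) {
    uint32_t header[2] = { EOS_MARKER, 0 };
    fwrite(header, sizeof(uint32_t), 2, out);
}

/* Reader side: with the marker in place, the loop from the earlier sketch can
 * treat plain EOF as an error; only an explicit marker means a clean end. */
static int is_end_of_stream(const uint32_t sizes[2]) {
    return sizes[0] == EOS_MARKER;
}
```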