data integrity failure on truncated stream #106
Comments
One way to prevent this situation from happening would be immediately testing the file using
There's no record of the decompressed size anywhere; in fact, there's no way to even know it beforehand if the input comes from a pipe or /proc.
Moved to the tracking issue #145.
Funnily enough, it took me a while to understand this issue, since I haven't been a long-time Linux user, at least not in my programming era. Being sleepy may have contributed. At first I was briefly confused, because just the other day I had been extensively testing manually created, malformed bz3 files (before fuzzing). But this is about truncated valid data, not invalid data. Anyway, for anyone else curious, here's the issue explained in simple terms.
However, a crash (power loss, machine crash, process crash) could lead to less than the full file being written out. Because BZip3 writes files out block by block, an incomplete compression operation would still produce a valid file, albeit truncated. The issue proposes adding a special value to signify the end of the file. In any case, I've not worked with pipes before, but I believe you generally determine EOF by checking

For the purpose of consistency, would it not be better to unify the frame format and the file format in a future version of the bzip3 format? In the spec we could define a block count of 0 as invalid. The CLI process could write a block count of 0 initially, count the blocks as they're processed, then insert the correct count once EOF has been reached.
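A minimal C sketch of that "write a placeholder count, patch it at the end" idea, assuming a hypothetical header layout with a 32-bit little-endian block count right after the magic (not the actual bzip3 on-disk format). Note that the final fixup only works when the output is seekable:

```c
/* Sketch of the proposed "placeholder block count, patched at EOF" scheme.
 * The header layout (magic + 32-bit LE block count at offset 5) is
 * hypothetical, not the real bzip3 format. */
#include <stdint.h>
#include <stdio.h>

#define COUNT_OFFSET 5  /* hypothetical: right after a 5-byte magic */

static void write_u32_le(FILE *f, uint32_t v) {
    uint8_t b[4] = { v & 0xff, (v >> 8) & 0xff, (v >> 16) & 0xff, (v >> 24) & 0xff };
    fwrite(b, 1, 4, f);
}

int write_archive(FILE *out /*, ... input source ... */) {
    fwrite("BZ3v1", 1, 5, out);
    write_u32_le(out, 0);            /* placeholder: 0 = "invalid / unfinished" */

    uint32_t blocks = 0;
    /* ... compress and write each block here, incrementing `blocks` ... */

    /* Patch the real count in once all input has been consumed.
     * This requires a seekable output stream. */
    if (fseek(out, COUNT_OFFSET, SEEK_SET) != 0)
        return -1;                   /* e.g. the output is a pipe */
    write_u32_le(out, blocks);
    return fflush(out) == 0 ? 0 : -1;
}
```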
Doesn't work if the output is a pipe.
Ah, bummer. Yeah, that does make sense since that'd be a streamed, non-seekable output. I forgot about that for a sec. In this case, a terminator makes sense. That could maybe make the format more interoperable. Someone could make a file using the CLI, and another person could read it using the frame API (using bz3 as a library). And vice versa.
Yeah.
Unlike other Unix compressors, bzip3 fails to notice data truncation if the compressed stream ends at a block boundary. There's no way to distinguish such a truncation from a normal end of stream, leading to silent data loss.
Furthermore, while compressed block boundaries are "random" by length, the timing pattern makes it very likely that such truncations happen naturally, with no malice involved. The library writes a series of blocks, takes a long while processing a new series, and only then resumes output. Thus, any mishap (a crash, power loss, a network failure, a pendrive being ejected, a backup snapshot, OOM, a timeout, etc.) will very likely make the file appear to be correctly terminated. This is compounded by the tool forcing a flush at every block boundary; that's normally beneficial due to cache locality, but here a block tail left sitting in stdio buffers would at least have made the error noisy.
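A deliberately simplified reader loop (hypothetical per-block header of two 32-bit sizes, not the actual bzip3 code) shows why a truncation that lands exactly on a block boundary looks like a legitimate end of stream:

```c
/* Simplified, hypothetical reader loop -- not the real bzip3 decoder. Each
 * block is assumed to be prefixed by two 32-bit sizes (compressed, original). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int read_blocks(FILE *in) {
    uint32_t sizes[2];
    for (;;) {
        size_t n = fread(sizes, sizeof(uint32_t), 2, in);
        if (n == 0 && feof(in))
            return 0;   /* "clean" EOF -- but a file truncated exactly here
                           takes this same path, so the loss goes unnoticed */
        if (n != 2)
            return -1;  /* truncation inside the header: detected */

        uint8_t *buf = malloc(sizes[0]);
        if (!buf || fread(buf, 1, sizes[0], in) != sizes[0]) {
            free(buf);
            return -1;  /* truncation inside the payload: detected */
        }
        /* ... decompress and CRC-check the block here ... */
        free(buf);
    }
}
```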
Alas, while it'd be easy to add such a marker (a block header with length=0 or a magic value >511MB), any such change would break bytestream compatibility, and thus compatibility with the current version of the library.
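Purely as a sketch of that marker idea (not a proposal for the actual on-disk format), a "compressed size" value that no real block can have, since blocks are capped at 511 MiB, could act as an explicit terminator. This reuses the hypothetical block-header layout from the reader sketch above:

```c
/* Hypothetical end-of-stream marker: a block header whose "compressed size"
 * field holds a value larger than any legal block, so it can never be
 * confused with real data. Not the actual bzip3 format. */
#include <stdint.h>
#include <stdio.h>

#define EOS_MARKER 0xFFFFFFFFu   /* far above the 511 MiB block-size limit */

/* Writer side: emit the marker once, after the last real block. */
static void write_end_marker(FILE *out) {
    uint32_t header[2] = { EOS_MARKER, 0 };
    fwrite(header, sizeof(uint32_t), 2, out);
}

/* Reader side: with the marker in place, the loop from the earlier sketch can
 * treat plain EOF as an error; only an explicit marker means a clean end. */
static int is_end_of_stream(const uint32_t sizes[2]) {
    return sizes[0] == EOS_MARKER;
}
```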