
Improve Fuzzing, Enforce Sufficient Buffer Size on Decompress, Calc Memory Usage #144

Merged
merged 28 commits into kspalaiologos:master from Sewer56:add-bz3-compress-mem
Dec 15, 2024

Conversation

@Sewer56 (Contributor) commented Dec 13, 2024

This is a culmination of a few small patches:

  • A small rewrite of fuzz.c.

    • It can now be used to prepare its own test data for simpler setup.
    • Persistent Mode for faster fuzzing.
    • Quick info on fuzzing in parallel.
  • Added a fuzzer for block decodes (fuzz-decode-block.c).

  • Added a fuzzer for round trips (fuzz-round-trip.c).

    • Its purpose is to make sure I don't break stuff :p
  • Added a bz3_memory_needed API, which reports the expected memory usage of a call to bz3_new and of the common compression APIs used with it afterwards.

    • Please confirm these are accurate.
  • Fixed an out-of-bounds read in lzp_decode_block (66cde0b). Caught by fuzzing with fuzz-decode-block.c.

  • New error type BZ3_ERR_DATA_SIZE_TOO_SMALL, returned when an insufficient buffer size is passed to bz3_decode_block (see the sketch after this list).

    • Introduces an API-breaking change by adding an extra parameter to bz3_decode_block.
    • However, this breaking change prevents a potential segfault on incorrect usage.
  • Added bounds checks to bz3_decode_block, to ensure we won't write outside the user-provided buffer.

  • Updated documentation around bz3_decode_block.
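
A minimal usage sketch of the changed call (the position of the new buffer-capacity parameter is an assumption based on the description above, so check libbz3.h in this PR for the exact signature; bz3_memory_needed is only mentioned in a comment since its signature isn't shown here):

#include <libbz3.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: decode one block with the post-PR API. buffer_capacity is the real
   size of 'buffer', so the library can reject undersized buffers with
   BZ3_ERR_DATA_SIZE_TOO_SMALL instead of writing out of bounds.
   (bz3_memory_needed, added in this PR, can additionally be used to budget
   allocations up front.) */
static int decode_one(struct bz3_state * state, uint8_t * buffer,
                      size_t buffer_capacity, int32_t compressed_size,
                      int32_t orig_size) {
    int32_t n = bz3_decode_block(state, buffer, buffer_capacity,
                                 compressed_size, orig_size);
    if (n == -1) {
        if (bz3_last_error(state) == BZ3_ERR_DATA_SIZE_TOO_SMALL)
            fprintf(stderr, "buffer too small for this block\n");
        else
            fprintf(stderr, "decode failed: %s\n", bz3_strerror(state));
        return -1;
    }
    return 0;
}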


[Check the edit history for info on earlier commits, if desired.]

@Sewer56 changed the title from "WIP: Improve Fuzzing" to "WIP: Improve Fuzzing, Use Less Mem on Decompress, Calc Memory Usage" (Dec 13, 2024)
@kspalaiologos (Owner)

Thank you for all the work you're contributing. I will review it as soon as I can.

@Sewer56 (Contributor, Author) commented Dec 13, 2024

Don't worry, it's all good; we're all busy people after all :p
I'm still poking around, have an idea or two.

@Sewer56 (Contributor, Author) commented Dec 13, 2024

Thinking about this some more.

I initially started fuzzing with bz3_decode_block_bound at orig_size + 256, as per the comment.

But as I've been reading through the code, some things came to mind.
When you get around to it, please let me know if I'm interpreting this right; some thoughts below.


What I originally experienced: rle_size after encode exceeded the original decompressed size of the data, which in turn led to a buffer overflow on the decompression buffer:

if (model & 2)
    size_src = lzp_size;
else if (model & 4)
    size_src = rle_size; // this is picked, but is bigger than orig_size
else
    size_src = orig_size;
    
// later on
// b2 is 'buffer'; at this point it is the stale swap buffer.
memset(b2, 0, size_src);

i.e. we applied RLE encoding, which led to a negative compression ratio in this step.

With 187b322, this is no longer possible for new files: lzp/rle output is not used if it exceeds the size of its input (the original data for rle, the original data or the rle output for lzp). The memset comes right after, so the original issue I encountered is not reproducible as of that commit.

And for the case of blocks under 64 bytes, the data would always shrink during decompression, by definition, since on encode we just append a crc32 and then store the data as-is.

This leaves the BWT transform and arithmetic coding; these together are applied regardless of whether they were successful or not.

The value of bz3_decode_block_bound should be the worst case possible size increase of BWT transform and AC, if I'm understanding this right (and lzp/rle for files made with earlier encoders). I don't know the exact worst case for these yet.


In any case, with that in mind; a potential improvement for a newer version of the format would be encoding whether bwt+arithmetic was successful as an additional bit in the model field.

With that, you would not only get faster decode on extremely poorly compressible data, but the buffer would no longer need to be extended beyond orig_size, except in the case where the source is larger than the destination.

In my case, I'd love to decompress directly into a buffer for a memory-mapped file; currently that's not possible, but with this potential future improvement it would be. Right now, an extra malloc and two memcpys (to the allocated buffer and back to the mmap pointer) are required.

@Sewer56 (Contributor, Author) commented Dec 13, 2024

The value of bz3_decode_block_bound should be the worst case possible size increase of BWT transform and AC, if I'm understanding this right (and lzp/rle for files made with earlier encoders). I don't know the exact worst case for these yet.

I've studied up a bit since then; I'm not a compression enthusiast, so I don't know everything in the world of compression off the top of my head.

Assuming libsais does a standard BWT permutation, there's no size overhead there; we just store a BWT permutation index in the block and reverse the transform using it. The last step, encode_bytes, looks like a BCM-style predictor/coder (based on arithmetic coding). I'm not sure about its worst case; that would need a lot more reading. But for the current encoder, bz3_decode_block_bound would be the worst possible outcome of that final BCM coding step.

Unfortunately, we also have to account for older files where rleSize or lzpSize could theoretically be larger than origSize.


On that note, in my specific use case, I think I could determine if decompressing a block would run over orig_size by examining its header.

struct BlockHeader {
    uint32 crc32; 
    uint32 bwtIndex;
    uint8  model;
    
    // Optional size fields based on compression flags
    if(model & 0x02)     
        uint32 lzpSize;   // Size after LZP compression
    if(model & 0x04)     
        uint32 rleSize;   // Size after RLE compression
}

Simply assert that:

  • lzpSize <= orig_size
  • rleSize <= orig_size

If both are true, then by definition the data fed into (BWT+BCM) at the encode stage could not exceed orig_size.
Thus, it should be possible to pass a buffer of orig_size bytes into bz3_decode_block safely under those conditions (see the sketch below).
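
As a rough sketch of that check (offsets follow the pseudocode layout above; purely illustrative, not a library API, and endianness handling is glossed over):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: decide whether a block can be decoded into a buffer of
   orig_size bytes by inspecting the optional size fields in its header,
   per the layout sketched above. */
static bool block_fits_orig_size(const uint8_t * block, size_t block_len,
                                 uint32_t orig_size) {
    if (block_len < 9) return false;         /* crc32 + bwtIndex + model */
    uint8_t model = block[8];
    size_t off = 9;
    if (model & 0x02) {                      /* lzpSize present */
        uint32_t lzp_size;
        if (block_len < off + 4) return false;
        memcpy(&lzp_size, block + off, 4);
        if (lzp_size > orig_size) return false;
        off += 4;
    }
    if (model & 0x04) {                      /* rleSize present */
        uint32_t rle_size;
        if (block_len < off + 4) return false;
        memcpy(&rle_size, block + off, 4);
        if (rle_size > orig_size) return false;
    }
    return true;
}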

@Sewer56 force-pushed the add-bz3-compress-mem branch from 919e96c to 269133b (December 13, 2024 22:25)
@Sewer56 changed the title from "WIP: Improve Fuzzing, Use Less Mem on Decompress, Calc Memory Usage" to "WIP: Improve Fuzzing, Check Sufficient Buffer Size on Decompress, Calc Memory Usage" (Dec 13, 2024)
@Sewer56 changed the title from "WIP: Improve Fuzzing, Check Sufficient Buffer Size on Decompress, Calc Memory Usage" to "WIP: Improve Fuzzing, Enforce Sufficient Buffer Size on Decompress, Calc Memory Usage" (Dec 13, 2024)
@Sewer56 (Contributor, Author) commented Dec 13, 2024

I'll mark this as ready for review when it's done; in any case, I'm still experimenting.

@Sewer56 force-pushed the add-bz3-compress-mem branch from 196ac88 to 0deefe0 (December 14, 2024 00:08)
@Sewer56 force-pushed the add-bz3-compress-mem branch from 0deefe0 to a95e085 (December 14, 2024 00:16)
@Sewer56 force-pushed the add-bz3-compress-mem branch from 9bbfe08 to 23fb30c (December 14, 2024 01:50)
@Sewer56 force-pushed the add-bz3-compress-mem branch from 4ddcb9d to 510c4ca (December 14, 2024 03:30)
@Sewer56 force-pushed the add-bz3-compress-mem branch from 078abd6 to 5174b4e (December 14, 2024 04:41)
@Sewer56 (Contributor, Author) commented Dec 14, 2024

I think I'm happy enough for now; just doing a final round of fuzzing. Fuzzing is quite fun; working with this repo has actually been my first time using a proper fuzzer, which is surprising given that I like to go low-level. Probably mainly because I grew up in C# land, where there isn't much fuzzing tooling; before, I'd always just brute-force inputs in-process and force-stop on the first failing test case. AFL has been quite fun to work with, and fast.

I even managed to find an out-of-bounds read in one of the decompress routines via fuzzing, nice.
And I learned some new compression-related topics, e.g. how BWT works.


There is always a bit of looming uncertainty around memory safety and breaking changes.

Since the project has no test suite that I can see (and no coverage), doing any modification feels like playing with fire. There's always a risk of missing an edge case here or there. You don't really want to be fuzzing every time to see if something got borked by a risky change.

Ideally you want to make a test around every edge case found by fuzzing, make sure you never break them, and then go from there: write enough tests until there's coverage on all lines of code. If fuzzing breaks something, make a test case for it and add it to the test suite. Repeat until stable.
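
For example, a minimal round-trip regression test might look like this (a sketch only; it assumes the public libbz3.h API, including the new buffer-capacity parameter to bz3_decode_block, so signatures should be double-checked against the header):

#include <assert.h>
#include <libbz3.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round-trip data through bz3_encode_block/bz3_decode_block and assert the
   output matches the input byte for byte. */
static void round_trip(const uint8_t * data, int32_t size, int32_t block_size) {
    struct bz3_state * state = bz3_new(block_size);
    assert(state != NULL);

    size_t cap = bz3_bound(size);            /* worst-case encoded size */
    uint8_t * buf = malloc(cap);
    assert(buf != NULL);

    memcpy(buf, data, size);
    int32_t enc_size = bz3_encode_block(state, buf, size);
    assert(enc_size > 0);

    int32_t dec_size = bz3_decode_block(state, buf, cap, enc_size, size);
    assert(dec_size == size);
    assert(memcmp(buf, data, size) == 0);

    free(buf);
    bz3_free(state);
}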

If it were an environment I'm more familiar with, I would have added one or two sanity tests, but I don't know people's preferences regarding test suites for C (or C++), as I very, very rarely touch them. For native stuff I usually do Rust (with unsafe and minimal code size/dependencies) to max out the performance.

So I don't know whether you'd want an autotools-based approach or something else.


In any case, I made bz3_decode_block relatively safe, with the fuzzer for block decodes I added.

[screenshot]

Also appears to pass with AddressSanitizer so far.

[screenshot]

[Running the fuzzer with ASan is how I actually found the out-of-bounds read in decode.]

@Sewer56 marked this pull request as ready for review (December 14, 2024 06:46)
@Sewer56 changed the title from "WIP: Improve Fuzzing, Enforce Sufficient Buffer Size on Decompress, Calc Memory Usage" to "Improve Fuzzing, Enforce Sufficient Buffer Size on Decompress, Calc Memory Usage" (Dec 14, 2024)
@kspalaiologos (Owner)

This leaves the BWT transform and arithmetic coding; these together are applied regardless of whether they were successful or not.
The value of bz3_decode_block_bound should be the worst case possible size increase of BWT transform and AC, if I'm understanding this right (and lzp/rle for files made with earlier encoders). I don't know the exact worst case for these yet.

Stochastic testing with /dev/urandom data shows that bz3_bound is accurate and more than generous (the expansion generally follows a distribution from which we make assumptions about the maximum expansion). Deriving a concrete closed form from the code alone is probably not going to be very helpful.

In any case, with that in mind; a potential improvement for a newer version of the format would be encoding whether bwt+arithmetic was successful as an additional bit in the model field.

The problem with that is that it would require yet another auxiliary buffer because we need to store three things at once: pre-bwt block, post-bwt block and post-AC block. That said we could probably undo BWT during encoding if the AC size is too large. I will consider this for the second version of the format. Some ideas are already in #106.

There is always a bit of looming uncertainty around memory safety and breaking changes.
Since the project has no test suite that I can see (and no coverage), doing any modification feels like playing with fire. There's always a risk of missing an edge case here or there. You don't really want to be fuzzing every time to see if something got borked by a risky change.

Yes, I can see that. How I test my changes is with a corpus of a variety of files produced by bzip3 at various checkpoints in its history, which I then try to (de-, re-)compress. Testing compression software is difficult because even if you hit 100% coverage (near-impossible), various bug conditions hide specifically outside of branch conditions :-). The only way to test the tool is by compressing, decompressing and recompressing a very large (>50GB) corpus of files, which I am not going to upload to GitHub due to practicality issues. You're welcome to collect your own corpus.

Anyway, I think that I am done with my comments on this. Looks good. Thanks.

@Sewer56 (Contributor, Author) commented Dec 14, 2024

That said we could probably undo BWT during encoding if the AC size is too large. I will consider this for the second version of the format.

That was pretty much what I had in mind when I was thinking about the situation/problem. The case should be rare enough that in practice it would essentially never be hit, so even on a huge corpus of data, the difference in encode time would be immeasurable.

The only way you're realistically ever going to hit it is probably with encrypted data, where entropy is high and the data is avalanched by design.

@kspalaiologos (Owner)

My current plans are:

@Sewer56 (Contributor, Author) commented Dec 15, 2024

If there's a list of ideas anywhere for future format and/or library iterations, here are some random things that came to mind (in addition to the above). Bear in mind, all of these are fairly niche.

I meant to post this yesterday, but I'm setting up for travel.

1. Marking branches as likely/unlikely.

bzip3/src/libbz3.c, lines 214 to 215 in 47e8322:

// SAFETY: 'in' is advanced here, but it may have been at last index in the case of untrusted bad data.
if (UNLIKELY(in == in_end)) return -1;

I've done it here for example.

Essentially, you'd run the algorithms over a large corpus, look at the branching statistics for the functions, and make the appropriate markings. I am aware of Profile Guided Optimization and use it in some projects, but there will be setups/people who may not apply it in theirs. In some cases, applying PGO may also be non-trivial, e.g. with bindings from different languages.

In the case of the branch above: that check technically added 2 instructions to a loop/hot path. We could check up front whether the condition can ever hold before running the loop, and if it can't, run a copy of the loop without the check; i.e. hoist the check out of the loop and clone the loop.
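
As a rough illustration of both points: a typical likely/unlikely definition (assuming GCC/Clang builtins), plus the hoist-and-clone idea; this is illustrative code, not the actual lzp_decode_block loop.

#include <stddef.h>
#include <stdint.h>

#ifndef UNLIKELY
#define UNLIKELY(x) __builtin_expect(!!(x), 0)   /* typical GCC/Clang definition */
#endif

/* Split a hot loop into a fast path that provably needs no bounds check and
   a checked path for untrusted/truncated input. */
static int copy_checked(uint8_t * out, const uint8_t * in,
                        const uint8_t * in_end, size_t n) {
    if ((size_t)(in_end - in) >= n) {
        /* Fast path: the whole run is in bounds, no per-iteration check. */
        for (size_t i = 0; i < n; i++) out[i] = in[i];
        return 0;
    }
    /* Checked path: keep the per-iteration check for bad data. */
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(in == in_end)) return -1;
        out[i] = *in++;
    }
    return 0;
}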

2. Skipping CRC

Although it's <1% of the runtime and only 4 bytes, the CRC may not be desired in some use cases. For example, when storing BZip3-encoded block(s) in an external archive format which has its own hashes.

Outside of that, one could always use a faster hashing implementation; ideally something unrolled.
For large data, xxh3 is pretty good because it's a stable algorithm with HW acceleration on some architectures.
In that case, even on 32-bit builds you could make use of SSE, and just merge (fold) the top and bottom halves together if a 32-bit hash is desired.
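
For instance, with xxHash's XXH3 (a sketch assuming the standard xxhash.h API), folding the 64-bit result down when a 32-bit checksum is wanted:

#include <stddef.h>
#include <stdint.h>
#include <xxhash.h>   /* https://github.com/Cyan4973/xxHash */

/* 32-bit checksum via XXH3: hash with the 64-bit variant (SIMD-accelerated
   where available), then fold the top half into the bottom half. */
static uint32_t checksum32_xxh3(const void * data, size_t len) {
    XXH64_hash_t h = XXH3_64bits(data, len);
    return (uint32_t)(h ^ (h >> 32));
}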

3. bz3_new / bz3_free with pointer to custom allocator.

So you can allocate the bulk of the memory in a memory pool, etc.

When archiving BZip3 files into a single container/archive, you'll have files which all use different block sizes, and others which may use different algorithms.

While you could pool the state, i.e. bz3_new, doing that sort of thing across different algorithms would blow up memory usage and decrease performance, as memory regions would require paging in and out of cache.

So being able to allocate/deallocate with your own allocator would be helpful. Especially if you can precalculate the max memory needed by taking the max of bz3_memory_needed and other APIs, then multiplying by thread count.
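
Something along these lines, purely hypothetical (bz3_new_with_allocator and the hook types below do not exist in libbz3; this is just what the suggestion could look like):

#include <stddef.h>
#include <stdint.h>

struct bz3_state;   /* as declared in libbz3.h */

/* Hypothetical allocator hooks. */
typedef void * (*bz3_alloc_fn)(size_t size, void * user);
typedef void   (*bz3_free_fn)(void * ptr, void * user);

struct bz3_allocator {
    bz3_alloc_fn alloc;
    bz3_free_fn  free;
    void *       user;   /* e.g. a per-thread memory pool */
};

/* Hypothetical counterpart to bz3_new(): all internal buffers (including
   libsais scratch space) would be carved out of the supplied allocator, so a
   caller can budget max(bz3_memory_needed(...)) * thread_count up front. */
struct bz3_state * bz3_new_with_allocator(int32_t block_size,
                                          const struct bz3_allocator * allocator);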

@kspalaiologos (Owner) commented Dec 15, 2024

  1. Marking branches as likely/unlikely.

That barely does anything in many cases. I have considered this but ultimately I have not done that because the benefit was near-immeasurable.

  2. Skipping CRC

There's a rationale in the libbz3.c file for why I won't do that / replace it with a hardware CRC32C for now. But we could yank e.g. https://github.com/kspalaiologos/xpar/blob/main/xpar-x86_64.asm#L101 from my other project.

  3. bz3_new / bz3_free with pointer to custom allocator.

Maybe, maybe not, I'm not sure how we'd make libsais behave.

@kspalaiologos (Owner)

In general yes, there are many ways to improve this project and I know them all already. I have simply not been implementing them as of now, because some of the suggestions add platform-specific requirements or dependencies and I am not sure how distro maintainers would cope with that.

@kspalaiologos merged commit 2b2c18c into kspalaiologos:master on Dec 15, 2024. 34 checks passed.
@Sewer56 (Contributor, Author) commented Dec 15, 2024

It's all a matter of perspective really.
Some people consider ~1% a significant improvement and would fight tooth and nail for it.

Others, not so much, and would rather put resources elsewhere. For me, I'd take any win I can get.

From my perspective, my thoughts were the following. (Parts copied from a set of notes I made)

I came across the project after looking for something that gets high compression ratios for GPU textures (.dds & similar), as I'm working on a new archive format.

[I spend most (pretty much all) of my spare time making open source libraries & tools for others to enjoy, no strings attached]

So I initially tested on my 5900X with a large texture data set. BZip3 (16M blocks, 1.4.0, clang 18) decompressed at around 119.71 MB/s (on all threads).

Assuming a compression ratio of 0.62 (my dataset), the maximum broadband speed a user can have before getting bottlenecked is 0.62 * 119.71 MB/s, i.e. 596.63 Mbit/s. To achieve gigabit on this dataset, you'd need 1000 / 596.63, an improvement of about 1.68x. Something around a 7950X.
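
In formula form, just restating the arithmetic above:

max download before the CPU bottlenecks = compression ratio * decompression speed (converted to Mbit/s)
required speedup for gigabit = 1000 Mbit/s / that bottleneck speed = 1000 / 596.63 ≈ 1.68x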

According to the Steam HW Survey, the largest bucket of users has a 6-core CPU (presumably desktop); we can assume that to be something like an R5 5600, so give or take half of my CPU, at a slightly lower clock. That means somewhere around 50-60 MB/s full-throttle decompression on all threads, and 10-ish MB/s on a single thread.

For download speeds, from the Steam Download Stats we can see that the top countries average around 150 Mbps today.

Thankfully, for most users bz3 is still an improvement here, as, give or take, a regular person should be able to decompress while downloading using 2 threads.

But nonetheless, for the minority of users, such as those with gigabit connectivity, or those plugging a (lower-powered) laptop into an ethernet port with 300/500 Mbit, using bz3 could be a regression over something like LZMA, despite better compression ratios on certain types of data.

So I came in here and started doing some PRs. My main concern was fuzzing BZip3 to ensure it's suitable for web downloads (i.e. users deliberately creating malformed data can't cause invalid memory access), since it's relatively new. Everything else I threw in as an extra.

The secondary objective is picking up any easy performance win I can get, even if minor. I could try a bit harder one day, e.g. hand-written assembly routines, but with so many things to do in my own projects, spending the time there would be a bit challenging.

Might be worth trying libcubwt. I think that's the most significant possible improvement here at the moment. Unfortunate it's CUDA, as I don't like vendor lock-in, but it'd probably be a very noticeable improvement for most users. Even a low powered GPU could effectively be used as an additional 'thread' in compression/decompression. That sounds exciting. Hopefully the transforms it produces are compatible.


libsais memory usage

I'm traveling a bit (currently on a phone, walking around), so I can't study it properly, but it doesn't seem like the memory usage for libsais is that high, at least for the undo. So that's something that could be left for later, e.g. a PR eventually.

The readme claims 'typical 16KiB', though I wouldn't be so sure; maybe that's for another function. On cursory inspection, if I'm reading this right from my phone, the undo seems to need at least 256KiB (bucket2) + 128KiB (fastbits) for a typical block of 16M: https://github.com/IlyaGrebnov/libsais/blob/918b4903647d81294b19bd9b122807fd1144a37a/src/libsais.c#L7557. At least for decompression, that should be statistically insignificant for now. So not a huge concern.

@kspalaiologos (Owner)

Might be worth trying libcubwt. I think that's the most significant possible improvement here at the moment. Unfortunate it's CUDA, as I don't like vendor lock-in, but it'd probably be a very noticeable improvement for most users. Even a low powered GPU could effectively be used as an additional 'thread' in compression/decompression. That sounds exciting. Hopefully the transforms it produces are compatible.

It's not practical unless we have access to a GPU with a large amount of VRAM. Check the README.

@Sewer56 (Contributor, Author) commented Dec 16, 2024

I did actually check the README when making the suggestion; it says:

20.5n bytes of GPU memory

I don't consider that impractical.
The default block size for bzip3 is 16M, and 20.5 times that is 328MiB. You'd be hard pressed to find someone with that little VRAM nowadays. And that's the worst case, since BWT comes after RLE+LZP.

iGPUs are already automatically excluded since it's CUDA. The README also states:

Input files were capped at 352MB due to GPU memory limit.

(Note: it should say MiB there; I double-checked against the listed test cases.)

Since handling 2 such files at once would have run Ilya out of VRAM, it can be safely assumed that they only submitted 1 element at a time when testing everything in their test suite. That was still massively faster than the CPU.

In any case, because GPUs run SIMT, chances are that doing BWT on multiple blocks at a time wouldn't yield much benefit, so you probably only want 1. In which case, the typical VRAM requirement for a typical BZip3 compression run would just be the 328MiB mentioned above.

@kspalaiologos (Owner)

Currently I don't have a suitable GPU to check this/try replicating the results. So maybe sometime in the future.

@Sewer56 (Contributor, Author) commented Dec 16, 2024

I could try when I'm back home in a week; it's a bit older (7 years old), but I do have a 1080 Ti.

I've never tried building with CUDA, but I could try swapping out the BWT function and seeing what happens when I get back home.

@Sewer56 (Contributor, Author) commented Jan 8, 2025

I might do a PR with CUDA-backed BWT at some point.
Not now, as I've got other projects in the pipeline (e.g. a lossless BC7 texture transform), but down the road.

I got curious, so I built a dummy project with libcubwt. It took a short while, but I got it working.
When I benched the transform, the 1080 Ti (2017) I have lying around was around 8 times faster than the 5900X (2020) on all 12 cores / 24 threads. That was quite surprising 😅.

I cannot imagine how much faster modern cards would be at this.

@kspalaiologos (Owner)

Interesting...
