Improve Fuzzing, Enforce Sufficient Buffer Size on Decompress, Calc Memory Usage #144
Conversation
Thank you for all the work you're contributing. I will review it as soon as I can.
Don't worry, it's all good; we're all busy people after all :p
Thinking about this some more. I initially started fuzzing with […]. But as I've been reading through the code, some things came to mind. I originally experienced: […]. Because […]
i.e. we applied RLE encoding, which produced a negative compression ratio in that step. With 187b322, this is no longer possible on new files: lzp/rle are not used if their output exceeds the size of the input data (for rle) or of the previous step, input data OR rle output (for lzp). The […] And for the case of <64 bytes, during decompression the data would always shrink, by definition, since we append a crc32 and then […]

This leaves the BWT transform and arithmetic coding; these together are applied regardless of whether they were successful or not. The value of […]

In any case, with that in mind, a potential improvement for a newer version of the format would be encoding whether bwt+arithmetic was successful as an additional bit in the […]. With that, you would not only get faster decode on extremely poorly compressible data, but the buffer would no longer need to be extended beyond […].

In my case, I'd love to decompress directly into a buffer for a memory-mapped file; currently that's not possible, but with this potential future improvement it would be. Right now, an extra malloc and two memcpys (to the allocated buffer and back to the mmap pointer) are required.
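To illustrate, a minimal sketch of that future-format idea in C. The 0x02/0x04 bits match the header sketch later in this thread; MODEL_BWT_AC (0x08), its name, and the function are hypothetical, not part of the current format:

    #include <stdint.h>

    #define MODEL_LZP    0x02  /* existing: LZP pass was applied */
    #define MODEL_RLE    0x04  /* existing: RLE pass was applied */
    #define MODEL_BWT_AC 0x08  /* proposed: BWT + AC actually shrank the data */

    /* Encode side: keep the BWT+AC output only when it helped. */
    uint8_t finish_block(uint32_t input_size, uint32_t ac_size, uint8_t model) {
        if (ac_size < input_size) {
            model |= MODEL_BWT_AC;  /* store the AC output */
        } else {
            /* store the pre-BWT bytes verbatim; the decoder can then skip
               undoing both stages, and its output never grows past the
               original size */
        }
        return model;
    }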
I studied up a bit since then; I'm not a compression enthusiast, so I don't know everything in the world of compression off the top of my head. Assuming libsais does a standard BWT permutation, there's no size overhead there. We just store a BWT permutation index in the block and reverse the transform using it. Last step […]. Unfortunately, we also have to account for older files where […]. On that note, in my specific use case, I think I could determine if decompressing a block would run over […]:

    struct BlockHeader {
        uint32 crc32;
        uint32 bwtIndex;
        uint8  model;
        // Optional size fields based on compression flags
        if (model & 0x02)
            uint32 lzpSize; // Size after LZP compression
        if (model & 0x04)
            uint32 rleSize; // Size after RLE compression
    };

Simply assert that: […]. If both are true, then by definition the thing fed into (BWT+BCM) at the encode stage could not exceed […].
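A minimal sketch of that check, assuming the two assertions compare lzpSize/rleSize against the original size; the function name and signature are illustrative, not the PR's actual API:

    #include <stdint.h>

    /* Returns 1 if a block whose header carries these fields can be
       decoded into a buffer of orig_size bytes without overrun. */
    static int orig_size_sufficient(uint8_t model, uint32_t lzpSize,
                                    uint32_t rleSize, uint32_t orig_size) {
        if ((model & 0x02) && lzpSize > orig_size) return 0;
        if ((model & 0x04) && rleSize > orig_size) return 0;
        /* If neither recorded intermediate size exceeds orig_size, the
           data fed into BWT+BCM at encode time could not have exceeded
           it either. */
        return 1;
    }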
I'll mark it as ready for review when it's ready; in any case, I'm still experimenting.
I think I'm happy enough for now; just doing a final round of fuzzing.

Fuzzing is quite fun; working with this repo has actually been my first time with a proper fuzzer, which is surprising given that I like to go low level. Probably mainly because I grew up in C# land, and there isn't much fuzzing tooling over there; before, I'd always just brute-force inputs in process, then force-stop on the first failing test case. AFL's been quite fun to work with, and fast. I even managed to find an out-of-bounds read in one of the decompress routines via fuzzing, nice.

There is always a bit of looming uncertainty around memory safety and breaking changes. Since the project has no test suite that I can see (and no coverage), any modification feels like playing with fire; there's always a risk of missing an edge case here or there. You don't really want to be fuzzing every time to see if something got borked by a risky change. Ideally you want to make a test around every edge case found by fuzzing, make sure you never break them, and go from there: write enough tests that there's coverage on all lines of code; if fuzzing breaks something, make a test case for it and add it to the suite; repeat till stability.

If it were an environment I'm more familiar with, I would have added one or two sanity tests, but I don't know people's preferences regarding test suites for C (or C++), as I very, very rarely touch them. For native stuff I usually do Rust (with unsafe and minimal code size/dependencies) to max out the performance. So I don't know whether you'd want an autotools-based approach or otherwise.

In any case, I made […]. It also appears to pass with AddressSanitizer so far. [Running the fuzzer with ASan is how I actually found the out-of-bounds read in decode.]
Stochastic testing ([…])
The problem with that is that it would require yet another auxiliary buffer, because we would need to store three things at once: the pre-BWT block, the post-BWT block and the post-AC block. That said, we could probably undo the BWT during encoding if the AC size is too large. I will consider this for the second version of the format. Some ideas are already in #106.
Yes, I can see that. The way I test my changes is with a corpus of a variety of files produced by bzip3 at various checkpoints in its history, which I then try to (de-, re-)compress. Testing compression software is difficult because even if you hit 100% coverage (near-impossible), various bug conditions hide specifically outside of branch conditions :-). The only way to test the tool is by compressing, decompressing and recompressing a very large (>50 GB) corpus of files, which I am not going to upload to GitHub due to practicality issues. You're welcome to collect your own corpus. Anyway, I think that I am done with my comments on this. Looks good. Thanks.
That was pretty much what I had in mind when I was thinking about the situation/problem. The case should be rare enough that in practice it would never be hit, so even on a huge corpus of data, the difference in encode time would be immeasurable. The only way you're realistically ever going to hit it is probably encrypted data, where entropy is high and the data is avalanched by design.
My current plans are: […]
If there's a list of ideas anywhere for future format and/or library iterations, here are some random things that came to mind (in addition to the above). Bear in mind, all of these are fairly niche. I meant to post this yesterday, but I'm setting up for travel.

1. Marking branches as likely/unlikely (see lines 214 to 215 in 47e8322). I've done it here, for example. Essentially you'd run the algorithms over a large corpus, look at the branching statistics for the functions, and make the appropriate markings. I am aware of Profile-Guided Optimization (I use it in some projects), but there will be setups/people that may not apply it, and in some cases applying PGO is non-trivial, e.g. bindings from different languages. In the case of the branch above (lines 214 to 215 in 47e8322), the hint technically added 2 instructions to a loop/hot path. We could check whether the condition holds before running the loop and, if it won't, run a copy of the loop without it, i.e. hoist the check out of the loop and clone the loop (see the sketch after this list).

2. Skipping CRC. Although it's <1% of the runtime and only 4 bytes, the CRC may not be desired in some use cases, for example when storing BZip3-encoded block(s) in an external archive format which has its own hashes. Outside of that, one could always use a faster hashing implementation, ideally something unrolled.

3. […]
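For point 1, a sketch of the usual __builtin_expect wrappers plus the hoist-and-clone variant; the function here is a stand-in for illustration, not bzip3 code:

    #if defined(__GNUC__) || defined(__clang__)
      #define likely(x)   __builtin_expect(!!(x), 1)
      #define unlikely(x) __builtin_expect(!!(x), 0)
    #else
      #define likely(x)   (x)
      #define unlikely(x) (x)
    #endif

    /* In-loop hint: the branch stays in the hot path, merely predicted. */
    long sum_hinted(const unsigned char *in, long n, int rare_mode) {
        long acc = 0;
        for (long i = 0; i < n; i++) {
            if (unlikely(rare_mode))
                acc ^= in[i];   /* rare path */
            acc += in[i];       /* common work */
        }
        return acc;
    }

    /* Hoisted + cloned: the loop-invariant check runs once, and the hot
       path contains no trace of it. */
    long sum_cloned(const unsigned char *in, long n, int rare_mode) {
        long acc = 0;
        if (rare_mode) {
            for (long i = 0; i < n; i++) { acc ^= in[i]; acc += in[i]; }
        } else {
            for (long i = 0; i < n; i++) { acc += in[i]; }
        }
        return acc;
    }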
That barely does anything in many cases. I have considered this but ultimately I have not done that because the benefit was near-immeasurable.
There's a rationale in the libbz3.c file for why I won't do that / replace it with a hardware CRC32C for now. But we could yank e.g. https://github.com/kspalaiologos/xpar/blob/main/xpar-x86_64.asm#L101 from my other project.
Maybe, maybe not; I'm not sure how we'd make libsais behave.
In general yes, there are many ways to improve this project and I know them all already. I have simply not been implementing them as of now, because some of the suggestions add platform-specific requirements or dependencies and I am not sure how distro maintainers would cope with that.
It's all a matter of perspective, really. Others, not so much; they would rather put resources elsewhere. For me, I'd take any win I can get. From my perspective, my thoughts were the following (parts copied from a set of notes I made).

I came across the project after looking for something that gets high compression ratios on GPU textures (.dds & similar), as I'm working on a new archive format. [I spend most (pretty much all) of my spare time making open source libraries & tools for others to enjoy, no strings attached.]

So I initially tested on my 5900X with a large texture data set. BZip3 (16M blocks, 1.4.0, clang 18) decompressed at around 119.71 MB/s (on all threads). Assuming a compression ratio of 0.62 (my dataset), the maximum broadband speed a user can have before getting bottlenecked is 0.62 × 119.71 MB/s × 8 ≈ 594 Mbit/s. To achieve gigabit on this dataset, you'd need an improvement of 1000/594 ≈ 1.68×, something around a 7950X.

According to the Steam Hardware Survey, the largest bucket of users has a 6-core CPU (presumably desktop); we can assume that to be something like an R5 5600, give or take half of my CPU, and at a slightly lower clock. So somewhere around 50-60 MB/s full-throttle decompression on all threads, and 10-ish MB/s on a single thread. For download speeds, the Steam Download Stats show the top countries averaging around 150 Mbps today. Thankfully, for most users here bz3 is still an improvement, as, give or take, a regular person should be able to decompress while downloading on 2 threads. But nonetheless, for the minority of users, such as those with gigabit connectivity, or those plugging a (lower-powered) laptop into an ethernet port with 300/500 Mbit, using bz3 could be a regression over something like LZMA, despite better compression ratios on certain types of data.

So I came in here and started doing some PRs. My main concern was fuzzing BZip3 to ensure it's suitable for web downloads (i.e. users deliberately creating malformed data can't cause invalid memory access), since it's relatively new. Everything else I threw in as an extra. The secondary objective is picking up any easy performance win I can get, even if minor. I could try a bit harder one day, e.g. hand-written assembly routines, but with so many things to do in my own projects, spending the time there would be a bit challenging.

Might be worth trying libcubwt. I think that's the most significant possible improvement here at the moment. It's unfortunate that it's CUDA, as I don't like vendor lock-in, but it'd probably be a very noticeable improvement for most users. Even a low-powered GPU could effectively be used as an additional 'thread' in compression/decompression.

That sounds exciting. Hopefully the transforms it produces are compatible.
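The arithmetic above as a tiny self-contained calculator; the inputs are the measurements from this comment, not general constants:

    #include <stdio.h>

    int main(void) {
        double decode_MBps = 119.71; /* bz3, 16M blocks, 5900X, all threads */
        double ratio       = 0.62;   /* compressed size / original size */
        /* The decoder consumes compressed bytes at (decode rate * ratio);
           beyond that link speed, the download outruns decompression. */
        double link_mbit = decode_MBps * ratio * 8.0;
        printf("bottleneck link speed: %.0f Mbit/s\n", link_mbit);    /* ~594 */
        printf("speedup for gigabit:   %.2fx\n", 1000.0 / link_mbit); /* ~1.68 */
        return 0;
    }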
I'm traveling a bit (currently on a phone, walking around), so I can't study it properly, but it doesn't seem like the memory usage for libsais is that high, at least for the undo. So that's always something that could be left for later, e.g. PR it eventually. The readme claims 'typical 16KiB', though I wouldn't be so sure; maybe that's for another function. On cursory inspection, if I'm reading this right from my phone, the undo seems to need at least 256 KiB (bucket2) + 128 KiB (fastbits) for a typical block of 16M: https://github.com/IlyaGrebnov/libsais/blob/918b4903647d81294b19bd9b122807fd1144a37a/src/libsais.c#L7557 . At least for decompress, that should be statistically insignificant for now. So not a huge concern.
It's not practical unless we have access to a GPU with a large amount of VRAM. Check the README.
I did actually check the README when making the suggestion. It says: […]

I don't consider that impractical. iGPUs are already automatically excluded, as it's CUDA. The README also states: […]

(Note: it should say MiB there; I double-checked against the listed test cases.) Since handling two such files at once would have run Ilya out of VRAM, it can be safely assumed that they only submitted one element at a time when testing everything in their test suite. That was still massively faster than CPU. In any case, because GPUs run SIMT, chances are that doing BWT on multiple blocks at a time wouldn't yield much benefit, so you probably only want one. In that case, the typical VRAM requirement for a typical BZip3 compression run would just be the 328 MiB mentioned above.
Currently I don't have a suitable GPU to check this / try replicating the results. So maybe sometime in the future.
I could try when I'm back home in a week; it's a bit older (7 years old), but I do have a 1080 Ti. I've never tried building with CUDA, but I could try swapping out the BWT function and seeing what happens when I get back home.
I might do a PR with CUDA-supported BWT at some point. I got curious, so I built a dummy project with […]. I cannot imagine how much faster modern cards would be at this.
Interesting...
This is a culmination of a few small patches:

- A small rewrite of fuzz.c.
- Added a fuzzer for block decodes (fuzz-decode-block.c).
- Added a fuzzer for round trips (fuzz-round-trip.c).
- Added a bz3_memory_needed API, which provides info about the expected memory usage of a call to bz3_new and of the common compression APIs used with it afterwards.
- Fixed an out-of-bounds read in lzp_decode_block (66cde0b). Caught by fuzzing with fuzz-decode-block.c.
- New error type BZ3_ERR_DATA_SIZE_TOO_SMALL, returned when an insufficient buffer size is passed to bz3_decode_block (see the sketch below).
- Added bounds checks to bz3_decode_block, to ensure we won't write outside of the user-provided buffer.
- Updated documentation around bz3_decode_block.

[Check Edit History for Info on Earlier Commits if Desired]
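A usage sketch for the new error path. bz3_last_error and bz3_bound are existing libbz3 APIs; the buffer_size parameter of bz3_decode_block is the one this PR adds, and its exact position and type in the signature are an assumption here:

    #include <stdint.h>
    #include <stdio.h>
    #include "libbz3.h"

    /* buffer holds comp_size compressed bytes and has buffer_size bytes of
       capacity; decoding happens in place. */
    int32_t decode_checked(struct bz3_state *st, uint8_t *buffer,
                           size_t buffer_size, int32_t comp_size,
                           int32_t orig_size) {
        int32_t n = bz3_decode_block(st, buffer, buffer_size,
                                     comp_size, orig_size);
        if (n == -1 && bz3_last_error(st) == BZ3_ERR_DATA_SIZE_TOO_SMALL) {
            /* Capacity was too small for this block; bz3_bound(orig_size)
               bytes are always enough. */
            fprintf(stderr, "buffer too small, need up to %zu bytes\n",
                    bz3_bound((size_t)orig_size));
        }
        return n;
    }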