diff --git a/doc/bzip3_format.md b/doc/bzip3_format.md
new file mode 100644
index 0000000..be71703
--- /dev/null
+++ b/doc/bzip3_format.md
@@ -0,0 +1,180 @@
+# BZip3 Format Specification
+
+Version 1
+
+## Headers
+
+The File and Frame formats share a similar structure, differing only in whether they include a
+block count field.
+
+### File Header
+
+```
++----------------+------------------+--------------------+
+| Header         | Chunk 1          | Chunk 2            |
+| (9 bytes)      | (variable size)  | (variable size)    |
++----------------+------------------+--------------------+
+```
+
+This is created by the CLI tool.
+
+### Frame Header
+
+```
++----------------+------------------+--------------------+
+| Header         | Chunk 1          | Chunk 2            |
+| (13 bytes)     | (variable size)  | (variable size)    |
++----------------+------------------+--------------------+
+```
+
+This is created/read by `bz3_compress` and `bz3_decompress`.
+
+### Header Structure
+
+| Field          | Type   | Description                     | File Header | Frame Header |
+| -------------- | ------ | ------------------------------- | ----------- | ------------ |
+| Signature      | u8[5]  | Fixed "BZ3v1" ASCII string      | ✓           | ✓            |
+| Max Block Size | u32_le | Maximum decompressed block size | ✓           | ✓            |
+| Block Count    | u32_le | Number of blocks in the stream  | ✗           | ✓            |
+
+### Validation Rules
+
+1. **Signature**: Must exactly match "BZ3v1"
+2. **Max Block Size**:
+   - Minimum: 65KiB (66,560 bytes)
+   - Maximum: 511MiB (535,822,336 bytes)
+3. **Block Count** (Frame Format only):
+   - Must match the actual number of blocks in the stream
+   - Should be greater than 0
+
+### Example Parser
+
+```c
+typedef struct {
+    uint32_t max_block_size;
+    uint32_t block_count; // Frame Format only
+} bzip3_header_t;
+
+bool read_bzip3_header(FILE* fp, bzip3_header_t* header, bool is_frame_format) {
+    char signature[6] = {0};
+
+    // Read signature
+    if (fread(signature, 1, 5, fp) != 5)
+        return false;
+
+    if (strcmp(signature, "BZ3v1") != 0)
+        return false;
+
+    // Read max block size
+    uint8_t size_bytes[4];
+    if (fread(size_bytes, 1, 4, fp) != 4)
+        return false;
+
+    header->max_block_size = read_neutral_s32(size_bytes);
+
+    if (header->max_block_size < 66560 ||      // 65 KiB minimum
+        header->max_block_size > 535822336)    // 511 MiB maximum
+        return false;
+
+    // Read block count if Frame Format
+    if (is_frame_format) {
+        uint8_t count_bytes[4];
+        if (fread(count_bytes, 1, 4, fp) != 4)
+            return false;
+
+        header->block_count = read_neutral_s32(count_bytes);
+
+        if (header->block_count == 0)
+            return false;
+    }
+
+    return true;
+}
+```
+
+The integers in BZip3 are written unaligned, in little-endian format.
+A portable implementation is below.
+
+```c
+// Reading a 32-bit integer
+static s32 read_neutral_s32(u8 * data) {
+    return ((u32)data[0]) |
+           (((u32)data[1]) << 8) |
+           (((u32)data[2]) << 16) |
+           (((u32)data[3]) << 24);
+}
+
+// Writing a 32-bit integer
+static void write_neutral_s32(u8 * data, s32 value) {
+    data[0] = value & 0xFF;
+    data[1] = (value >> 8) & 0xFF;
+    data[2] = (value >> 16) & 0xFF;
+    data[3] = (value >> 24) & 0xFF;
+}
+```
+
+## Block Format
+
+After the header, both File and Frame formats contain a sequence of blocks that follow the Block
+Format specification. Each block is encapsulated in a chunk structure that defines its size.
+
+The blocks (***without chunk header***) can be encoded/decoded using the `bz3_encode_block`
+and `bz3_decode_block` APIs.
+
+### Chunk Structure
+
+```c
+// Main block structure
+struct Chunk {
+    u32_le compressedSize; // Size of compressed block
+    u32_le origSize;       // Original uncompressed size
+
+    if (origSize < 64) {
+        SmallBlock block;
+    } else {
+        Block block;
+    }
+};
+```
+
+### Small Block Format (< 64 bytes)
+
+For blocks smaller than 64 bytes, no compression is attempted. The data is stored with just a checksum:
+
+```c
+struct SmallBlock {
+    u32_le crc32;   // CRC32 checksum
+    u32_le literal; // Always 0xFFFFFFFF for small blocks. This is basically an invalid `bwtIndex`
+    u8 data[parent.compressedSize - 8]; // Uncompressed data
+};
+```
+
+### Regular Block Format (≥ 64 bytes)
+
+Larger blocks use a more complex format that supports multiple compression features:
+
+```c
+struct Block {
+    u32_le crc32;    // CRC32 checksum of uncompressed data
+    u32_le bwtIndex; // Burrows-Wheeler transform index
+    u8 model;        // Compression model flags
+
+    if ((model & 0x02) != 0)
+        u32_le lzpSize; // Size after LZP compression
+    if ((model & 0x04) != 0)
+        u32_le rleSize; // Size after RLE compression
+
+    u8 data[parent.compressedSize - (popcnt(model) * 4 + 9)];
+};
+```
+
+#### Compression Model
+
+The `model` byte in regular blocks indicates which compression features were used:
+
+- `0x02`: LZP (Lempel-Ziv Prediction) filter
+- `0x04`: RLE (Run-Length Encoding) filter
+
+## External Resources
+
+- [BZip3 Pattern for ImHex](https://github.com/WerWolv/ImHex-Patterns/pull/329)
diff --git a/doc/file_format.md b/doc/file_format.md
deleted file mode 100644
index 43b66ca..0000000
--- a/doc/file_format.md
+++ /dev/null
@@ -1,21 +0,0 @@
-
-# The bzip3 file format
-
-Each bzip3-compressed file starts with the marker `BZ3v1`. After the signature, the compressor encodes a 32-bit number signifying the maximum block size in bytes in the file. As such, no block after decompression in the stream can exceed it. The maximum block size must be between 65KiB and 511MiB.
-
-The following functions are used for serialising all 32-bit numbers to the archive:
-
-```c
-static s32 read_neutral_s32(u8 * data) {
-    return ((u32)data[0]) | (((u32)data[1]) << 8) | (((u32)data[2]) << 16) | (((u32)data[3]) << 24);
-}
-
-static void write_neutral_s32(u8 * data, s32 value) {
-    data[0] = value & 0xFF;
-    data[1] = (value >> 8) & 0xFF;
-    data[2] = (value >> 16) & 0xFF;
-    data[3] = (value >> 24) & 0xFF;
-}
-```
-
-After the file header, the bzip3-compressed file contains a series of independent blocks compressed using the low level API.
\ No newline at end of file
diff --git a/doc/high_level_format.md b/doc/high_level_format.md
deleted file mode 100644
index 7f05ff1..0000000
--- a/doc/high_level_format.md
+++ /dev/null
@@ -1,4 +0,0 @@
-
-# High level API bzip3 frame format.
-
-The bzip3 frame format is a concatenation of bzip3-compressed blocks. It's used exclusively by the `bz3_compress` and `bz3_decompress` functions and will not work with the command-line tool or low level functions. Each frame start with the ASCII "BZ3v1" signature, followed by the 32-bit maximum block size in bytes and the 32-bit amount of blocks in the frame. After the 13 byte header, a sequence of independent blocks encoded using the low level API follows.
diff --git a/doc/low_level_format.md b/doc/low_level_format.md
deleted file mode 100644
index e84d7cb..0000000
--- a/doc/low_level_format.md
+++ /dev/null
@@ -1,14 +0,0 @@
-
-# Low level API bzip3 block format.
-
-Each chunk starts with the _new_ size - a 32-bit integer signifying the _compressed_ size of the block, and the _old_ size - a 32-bit integer signifying the _decompressed_ size. Then, a sequence of bzip3-compressed data follows. CRC32 checking is left up to libbz3.
-
-If the chunk is smaller than 64 bytes, then compression is not attempted. Instead, the content is prepended with the 32-bit CRC32 checksum and a 0xFFFFFFFF literal.
-
-Otherwise, the chunk starts with the 32-bit CRC32 checksum value, the Burrows-Wheeler transform permutation index and the compression _model_ - a 8-bit value specifying the compression preset used. As such:
-
-- 2-s bit set in the _model_ - LZP was used and the 32-bit size is prepended to the block.
-- 4-s bit set in the _model_ - RLE was used and the 32-bit size is prepended to the block.
-- No other bit can be set in the _model_.
-
-The size of libbz3's block header can be calculated using the formula `popcnt(model) * 4 + 9`.
diff --git a/doc/overview.md b/doc/overview.md
new file mode 100644
index 0000000..087a837
--- /dev/null
+++ b/doc/overview.md
@@ -0,0 +1,30 @@
+# BZip3 Format Documentation
+
+BZip3 is a modern compression format designed for high compression ratios while maintaining
+reasonable decompression speeds. It aims for compression ratios and performance comparable to
+LZMA and BZip2, in contrast to faster Lempel-Ziv codecs such as Zstandard or LZ4, which usually
+achieve worse compression ratios.
+
+This documentation covers the technical specifications of the BZip3 format.
+
+## Format Characteristics
+
+- Block-level compression (no streams)
+- Maximum block size ranges from 65KiB to 511MiB
+- Memory usage of roughly 6 × block size for both compression and decompression
+- Little-endian encoding for integers
+- Embedded CRC32 checksums for data integrity
+- Combines LZP and RLE filters, followed by a Burrows-Wheeler transform and arithmetic coding
+  coupled with a statistical predictor.
+
+## Format Overview
+
+BZip3 uses two main top-level formats:
+
+1. **File Format**: The standard format used by the command-line tool
+2. **Frame Format**: Used by the high-level API functions `bz3_compress` and `bz3_decompress`.
+
+These formats are very similar: the frame format is a superset of the file format, adding a
+block count field to its header.
+
+See [bzip3_format.md](./bzip3_format.md) for more details.
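As a companion to the chunk layout described in `bzip3_format.md`, here is a minimal sketch of how a reader might decode one chunk header from a memory buffer and locate its payload. It is not libbz3's implementation: `chunk_info_t`, `read_u32_le`, `block_header_size`, and `parse_chunk` are hypothetical names introduced for illustration. The field order, the `origSize < 64` small-block rule, the `0xFFFFFFFF` marker, the `0x02`/`0x04` flags, and the `popcnt(model) * 4 + 9` header size come from the documents in this change. Rejecting other `model` bits follows the older low-level format notes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; not part of libbz3. */
typedef struct {
    uint32_t compressed_size; /* size of the block after the 8-byte chunk header */
    uint32_t orig_size;       /* decompressed size of the block */
    uint32_t crc32;           /* CRC32 stored in the block header */
    uint32_t bwt_index;       /* 0xFFFFFFFF for small (stored) blocks */
    uint8_t  model;           /* 0 for small blocks */
    uint32_t lzp_size;        /* meaningful only if (model & 0x02) */
    uint32_t rle_size;        /* meaningful only if (model & 0x04) */
    size_t   payload_offset;  /* offset of the data bytes from the chunk start */
    size_t   payload_size;    /* number of data bytes */
} chunk_info_t;

/* Portable little-endian read, equivalent in spirit to read_neutral_s32. */
static uint32_t read_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* popcnt(model) * 4 + 9, counting only the two documented flag bits. */
static size_t block_header_size(uint8_t model) {
    return 9 + 4 * (size_t)(((model >> 1) & 1) + ((model >> 2) & 1));
}

/* Returns 0 on success, -1 if the chunk is truncated or malformed. */
static int parse_chunk(const uint8_t *buf, size_t len, chunk_info_t *out) {
    if (len < 8)
        return -1;
    out->compressed_size = read_u32_le(buf);
    out->orig_size = read_u32_le(buf + 4);
    /* Every block carries at least crc32 + bwtIndex/literal (8 bytes). */
    if (out->compressed_size < 8 || out->compressed_size > len - 8)
        return -1;

    const uint8_t *block = buf + 8;
    out->crc32 = read_u32_le(block);
    out->bwt_index = read_u32_le(block + 4);
    out->model = 0;
    out->lzp_size = 0;
    out->rle_size = 0;

    size_t header_size = 8; /* small block: crc32 + literal */
    if (out->orig_size < 64) {
        if (out->bwt_index != 0xFFFFFFFFu)
            return -1;
    } else {
        if (out->compressed_size < 9)
            return -1;
        out->model = block[8];
        if ((out->model & 0xF9) != 0) /* only 0x02 and 0x04 are defined */
            return -1;
        header_size = block_header_size(out->model);
        if (out->compressed_size < header_size)
            return -1;
        const uint8_t *p = block + 9;
        if (out->model & 0x02) { out->lzp_size = read_u32_le(p); p += 4; }
        if (out->model & 0x04) { out->rle_size = read_u32_le(p); }
    }

    out->payload_offset = 8 + header_size;
    out->payload_size = out->compressed_size - header_size;
    return 0;
}
```

A caller walking a file would invoke `parse_chunk` at the current offset and then advance by `8 + compressed_size` bytes to reach the next chunk. The sketch only locates fields; it does not verify the CRC32 or decompress the payload.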