This document lists all file formats that this project can validate, along with the depth of integrity checking performed.
Validation Depth Levels
Internally, validation has only two depths (see ValidationDepth enum in format_validation.zig):
| Level | Description | Corruption Detection |
| --- | --- | --- |
| Structure | Headers, magic bytes, offsets, bounds checking | ⚠️ Payload corruption may go UNDETECTED |
| Full | Checksum verified, decompressed, or fully decoded | ✅ Payload corruption WILL be detected |
The Critical Distinction
Structure validation only checks that:
- Magic bytes are correct
- Header fields are valid
- Offsets and sizes don't exceed file bounds
- Container structure is well-formed (e.g., braces match, chunks don't overlap)
A random bit flip in the payload would NOT cause structural validation to fail.
Full validation means every byte was verified via one of:
- Checksum/hash (CRC32, MD5, header checksums): the math covers all bytes
- Decompression (gzip, zlib, bzip2): the algorithm verified the data while decompressing
- Full decode (JPEG pixels decoded, XML fully parsed, audio samples rendered)
A random bit flip in the payload WOULD cause full validation to fail.
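The difference between the two depths can be illustrated with a toy container. This is only a sketch: the `TOY1` magic, the layout (magic, length, payload, CRC32 trailer), and every function name are hypothetical and are not taken from `format_validation.zig`.

```python
import struct
import zlib

MAGIC = b"TOY1"  # hypothetical magic, used only for this illustration


def build(payload: bytes) -> bytes:
    """Pack magic + payload length + payload + CRC32 of the payload."""
    header = MAGIC + struct.pack("<I", len(payload))
    return header + payload + struct.pack("<I", zlib.crc32(payload))


def validate_structure(blob: bytes) -> bool:
    """Structure depth: magic, header field, and bounds only."""
    if len(blob) < 12 or blob[:4] != MAGIC:
        return False
    (length,) = struct.unpack_from("<I", blob, 4)
    return 8 + length + 4 == len(blob)


def validate_full(blob: bytes) -> bool:
    """Full depth: structure plus a checksum covering every payload byte."""
    if not validate_structure(blob):
        return False
    (stored,) = struct.unpack_from("<I", blob, len(blob) - 4)
    return zlib.crc32(blob[8:-4]) == stored


def flip_bit(blob: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of blob with one bit inverted."""
    out = bytearray(blob)
    out[byte_index] ^= 1 << bit
    return bytes(out)


good = build(b"payload bytes that nothing in the header describes")
bad = flip_bit(good, 20, 3)  # offset 20 lies inside the payload
print(validate_structure(bad), validate_full(bad))  # True False
```

The flipped bit leaves the magic and the length field untouched, so the structural check still passes; only the checksum pass notices the damage.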
Display Labels in Tables Below
For clarity, the tables below use more specific labels:
| Table Label | Maps To | Examples |
| --- | --- | --- |
| ⚠️ Stub | structural + WARN | Magic/size check only; format recognized but no real corruption detection |
| Structure | structural | Bounds checking, brace matching, header parsing |
| Checksum | full | CRC32 verified, MD5 verified, header checksum |
| Decompress | full | Gzip/zlib/bzip2 decompression succeeded |
| Full Decode | full | JPEG pixels decoded, audio samples rendered |
| Integrity | full | Database page checksums, full XML parse |
Why Some Formats Cannot Achieve Full Validation
Every format that caps at Structure rather than Checksum or Full Decode has a specific technical reason. The tables in this section document those reasons so that structural-only depth is never treated as "we just haven't gotten to it yet". Most are fundamental limitations of the formats themselves; the exceptions (WIM/ESD, limited by cost, and YAML, limited by tooling) are called out in their rows.
| Format | Max Depth | Why Full Validation Is Impossible |
| --- | --- | --- |
| VMDK | Structure | VMware's format contains zero checksums or hashes anywhere in the spec. Integrity relies entirely on structural consistency (magic, version, flag sanity, grain-size power-of-2, overhead bounds). The only corruption sentinel is a 4-byte newline-detection field (bytes 73–76) that catches FTP text-mode transfers, but a random bit flip in grain data is invisible. VMware's own consistency tools work the same way: they verify structural invariants, not data integrity. The optional Redundant Grain Directory (RGD) provides a second copy of the grain directory for crash recovery, but even that is a structural redundancy, not a checksum. |
| WIM/ESD | Structure | WIM has an optional integrity table (SHA-1 over ~10MB chunks), but it is absent in the vast majority of WIM files in the wild. When present, verifying it requires reading the entire multi-gigabyte file and computing SHA-1 over each chunk, which is prohibitively expensive for a file validator that processes hundreds of thousands of files. Our structural validation covers the 208-byte header (magic, version, flags, part numbers, reserved-zero fields, resource header offset bounds), which catches truncation, header corruption, and interrupted writes (WRITE_IN_PROGRESS flag). Future: if we add streaming hash verification for large-file mode, WIM could reach Checksum when the integrity table is present. |
| Toast | Structure | Roxio Toast disc images are essentially renamed ISO 9660 images, sometimes with an Apple Partition Map (APM) prefix. ISO 9660 has no internal checksums. The only cross-validation possible is comparing the PVD's declared volume space size against the actual file size (catches truncation). The Application Identifier field can confirm Toast provenance but doesn't verify data. Some Toast files are hybrid APM+ISO, but APM partition entries also lack checksums. This is a fundamental limitation inherited from the ISO 9660 and APM specs. |
| CDG | Structure | CD+Graphics is a raw dump of subchannel data from audio CDs: a flat stream of 24-byte packets with no file header, no magic bytes, no checksums, and no framing. The parity fields (bytes 2–3 and 20–23 of each packet) are physical CD EDC/ECC from the disc drive hardware; no known software validates them in ripped .cdg files, and software-generated CDG files leave them zeroed. Identification relies entirely on extension + size divisibility by 24. Validation checks CDG command presence and tile coordinate bounds, but a bit flip in pixel data or color table entries is undetectable. |
| RealMedia | Structure | RealNetworks' container format uses a simple chunk-based structure with no CRC, hash, or checksum fields anywhere in the spec: not in the file header, not in chunk headers, not in media packets. The only integrity verification possible is structural: chunk sizes must not exceed file bounds, num_streams must match MDPR chunk count, data_offset/index_offset must point to correct chunk types. A corrupted media packet would parse structurally but produce garbage audio/video. Even the embedded RealAudio sub-headers (.ra\xFD) contain no checksums. |
| MSI | Structure | MSI files are OLE2/CFBF containers. While OLE2 has internal FAT/DIFAT structure that can be validated, the MSI-specific data inside (installer tables, CAB streams, etc.) uses no additional checksums beyond what the OLE2 container provides. MSI detection itself is the challenge: we identify it by characteristic stream names (_Tables, _SummaryInformation) or the MSI CLSID, rather than a unique magic byte sequence. The OLE2 FAT structure validation catches sector-level corruption but not payload bit flips. |
| QOI | Structure | Quite OK Image format has a 14-byte header with no checksum. The image data uses a simple streaming codec with no per-row or per-frame checksums. A corrupted byte would cause visual artifacts but not a decoding failure. |
| DPX | Structure | SMPTE 268M defines a header with file size and image dimensions but no checksum field. DPX was designed for post-production pipelines where data integrity was assured by the storage/transport layer. |
| TGA | Structure | Truevision TGA has an 18-byte header with no checksum. The optional v2 footer provides a signature but no data integrity verification. |
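As a concrete case of a structural-only check, the CDG row above can be sketched as follows. This is a minimal illustration, not the project's validator: it assumes the commonly documented CD+G convention that a packet belongs to the graphics channel when the low six bits of its first byte equal 0x09, and the function name is hypothetical.

```python
PACKET_SIZE = 24     # every subchannel packet is 24 bytes
CDG_COMMAND = 0x09   # assumed marker for a CD+G packet
MASK = 0x3F          # only the low six bits of the command byte are used


def check_cdg(data: bytes) -> dict:
    """Structural checks only: packet alignment and CD+G command presence."""
    aligned = len(data) > 0 and len(data) % PACKET_SIZE == 0
    cdg_packets = 0
    for off in range(0, len(data) - PACKET_SIZE + 1, PACKET_SIZE):
        if data[off] & MASK == CDG_COMMAND:
            cdg_packets += 1
    return {"aligned": aligned, "cdg_packets": cdg_packets}


# Three packets: command 0x09, instruction 0x01, 22 zero bytes each.
stream = (bytes([0x09, 0x01]) + bytes(22)) * 3

# Overwrite one byte in the data area of the first packet.
tampered = bytearray(stream)
tampered[10] ^= 0xFF

print(check_cdg(stream) == check_cdg(bytes(tampered)))  # True
```

Both streams produce the same result, which is the limitation the table describes: nothing in the format lets a validator tell the damaged data bytes from the originals.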
Video/Media Containers Without Checksums
| Format | Max Depth | Why Full Validation Is Impossible |
| --- | --- | --- |
| FLV | Structure | Adobe's Flash Video container has a simple tag-based structure with no checksums. Tags contain type, size, timestamp, and stream ID, all structural. Payload integrity depends entirely on the codec stream inside. |
| ASF/WMV/WMA | Structure | Microsoft's Advanced Systems Format uses 128-bit GUIDs for object identification and has object size fields, but no CRC or hash anywhere in the spec. ASF was designed for streaming where transport-layer integrity (TCP) was assumed. |
| DV | Structure | DV is a fixed-structure format (DIF blocks of 80 bytes in defined sections) with no checksums. It relies on physical tape error correction. A corrupted byte in video data is invisible at the container level. |
| IVF | Structure | IVF is a minimal testing container (from the WebM project) with a 32-byte file header and 12-byte frame headers. No checksums by design; it's a thin wrapper around raw VP8/VP9/AV1 frames for codec testing. |
| MPEG-TS | Structure | Transport Stream uses 188-byte packets, each starting with a sync byte (0x47) and carrying a 4-bit continuity counter, but it has no checksum over elementary stream payloads. MPEG-TS was designed for broadcast, where FEC (forward error correction) at the physical layer handles corruption. The sync byte and continuity counter catch packet-level loss but not bit-level payload corruption. |
| MPEG-PS | Structure | Program Stream has pack headers with SCR timestamps and system headers, but no CRC or hash over PES packet payloads. Like TS, it relies on the transport/storage layer for data integrity. |
Other Structural-Only Formats
| Format | Max Depth | Why Full Validation Is Impossible |
| --- | --- | --- |
| PBM/PGM/PPM/PAM | Structure | Netpbm formats are intentionally minimal ASCII/binary image formats with no metadata, no checksums, no compression. The entire spec is a magic number + dimensions + raw pixel data. A corrupted pixel byte is indistinguishable from a legitimate pixel value. |
| Adobe InDesign | Structure | INDD uses a proprietary binary format with a unique magic sequence but no publicly documented checksums. The internal page/object structure is documented only in Adobe's SDK, and even that doesn't expose integrity verification primitives. |
| Adobe After Effects | Structure | AEP files use RIFX (big-endian RIFF) containers. RIFF has chunk IDs and sizes but no per-chunk checksums. |
| Adobe Illustrator | Structure | Modern AI files are PDF-based (validated as PDF when possible) or PostScript-based. Legacy PS-based AI files have no checksum mechanism. |
| RTF | Structure | RTF is a plain-text markup format. Validation is limited to brace matching ({/}) and control word syntax. No checksums exist; RTF is just tagged text. |
| WordPerfect | Structure | WPD has a header signature and document area offset but no checksums in the publicly known spec. The format is proprietary and largely undocumented beyond the header. |
| Reaper | Structure | RPP files are UTF-8 text with bracket-delimited sections. No checksums; it's a human-readable text format, like a structured config file. |
| Access MDB/ACCDB | Structure | Jet/ACE database format has page structures, but page-level checksums are not present in MDB (Jet 3.x/4.0) and only optionally present in ACCDB. Unlike SQLite, which offers PRAGMA integrity_check to cross-check page and b-tree consistency (per-page checksums exist there only through an optional extension), Access relied on the filesystem for data integrity. |
| STL/DXF/STEP | Structure | CAD interchange formats (STL, DXF, STEP) are often plain-text or minimal-binary formats designed for cross-tool portability. None include checksums; they rely on the file system. STL's binary variant has a triangle count but no data integrity verification. |
| MBOX | Structure | MBOX is a plain-text concatenation of email messages delimited by "From " lines. No checksums; it predates modern integrity mechanisms and was designed when filesystem reliability was assumed. |
| YAML | Structure | YAML is a text serialization format. We currently only detect structure (not full parse) because the Zig ecosystem lacks a mature YAML parser. This is a tooling limitation, not a format limitation; a full parse would achieve Integrity. |
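The brace matching mentioned for RTF can be sketched in a few lines. This is a simplified illustration with a hypothetical function name: it treats a backslash as escaping the next character (so `\{` and `\}` are not counted) and does not handle raw `\bin` data or control word syntax.

```python
def rtf_braces_balanced(text: str) -> bool:
    """Return True when every unescaped '{' has a matching '}'."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Skip the backslash and the character after it, so the
            # escaped literals \{ and \} are never counted as groups.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace with no open group
        i += 1
    return depth == 0


ok = r"{\rtf1 {\b bold} \{literal\}}"
print(rtf_braces_balanced(ok))                            # True
print(rtf_braces_balanced(ok[:-1]))                       # False
print(rtf_braces_balanced(ok.replace("bold", "bo1d")))    # True
```

The last call shows why this depth stops at Structure: altered text inside a group leaves the braces balanced, so the check cannot see it.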
Formats with partial integrity coverage (not listed above): Some formats like MPEG-TS have sync bytes and continuity counters but no payload checksums — these provide structural validation that catches gross corruption (packet loss, desync) but not bit-level payload corruption. These are documented inline in their respective table rows.
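To make the MPEG-TS case concrete, here is a minimal sketch of a sync byte and continuity counter scan. It assumes the standard 4-byte packet header layout and simplifies several details (it tolerates any repeated counter as a duplicate and ignores the discontinuity indicator); `scan_ts` and `make_packet` are hypothetical names.

```python
TS_PACKET = 188
SYNC = 0x47
NULL_PID = 0x1FFF


def make_packet(pid: int, cc: int, fill: int = 0) -> bytes:
    """Build a payload-only packet for the usage example below."""
    header = bytes([SYNC, (pid >> 8) & 0x1F, pid & 0xFF, 0x10 | (cc & 0x0F)])
    return header + bytes([fill]) * (TS_PACKET - 4)


def scan_ts(data: bytes) -> dict:
    """Count sync byte errors and continuity counter jumps per PID."""
    sync_errors = 0
    cc_errors = 0
    last_cc = {}  # PID -> last continuity counter seen
    for off in range(0, len(data) - TS_PACKET + 1, TS_PACKET):
        pkt = data[off:off + TS_PACKET]
        if pkt[0] != SYNC:
            sync_errors += 1
            continue
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
        has_payload = bool(pkt[3] & 0x10)
        cc = pkt[3] & 0x0F
        if pid == NULL_PID or not has_payload:
            continue  # the counter only advances on packets with payload
        prev = last_cc.get(pid)
        if prev is not None and cc != (prev + 1) & 0x0F and cc != prev:
            cc_errors += 1
        last_cc[pid] = cc
    return {"sync_errors": sync_errors, "cc_errors": cc_errors}


p0, p1, p2 = (make_packet(0x100, n) for n in range(3))
stream = p0 + p1 + p2

flipped = bytearray(stream)
flipped[200] ^= 0x01  # one bit inside the second packet's payload

print(scan_ts(stream))          # {'sync_errors': 0, 'cc_errors': 0}
print(scan_ts(bytes(flipped)))  # {'sync_errors': 0, 'cc_errors': 0}
print(scan_ts(p0 + p2))         # {'sync_errors': 0, 'cc_errors': 1}
```

A dropped packet shows up as a counter jump, while the payload bit flip passes unnoticed.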
* Video containers reach the Integrity level when video frames can be decoded (supported codecs: H.264, H.265, AV1, VP9, ProRes, MPEG-1/2, MJPEG). Validation falls back to Structure for unsupported codecs or files larger than 100MB.
Video Codecs
| Codec | Containers | Library | License | Validation | Notes | GT |
| --- | --- | --- | --- | --- | --- | --- |
| H.264/AVC | MP4, MKV, MOV | Pure Zig | — | ✅ Full Decode | NAL/SPS/PPS/slice + CAVLC/CABAC entropy decode | — |
| H.265/HEVC | MP4, MKV, MOV | Pure Zig | — | ✅ Full Decode | NAL/VPS/SPS/PPS validation | — |
| AV1 | MP4, MKV, WebM | Pure Zig | — | ✅ Full Decode | OBU sequence/frame header validation | — |
| VP9 | WebM, MKV | Pure Zig | — | ✅ Full Decode | Frame header parsing | — |
| VP8 | WebM | Pure Zig | — | ✅ Full Decode | DCT coefficient decode via boolean arithmetic coder + IDCT | |
Note: Full byte-level validation of legacy Office streams is planned, but each format's specification runs roughly 600–1000 pages, so deeper support will take time.
CRC-16/IBM (poly 0xA001) per entry header (classic); CRC-16/CCITT (poly 0x1021) per entry (v5)
Checksum
1
StuffIt X
.sitx
"StuffIt!" magic (8 bytes)
Element stream walk to type-0 terminator; per-element CRC-32 where present
Checksum
1
Compact Pro
.cpt
Header magic, archive structure
—
Structure
—
DAW Project Formats
| Format | Extensions | Basic Validation | Deep Validation | Max Depth | GT |
| --- | --- | --- | --- | --- | --- |
| Ableton Live | .als | Gzip-compressed XML detection | Full gzip CRC32 verification | Checksum | — |
| Reaper | .rpp | UTF-8 text, `<REAPER_PROJECT` header | Bracket structure parsing | Structure | — |
| Logic Pro X | .logicx | ZIP-based package structure | CRC32 per entry | Checksum | — |
| FL Studio | .flp | FLhd/FLdt chunk structure | Full TLV event parsing | Full Decode | — |
| Studio One | .song | ZIP-based, metainfo.xml detection | CRC32 per entry | Checksum | — |
| Bitwig | .bwproject | Size check + ZIP rejection only | — | ⚠️ Stub | — |
| Cubase | .cpr | RIFF magic only (no chunk parsing) | — | ⚠️ Stub | — |
| Pro Tools | .ptx | Size check + ZIP rejection only | — | ⚠️ Stub | — |
| GarageBand | .band | Size check only | — | ⚠️ Stub | — |
| Reason | .reason | Size check + ZIP rejection only | — | ⚠️ Stub | — |
Note: Bitwig, Pro Tools, GarageBand, and Reason use proprietary undocumented binary formats.
These return WARN (not OK) — format is recognized but corruption detection is unreliable.
Deep validation would require reverse-engineering these formats.
Database
| Format | Extensions | Basic Validation | Deep Validation | Max Depth | GT |
| --- | --- | --- | --- | --- | --- |
| SQLite | .db, .sqlite, .sqlite3 | Header magic, page structure | PRAGMA integrity_check | Integrity | — |
| Microsoft Access 97-2003 | .mdb | Magic (00 01 00 00) + "Standard Jet DB" | Structural validation | Structure | — |
| Microsoft Access 2007+ | .accdb | Magic (00 01 00 00) + "Standard ACE DB" | Structural validation | Structure | — |
Version Control
| Format | Extensions | Basic Validation | Deep Validation | Max Depth | GT |
| --- | --- | --- | --- | --- | --- |
| Git Repository | .git/ | .git directory + loose object SHA-1 verification | Packfile trailing SHA-1, pack index SHA-1, .git/index checksum | Integrity | — |
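Loose object verification can be sketched as follows. A loose object is a zlib-compressed stream of `"<type> <size>\0<body>"`, and its name is the SHA-1 of that uncompressed stream. The sketch assumes a SHA-1 repository (not the newer SHA-256 object format), and the function names are hypothetical.

```python
import hashlib
import zlib


def make_loose_object(kind: str, body: bytes) -> tuple[str, bytes]:
    """Return (object id, compressed bytes) as Git would store them."""
    store = f"{kind} {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def verify_loose_object(object_id: str, compressed: bytes) -> bool:
    """Decompress the object and check that its SHA-1 matches its name."""
    try:
        store = zlib.decompress(compressed)
    except zlib.error:
        return False  # damaged deflate stream or Adler-32 mismatch
    return hashlib.sha1(store).hexdigest() == object_id


oid, blob = make_loose_object("blob", b"hello\n")
print(verify_loose_object(oid, blob))  # True

damaged = bytearray(blob)
damaged[len(damaged) // 2] ^= 0x01
print(verify_loose_object(oid, bytes(damaged)))  # False
```

In a real repository the object id comes from the path (`.git/objects/ab/cdef...`), so the file name itself acts as the expected hash.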
Scientific Data Formats
Format
Extensions
Basic Validation
Deep Validation
Max Depth
GT
HDF5
.h5, .hdf5
Signature (89 HDF), superblock
Superblock version, offset/length sizes, root group address
Cascading control totals: account (49) → group (98) → file (99), record counts at all levels
Integrity
1
EDI (Electronic Data Interchange)
| Format | Extensions | Basic Validation | Deep Validation | Max Depth | GT |
| --- | --- | --- | --- | --- | --- |
| X12 EDI | .edi, .x12 | ISA segment parsing, self-describing delimiters | SE/GE/IEA control total cross-validation | Integrity | 1 |
| UN/EDIFACT | .edifact | UNA/UNB segment parsing, delimiter detection | UNT/UNZ message and interchange count validation | Integrity | 1 |
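Control total cross-validation for X12 can be sketched at the transaction set level: SE01 declares how many segments the set contains (counting ST and SE themselves) and SE02 repeats the control number from ST02. This minimal illustration takes the delimiters as parameters instead of reading them from the ISA segment, and the function name is hypothetical.

```python
def check_transaction_sets(text: str, elem_sep: str = "*",
                           seg_term: str = "~") -> list[str]:
    """Return a list of problems found in ST/SE control totals."""
    problems = []
    segments = [s.strip() for s in text.split(seg_term) if s.strip()]
    start = None        # index of the currently open ST segment
    st_control = ""
    for i, seg in enumerate(segments):
        fields = seg.split(elem_sep)
        if fields[0] == "ST":
            start = i
            st_control = fields[2] if len(fields) > 2 else ""
        elif fields[0] == "SE":
            if start is None:
                problems.append("SE without a matching ST")
                continue
            counted = i - start + 1  # includes both ST and SE
            declared = fields[1] if len(fields) > 1 else ""
            control = fields[2] if len(fields) > 2 else ""
            if not declared.isdigit() or int(declared) != counted:
                problems.append(f"SE01 declares {declared!r}, counted {counted}")
            if control != st_control:
                problems.append(f"SE02 {control!r} != ST02 {st_control!r}")
            start = None
    if start is not None:
        problems.append("ST without a closing SE")
    return problems


good = "ST*850*0001~BEG*00*SA*PO1~PO1*1*5*EA~SE*4*0001~"
print(check_transaction_sets(good))  # []

# Losing one segment breaks the declared count.
lost = good.replace("PO1*1*5*EA~", "")
print(len(check_transaction_sets(lost)))  # 1
```

The same pattern applies one level up: GE and IEA carry counts of transaction sets and functional groups.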
PIM (Personal Information Management)
| Format | Extensions | Basic Validation | Deep Validation | Max Depth | GT |
| --- | --- | --- | --- | --- | --- |
| iCalendar | .ics | BEGIN/END nesting, VERSION/PRODID required | Component validation (VEVENT/VTIMEZONE), DTSTART format | Integrity | 1 |
| vCard | .vcf | BEGIN/END:VCARD envelope | Version-specific required properties (FN for v4, N+FN for v3) | Integrity | 1 |
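The BEGIN/END nesting check shared by iCalendar and vCard amounts to a stack walk. A minimal sketch with a hypothetical function name, which does not unfold continuation lines and does not check required properties:

```python
def check_nesting(text: str) -> bool:
    """Return True when every BEGIN:X is closed by a matching END:X."""
    stack = []
    for line in text.splitlines():
        # Folded continuation lines start with whitespace, so they can
        # never match the prefixes below.
        upper = line.rstrip().upper()
        if upper.startswith("BEGIN:"):
            stack.append(upper[len("BEGIN:"):])
        elif upper.startswith("END:"):
            if not stack or stack.pop() != upper[len("END:"):]:
                return False
    return not stack


calendar = "\r\n".join([
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "PRODID:-//example//sketch//EN",
    "BEGIN:VEVENT",
    "DTSTART:20240101T000000Z",
    "END:VEVENT",
    "END:VCALENDAR",
])
print(check_nesting(calendar))                            # True
print(check_nesting(calendar.replace("END:VEVENT", "")))  # False
```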
Crypto/Certificate Formats
| Format | Extensions | Basic Validation | Deep Validation | Max Depth | GT |
| --- | --- | --- | --- | --- | --- |
| PEM | .pem, .crt, .key | Header/footer matching (-----BEGIN/END-----) | Base64 validation + ASN.1 DER parsing inside | Integrity | 1 |
| DER | .der, .cer | ASN.1 tag 0x30 (SEQUENCE), length encoding | Recursive TLV parsing with depth limit | Integrity | 1 |
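Recursive TLV parsing with a depth limit can be sketched as below. The depth limit of 32 and the 4-byte cap on long-form lengths are assumptions chosen for illustration, multi-byte tags are rejected for brevity, and `parse_der` is a hypothetical name.

```python
MAX_DEPTH = 32  # assumed limit; the project's actual value is not stated


def parse_der(data: bytes, depth: int = 0) -> int:
    """Walk every TLV in data; return the element count or raise ValueError."""
    if depth > MAX_DEPTH:
        raise ValueError("nesting too deep")
    pos = 0
    count = 0
    while pos < len(data):
        tag = data[pos]
        pos += 1
        if tag & 0x1F == 0x1F:
            raise ValueError("multi-byte tags are not handled in this sketch")
        if pos >= len(data):
            raise ValueError("truncated before length")
        first = data[pos]
        pos += 1
        if first < 0x80:
            length = first  # short form
        else:
            n = first & 0x7F  # long form: n length bytes follow
            if n == 0 or n > 4 or pos + n > len(data):
                raise ValueError("bad long-form length")  # n == 0 is indefinite
            length = int.from_bytes(data[pos:pos + n], "big")
            pos += n
        if pos + length > len(data):
            raise ValueError("value exceeds available bytes")
        count += 1
        if tag & 0x20:  # constructed: the value is itself a run of TLVs
            count += parse_der(data[pos:pos + length], depth + 1)
        pos += length
    return count


# SEQUENCE { INTEGER 5, SEQUENCE { NULL } }
sample = bytes.fromhex("300702010530020500")
print(parse_der(sample))  # 4
```

Truncating `sample` by one byte makes the outer length exceed the buffer, which raises `ValueError`.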
Encrypted Files
| Format | Extensions | Basic Validation | Deep Validation | Max Depth | GT |
| --- | --- | --- | --- | --- | --- |
| Encrypted ZIP | .zip | Structure validated, encryption detected | Skipped (no password) | Structure | — |
| Encrypted PDF (with password) | .pdf | Structure validated, /Encrypt detected | Skipped (password required) | Structure | — |
| Encrypted PDF (empty password) | .pdf | Full validation via decryption | All embedded content decoded | Full Decode | — |
Note: Encrypted files are validated structurally. We cannot verify internal checksums without the decryption key. A separate upcoming product will offer parity-based protection/repair for encrypted bytes without exposing plaintext.
Trivial Protection Circumvention Policy
This project intentionally circumvents trivial or ineffective protection mechanisms solely to validate data integrity. This includes:
| Protection Type | Mechanism | Circumvention |
| --- | --- | --- |
| PDF empty-password encryption | Encrypted streams, empty user password | Decrypt with empty password |
| PDF owner-password-only | Restricts print/copy, user can open | Decrypt and validate normally |
Why we do this:
- These protection mechanisms provide no actual security (empty password = no password)
- The "protection" merely adds processing overhead without preventing access
- Our goal is to validate data integrity, not enforce ineffective restrictions
- Files with trivial protection are marked with NOTICE in CLI output
What we do NOT circumvent:
- Files requiring an actual password to open
- Strong encryption (AES-256, etc.)
- Any protection that would require key guessing or brute force
Repairable: PDFs with trivial encryption can potentially be "repaired" by removing the encryption entirely, resulting in a smaller, faster file with identical content. This is flagged as MalformationType.pdf_trivial_encryption.
Unknown/Fallback
| Condition | Validation | Depth |
| --- | --- | --- |
| Unknown format with valid UTF-8 | Character encoding validation | Structure |
| Unknown format, not UTF-8 | No validation (future parity product can cover this) | — |
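The fallback rule can be sketched as a strict UTF-8 decode. The function name and its return values are hypothetical and simply mirror the two rows of the fallback table:

```python
from typing import Optional


def fallback_depth(data: bytes) -> Optional[str]:
    """Return "structure" for valid UTF-8, or None when nothing can be checked."""
    try:
        data.decode("utf-8")  # strict mode raises on any invalid sequence
    except UnicodeDecodeError:
        return None
    return "structure"


print(fallback_depth("plain text ✅".encode("utf-8")))  # structure
print(fallback_depth(b"\xff\xfe\x00\x01"))              # None
```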
Notes
Continuous Improvement
This format list is continuously updated. We regularly add:
New formats based on user requests and common use cases
Deeper validation for existing formats as we implement additional checks
Updates are included in regular app updates at no extra cost.
Request a Format
Don't see a format you need? Contact us! We prioritize formats based on user demand.
Email: [TBD - support email]
Website: [TBD - website URL]
Validation Philosophy
- Structure: We verify the file's skeleton is intact (headers, chunk boundaries, required sections)
- Checksum: We verify any embedded integrity data the format provides
- Full Decode: We process the entire file through its codec/parser, catching subtle corruption
- Integrity: Database-level or format-complete verification (every byte covered)
- GT (Ground Truth): Number of real-world example files we've verified our validator against
Lenience & Repairability (Future Work)
Some files are reported as valid with a warning when popular readers can open them but they exhibit specific malformations. We do this because certain error types are theoretically repairable and may be fixed automatically in a future release; no repair is attempted today.
Example: truncated JBIG2 streams may be repairable in principle, but we currently only warn and do not attempt repair.
We don't perform semantic validation—we detect bitrot and corruption, not authoring errors. A valid JPEG with poor composition is still a valid JPEG.
Deep Validation Trade-offs
Deep validation (Full Decode, Integrity) is more thorough but slower:
- JPEG deep: ~10-50ms per file (full decode via libjpeg-turbo)
- FLAC deep: ~100ms-1s per file (full MD5 verification)
- ZIP deep: Depends on archive size (decompresses all entries)
Basic (structural) validation is fast enough for real-time scanning of large libraries.
GPL Blocked: WMV/VC-1, RealVideo/Audio - would require optional plugin architecture (DTS was previously blocked, now implemented in pure Zig)
Note: Six C library dependencies (OpenH264, libde265, dav1d, libvpx, libheif, libfdk-aac) were replaced with pure-Zig validators in February 2026. VideoToolbox (macOS hardware decoder) was also removed.
Measured via scripts/corruption-experiment with 100 trials each, PCG32 seed=42.
Sniper = single random bit flip. Shotgun = 4KB random overwrite at random offset.
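The two corruption modes can be sketched as follows. This is not `scripts/corruption-experiment`: it uses Python's built-in generator as a stand-in for PCG32 (which is not in the standard library), and the CRC32 trailer validator is a hypothetical example target.

```python
import random
import zlib


def sniper(data: bytes, rng: random.Random) -> bytes:
    """Flip one randomly chosen bit."""
    out = bytearray(data)
    bit = rng.randrange(len(out) * 8)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def shotgun(data: bytes, rng: random.Random, size: int = 4096) -> bytes:
    """Overwrite `size` bytes with random data at a random offset."""
    out = bytearray(data)
    size = min(size, len(out))
    start = rng.randrange(len(out) - size + 1)
    out[start:start + size] = bytes(rng.getrandbits(8) for _ in range(size))
    return bytes(out)


def detection_rate(original: bytes, validator, corrupt,
                   trials: int = 100, seed: int = 42) -> float:
    """Fraction of corrupted copies that the validator rejects."""
    rng = random.Random(seed)  # stand-in for PCG32, seeded the same way
    detected = sum(1 for _ in range(trials)
                   if not validator(corrupt(original, rng)))
    return detected / trials


def crc_validator(blob: bytes) -> bool:
    """Example target: payload followed by a little-endian CRC32 trailer."""
    return zlib.crc32(blob[:-4]) == int.from_bytes(blob[-4:], "little")


payload = bytes(range(256)) * 64
sample = payload + zlib.crc32(payload).to_bytes(4, "little")
print(detection_rate(sample, crc_validator, sniper))  # 1.0
```

A CRC32 always catches a single flipped bit, which is why the sniper rate is exactly 1.0 here; shotgun detection for a real format depends on how much of the file its checks cover.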
100%/100% Detection (perfect)
| Format | Method |
| --- | --- |
| TTF/OTF | Per-table checksums (strict mode for standalone fonts) |
| EAC3 | Full-file CRC (all frames) |
| FITS | CHECKSUM/DATASUM keywords |
| WOFF | Zlib decompress + origChecksum per table |
| Game Boy | Global checksum (sum of all ROM bytes except the two checksum bytes, compared against the big-endian u16 at 0x14E-0x14F) |
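The Game Boy global checksum is simple enough to show in full. This sketch follows the usual cartridge header convention: sum every ROM byte except the two checksum bytes themselves, keep the low 16 bits, and compare with the big-endian value stored at 0x14E-0x14F. The function names are hypothetical.

```python
def global_checksum(rom: bytes) -> int:
    """16-bit sum of all ROM bytes, skipping the stored checksum itself."""
    total = sum(rom) - rom[0x14E] - rom[0x14F]
    return total & 0xFFFF


def verify_global_checksum(rom: bytes) -> bool:
    if len(rom) < 0x150:
        return False  # too short to contain a cartridge header
    stored = (rom[0x14E] << 8) | rom[0x14F]  # big-endian u16
    return global_checksum(rom) == stored


# Build a 32 KiB test image and stamp a correct checksum into it.
rom = bytearray((i * 7) & 0xFF for i in range(0x8000))
rom[0x14E:0x150] = global_checksum(bytes(rom)).to_bytes(2, "big")
print(verify_global_checksum(bytes(rom)))  # True

rom[0x4000] ^= 0x10  # single bit flip far from the header
print(verify_global_checksum(bytes(rom)))  # False
```

Any single bit flip changes the sum by a power of two below 256, so it can never cancel out modulo 65536.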