Kolmox measures Normalized Compression Distance (NCD) between text documents using Brotli or Zstd compression. It was built to compare HTML page structure, where it is faster and more accurate than html-similarity or niteru and avoids the brittleness of XML diff.
- Brotli and Zstd backends with tunable quality/speed tradeoffs
- In-memory caching for batch comparisons (no redundant compression)
- HTML-aware filters to strip noise before comparison
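
The caching feature listed above can be pictured as memoizing compressed sizes, so that a document compared against many others is compressed only once. The following is a minimal sketch of that idea using only the standard library. `CachedSizer`, its fields, and the pluggable `compress_len` function are hypothetical names for illustration and are not Kolmox's actual `InMemoryCache` API.

```rust
use std::collections::HashMap;

/// Hypothetical memoizing wrapper (not Kolmox's real cache type).
/// `compress_len` is any function returning the compressed size of its input.
struct CachedSizer<F: Fn(&[u8]) -> usize> {
    compress_len: F,
    sizes: HashMap<Vec<u8>, usize>,
    misses: usize, // how many times the compressor actually ran
}

impl<F: Fn(&[u8]) -> usize> CachedSizer<F> {
    fn new(compress_len: F) -> Self {
        Self { compress_len, sizes: HashMap::new(), misses: 0 }
    }

    /// Compressed size of `data`, computed at most once per distinct input.
    fn size(&mut self, data: &[u8]) -> usize {
        if let Some(&n) = self.sizes.get(data) {
            return n; // cache hit: no compression work
        }
        let n = (self.compress_len)(data);
        self.misses += 1;
        self.sizes.insert(data.to_vec(), n);
        n
    }

    /// NCD computed from cached sizes of x, y, and their concatenation.
    fn ncd(&mut self, x: &[u8], y: &[u8]) -> f64 {
        let (cx, cy) = (self.size(x), self.size(y));
        let cxy = self.size(&[x, y].concat());
        let max = cx.max(cy);
        if max == 0 {
            return 0.0; // both inputs empty
        }
        (cxy as f64 - cx.min(cy) as f64) / max as f64
    }
}
```

Comparing one page against three others this way runs the compressor seven times (four single documents plus three concatenations) instead of nine, and the saving grows with the batch size. This sketch keys the map on the full input bytes for simplicity; keying on a hash of the input would use less memory.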
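
Add Kolmox to your `Cargo.toml`:

[dependencies]
kolmox = "0.1.2"

Normalized Compression Distance is calculated as:

$$NCD(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$
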
Where:
- $C(x)$ is the compressed size of text $x$
- $C(xy)$ is the compressed size of the concatenation of texts $x$ and $y$
- The result is normalized between 0 (identical) and 1 (completely different); with real compressors it can land slightly above 1
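
To make the definition concrete, here is a small self-contained sketch of the NCD computation. It does not use Brotli or Zstd. Instead, `toy_compressed_size` is a stand-in that counts LZ78 phrases, chosen only so the example runs with the standard library alone. Both function names are illustrative assumptions, not part of Kolmox.

```rust
use std::collections::HashSet;

/// Toy stand-in for a real compressor: counts LZ78 phrases in `data`.
/// More repetition means fewer phrases, i.e. a smaller "compressed size".
fn toy_compressed_size(data: &[u8]) -> usize {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut phrase: Vec<u8> = Vec::new();
    let mut count = 0;
    for &byte in data {
        phrase.push(byte);
        // A phrase ends the first time it is not already in the dictionary.
        if seen.insert(phrase.clone()) {
            count += 1;
            phrase.clear();
        }
    }
    // Count a trailing partial phrase, if any.
    if !phrase.is_empty() {
        count += 1;
    }
    count
}

/// NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
fn ncd(x: &str, y: &str) -> f64 {
    let cx = toy_compressed_size(x.as_bytes());
    let cy = toy_compressed_size(y.as_bytes());
    let cxy = toy_compressed_size([x, y].concat().as_bytes());
    let max = cx.max(cy);
    if max == 0 {
        return 0.0; // both inputs empty
    }
    (cxy as f64 - cx.min(cy) as f64) / max as f64
}
```

On tiny inputs the toy compressor is crude: `ncd("abab", "abab")` comes out at about 0.67 and `ncd("abab", "cdcd")` at 1.0. The ordering is what matters here. A real compressor on realistic inputs places identical documents much closer to 0.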
CompressBrotli::new(quality: u32, lg_window_size: u32)
// quality: 1-11 (higher = better compression, slower)
// lg_window_size: 10-24 (logarithmic window size)
CompressBrotli::recommended() // quality=5, lg_window_size=21
CompressBrotli::max_quality() // quality=11, lg_window_size=24

CompressZstd::recommended() // Balanced speed/compression

use kolmox::compress::{brotli::CompressBrotli, Compressor};
let compressor = CompressBrotli::recommended();
let distance = compressor.get_distance(&text1, &text2);
println!("NCD: {:.4}", distance);

use kolmox::compress::{brotli::CompressBrotli, cache::InMemoryCache};
let compressor = CompressBrotli::<InMemoryCache>::recommended();
// Repeated comparisons reuse cached compression results

- Normalized Compression Distance: https://en.wikipedia.org/wiki/Normalized_compression_distance
- Brotli: https://github.com/google/brotli
- Zstd: https://github.com/facebook/zstd