rielas/kolmox

Kolmox — HTML Structural Similarity in Rust

Kolmox measures the Normalized Compression Distance (NCD) between text documents using Brotli or Zstd compression. It was built to compare HTML page structure, aiming to be faster and more accurate than html-similarity or niteru while avoiding the brittleness of XML diffing.

Features

  • Brotli and Zstd backends with tunable quality/speed tradeoffs
  • In-memory caching for batch comparisons (no redundant compression)
  • HTML-aware filters to strip noise before comparison
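
To give a feel for what an HTML-aware filter might do, here is a minimal sketch that reduces a page to its tag skeleton before comparison. The `tag_skeleton` function is hypothetical and is not Kolmox's actual filter API; it is a naive scanner that assumes no `>` inside attribute values and no `<` inside script bodies.

```rust
/// Hypothetical helper (not part of Kolmox): keep only tag names,
/// dropping text nodes, attributes, comments and the doctype.
fn tag_skeleton(html: &str) -> String {
    let mut out = String::new();
    let mut chars = html.chars();
    while let Some(c) = chars.next() {
        if c != '<' {
            continue; // text between tags is treated as noise
        }
        // Collect everything up to the closing '>' of this tag.
        let mut inner = String::new();
        for d in chars.by_ref() {
            if d == '>' {
                break;
            }
            inner.push(d);
        }
        // Skip comments, doctype and processing instructions.
        if inner.starts_with('!') || inner.starts_with('?') {
            continue;
        }
        // First whitespace-separated token is the (possibly closing) tag name.
        let name = inner
            .split_whitespace()
            .next()
            .unwrap_or("")
            .trim_end_matches('/')
            .to_ascii_lowercase();
        if name.is_empty() {
            continue;
        }
        out.push('<');
        out.push_str(&name);
        out.push('>');
    }
    out
}
```

Under these assumptions, two pages that differ only in text and attributes reduce to the same skeleton, so a compression-based distance then reflects structure rather than content.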

Installation

[dependencies]
kolmox = "0.1.2"

Distance Formula

Normalized Compression Distance is calculated as:

$$ NCD(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))} $$

Where:

  • $C(x)$ is the compressed size of text $x$
  • $C(xy)$ is the compressed size of the concatenation of texts $x$ and $y$
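  • The result typically falls between 0 (identical) and 1 (completely different); because real compressors are imperfect, identical inputs usually score slightly above 0 and unrelated inputs can score slightly above 1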
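
As a rough illustration of the formula, the sketch below computes NCD from any function that returns a compressed size. The names `ncd` and `lz78_phrases` are hypothetical and not part of Kolmox. The LZ78 phrase count is only a toy stand-in for Brotli or Zstd, chosen so the example needs no external crates.

```rust
use std::collections::HashSet;

/// Toy stand-in for a real compressor (assumption, not Kolmox code):
/// the number of phrases in an LZ78 parse of `data`.
fn lz78_phrases(data: &[u8]) -> usize {
    let mut dict: HashSet<Vec<u8>> = HashSet::new();
    let mut phrase: Vec<u8> = Vec::new();
    let mut count = 0;
    for &b in data {
        phrase.push(b);
        if !dict.contains(&phrase) {
            // First time this phrase is seen: emit it and start a new one.
            dict.insert(phrase.clone());
            phrase.clear();
            count += 1;
        }
    }
    if !phrase.is_empty() {
        count += 1; // trailing phrase that was already in the dictionary
    }
    count
}

/// NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
/// where `c` returns the compressed size of its input.
fn ncd(x: &[u8], y: &[u8], c: impl Fn(&[u8]) -> usize) -> f64 {
    let cx = c(x);
    let cy = c(y);
    let xy = [x, y].concat();
    let cxy = c(&xy);
    let max = cx.max(cy);
    if max == 0 {
        return 0.0; // both inputs compress to nothing; treat as identical
    }
    (cxy as f64 - cx.min(cy) as f64) / max as f64
}
```

With this toy compressor, `ncd(b"abababab", b"abababab", lz78_phrases)` evaluates to 0.4 and `ncd(b"abababab", b"cdcdcdcd", lz78_phrases)` to 0.8. The identical pair scores lower but not exactly 0, which is the usual behaviour with practical compressors.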

Configuration

Brotli Parameters

CompressBrotli::new(quality: u32, lg_window_size: u32)
// quality: 1-11 (higher = better compression, slower)
// lg_window_size: 10-24 (base-2 logarithm of the sliding window size)

CompressBrotli::recommended()      // quality=5, lg_window_size=21
CompressBrotli::max_quality()      // quality=11, lg_window_size=24
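
The window size matters for NCD because the compressor can only exploit similarity between x and y if it can still see x while compressing y. The sketch below shows the arithmetic, assuming `lg_window_size` is passed straight through to Brotli, whose format defines the window as 2^lgwin - 16 bytes. Both helper names are hypothetical and not part of Kolmox.

```rust
/// Brotli sliding window in bytes for a given lgwin: (1 << lgwin) - 16.
/// Hypothetical helper; restricted to the 10-24 range documented above.
fn brotli_window_bytes(lg_window_size: u32) -> u64 {
    assert!((10..=24).contains(&lg_window_size), "lgwin must be 10-24");
    (1u64 << lg_window_size) - 16
}

/// Returns true when the concatenation of both documents fits in the
/// window, so back-references from y can reach anywhere in x.
fn fits_in_window(x_len: usize, y_len: usize, lg_window_size: u32) -> bool {
    (x_len as u64 + y_len as u64) <= brotli_window_bytes(lg_window_size)
}
```

Under that assumption, the recommended setting (21) covers about 2 MiB of concatenated input and the maximum (24) about 16 MiB. Pairs of pages larger than the window may report inflated distances.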

Zstd Parameters

CompressZstd::recommended()        // Balanced speed/compression

Examples

Computing Distance Between Two Texts

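use kolmox::compress::{brotli::CompressBrotli, Compressor};

// `text1` and `text2` hold the two documents to compare (for example, two HTML pages).
let compressor = CompressBrotli::recommended();
let distance = compressor.get_distance(&text1, &text2);
// Lower values mean the documents are more similar.
println!("NCD: {:.4}", distance);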

Batch Processing with Cache

use kolmox::compress::{brotli::CompressBrotli, cache::InMemoryCache};

let compressor = CompressBrotli::<InMemoryCache>::recommended();
// Repeated comparisons reuse cached compression results
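
To show why caching helps in batch runs, here is a minimal sketch of memoizing compressed sizes by input content. It is not how `InMemoryCache` is implemented; the `CachedSizes` type, its fields, and the size function passed in are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical cache (not Kolmox's InMemoryCache): remembers the
/// compressed size of every input it has already seen.
struct CachedSizes<F: Fn(&[u8]) -> usize> {
    compress_len: F,
    cache: HashMap<Vec<u8>, usize>,
    misses: usize, // how many times the compressor actually ran
}

impl<F: Fn(&[u8]) -> usize> CachedSizes<F> {
    fn new(compress_len: F) -> Self {
        CachedSizes { compress_len, cache: HashMap::new(), misses: 0 }
    }

    /// Compressed size of `data`, computed at most once per distinct input.
    fn size(&mut self, data: &[u8]) -> usize {
        if let Some(&n) = self.cache.get(data) {
            return n;
        }
        let n = (self.compress_len)(data);
        self.misses += 1;
        self.cache.insert(data.to_vec(), n);
        n
    }

    /// NCD using cached sizes for x, y and their concatenation.
    fn ncd(&mut self, x: &[u8], y: &[u8]) -> f64 {
        let cx = self.size(x);
        let cy = self.size(y);
        let xy = [x, y].concat();
        let cxy = self.size(&xy);
        let max = cx.max(cy);
        if max == 0 {
            return 0.0;
        }
        (cxy as f64 - cx.min(cy) as f64) / max as f64
    }
}
```

In this sketch, comparing three documents pairwise runs the compressor 6 times (three single documents plus three concatenations) instead of 9, and the saving grows with the number of documents because each C(x) is computed once rather than once per pair.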
