Fast Rust implementation (optional Python bindings via PyO3) targeting bit‑for‑bit equivalence with Python Unidecode. Provides:
- Same output as Unidecode for all covered tables
- Noticeably higher performance (see perf snapshot in tests)
- Golden tests comparing dynamically against the Python version
- High coverage on critical paths (bitmap + per‑block dispatch)
src/ # Core library sources + generated tables
benches/ # Criterion benchmarks (Rust)
scripts/ # Developer helper scripts (bench_compare, coverage)
tests/ # Rust integration & golden tests
tests/python/ # Python parity & upstream harness
python/ # Python shim for upstream-compatible API
docs/ # Coverage and performance documentation
- Rust usage: unidecode_rs::unidecode("déjà") -> "deja"
- Python usage: build extension with maturin develop --features python
- Idempotence: unidecode(unidecode(x)) == unidecode(x) (after first pass everything is ASCII)
- Golden tests: ensure exact parity with Python
use unidecode_rs::unidecode;
fn main() {
println!("{}", unidecode("PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ")); // PRILIS ZLUTOUCKY KUN
}
cargo add unidecode-rs
# or add manually in Cargo.toml then
cargo build
Prerequisites: Rust stable, Python ≥3.8, pip.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip maturin
maturin develop --release --features python
python -c "import unidecode_rs; print(unidecode_rs.unidecode('déjà vu'))"
To build a distributable wheel:
maturin build --release --features python -o dist/
# Wheels are placed in dist/ directory
pip install dist/unidecode_pyo3-*.whl
Or install from PyPI:
pip install unidecode-pyo3
import unidecode_rs
print(unidecode_rs.unidecode("Příliš žluťoučký kůň"))
Minimal API: single function unidecode(text: str, errors: Optional[str] = None, replace_str: Optional[str] = None) -> str.
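The errors and replace_str arguments decide what happens to characters that have no table entry. The sketch below illustrates the four modes named in this README (ignore, replace, strict, preserve) in plain Rust. It is not the crate's code: `ErrorMode`, `UnmappedChar`, `toy_lookup` and `transliterate` are hypothetical names, and the three-entry table merely stands in for the generated tables.

```rust
/// Hypothetical mirror of the Python-side `errors=` argument.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ErrorMode {
    Ignore,   // drop characters that have no mapping
    Replace,  // substitute `replace_str` for them
    Strict,   // fail on the first unmapped character
    Preserve, // keep the original character (output may be non-ASCII)
}

/// Hypothetical error returned in strict mode.
#[derive(Debug, PartialEq, Eq)]
pub struct UnmappedChar {
    pub ch: char,
    pub index: usize, // position counted in characters, not bytes
}

/// Toy table standing in for the generated tables (assumption).
fn toy_lookup(c: char) -> Option<&'static str> {
    match c {
        'é' => Some("e"),
        'à' => Some("a"),
        'ß' => Some("ss"),
        _ => None,
    }
}

/// Transliterates `text`, applying `mode` to every unmapped character.
pub fn transliterate(
    text: &str,
    mode: ErrorMode,
    replace_str: &str,
) -> Result<String, UnmappedChar> {
    let mut out = String::with_capacity(text.len());
    for (index, ch) in text.chars().enumerate() {
        if ch.is_ascii() {
            out.push(ch); // ASCII passes through unchanged
        } else if let Some(mapped) = toy_lookup(ch) {
            out.push_str(mapped);
        } else {
            match mode {
                ErrorMode::Ignore => {}
                ErrorMode::Replace => out.push_str(replace_str),
                ErrorMode::Strict => return Err(UnmappedChar { ch, index }),
                ErrorMode::Preserve => out.push(ch),
            }
        }
    }
    Ok(out)
}
```

With this toy table, calling `transliterate("déjà ☃", ErrorMode::Replace, "?")` yields `Ok("deja ?")`, while strict mode reports the snowman at character index 5.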
A function is idempotent if applying it multiple times yields the same result as applying it once. Here:
unidecode(unidecode(s)) == unidecode(s)
After the first transliteration the output is pure ASCII; a second pass does nothing. A dedicated test validates this over multi‑script samples.
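A property check along these lines can be sketched as follows. This is an illustration only, not the crate's dedicated test: `toy_unidecode` is a hypothetical stand-in with a handful of mappings, and `is_idempotent_on` is a helper name chosen for this example.

```rust
/// Toy stand-in for `unidecode` (assumption): maps a few characters
/// and drops anything else that is not ASCII.
fn toy_unidecode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            c if c.is_ascii() => out.push(c),
            'é' | 'è' => out.push('e'),
            'à' => out.push('a'),
            'ж' => out.push_str("zh"),
            'λ' => out.push('l'),
            _ => {} // unmapped: dropped
        }
    }
    out
}

/// Returns true when the first pass is pure ASCII and a second pass
/// leaves it unchanged.
fn is_idempotent_on(f: impl Fn(&str) -> String, sample: &str) -> bool {
    let once = f(sample);
    let twice = f(&once);
    once.is_ascii() && once == twice
}
```

Running the helper over samples from several scripts (for example "déjà vu", "жλ", and a string with an unmapped emoji) exercises both the mapped and the dropped paths.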
golden_equivalence tests run the Python Unidecode library in a subprocess and diff outputs across samples (Latin + accents, Cyrillic, Greek, CJK, emoji). Any mismatch fails the test.
Targeted run:
cargo test -- --nocapture golden_equivalence
Dispatch design:
- Presence bitmap per 256‑codepoint block (BLOCK_BITMAPS) for quick negative checks.
- Large generated match providing PHF table access per block.
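The two-level dispatch above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `TOY_BLOCK_BITMAPS`, `bitmap_of` and `block_entry` are hypothetical, only two blocks carry data, and nested `match` arms stand in for the generated PHF tables.

```rust
/// One 256-bit presence bitmap per block, stored as four u64 words.
type Bitmap = [u64; 4];

/// Builds a bitmap at compile time from the offsets present in a block.
const fn bitmap_of(offsets: &[u8]) -> Bitmap {
    let mut bits = [0u64; 4];
    let mut i = 0;
    while i < offsets.len() {
        let o = offsets[i] as usize;
        bits[o / 64] |= 1u64 << (o % 64);
        i += 1;
    }
    bits
}

/// Toy data (assumption): only blocks 0x00 and 0x04 have entries.
static TOY_BLOCK_BITMAPS: [Bitmap; 5] = [
    bitmap_of(&[0xE0, 0xE9]), // U+00E0 à, U+00E9 é
    [0; 4],
    [0; 4],
    [0; 4],
    bitmap_of(&[0x36]), // U+0436 ж
];

/// Stand-in for the generated per-block tables.
fn block_entry(block: usize, offset: u8) -> Option<&'static str> {
    match block {
        0x00 => match offset {
            0xE0 => Some("a"),
            0xE9 => Some("e"),
            _ => None,
        },
        0x04 => match offset {
            0x36 => Some("zh"),
            _ => None,
        },
        _ => None,
    }
}

/// Looks up one character: bitmap check first, table access second.
pub fn lookup(c: char) -> Option<&'static str> {
    let cp = c as u32;
    let block = (cp >> 8) as usize;
    let offset = (cp & 0xFF) as u8;
    // Out-of-range block: early exit.
    let bitmap = TOY_BLOCK_BITMAPS.get(block)?;
    let word = bitmap[(offset / 64) as usize];
    let present = (word & (1u64 << (offset % 64))) != 0;
    if !present {
        return None; // bit zero: negative path, no table access
    }
    block_entry(block, offset)
}
```

The point of the bitmap is that the common negative case costs one array index and one bit test, so the larger table is only consulted for code points known to be present.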
Extra tests (lookup_paths.rs + internal tests in lib.rs) exercise:
- Bit zero ⇒ lookup returns None (negative path)
- Bit one ⇒ lookup returns non‑empty string
- Out‑of‑range block ⇒ early exit
- ASCII parity / idempotence
Generate a local report with cargo llvm-cov (or an alias, if configured). Detailed guidance lives in docs/COVERAGE.md.
cargo llvm-cov --html
# Or use the provided script:
./scripts/coverage.sh
Beyond Rust & golden tests, a Python harness reuses the original upstream Unidecode test suite to assert behavioral parity.
Main file: tests/python/test_reference_suite.py
Characteristics:
- Dynamically loads the upstream base test class (via _reference/upstream_loader.py).
- Monkeypatches unidecode.unidecode to point to the Rust implementation (unidecode_rs.unidecode).
- Implements full errors= modes (ignore, replace, strict, preserve) for parity.
- Overrides surrogate tests with lean variants to avoid warning noise while maintaining assertions.
Run only this suite:
pytest -q tests/python/test_reference_suite.py
Expected (evolving) report:
14 passed, 2 xfailed, 4 xpassed # current example
xfail / xpass policy:
- A temporary xfail is removed once the feature is implemented; a former xfail that passes becomes a normal pass.
Parity roadmap:
- (Done) Implement errors= modes.
- Finalize surrogate handling parity (optional warning replication toggle).
- Extend tables to cover remaining mathematical alphanumeric symbols not yet mapped (e.g., script variants currently partial).
- Add multi‑corpus benchmarks (Latin, mixed CJK, emoji) for stable metrics.
- Provide exhaustive table diff script (block by block) with machine‑readable output.
Current limitations:
- Some mathematical script / stylistic letter ranges may still map to empty until table extension is complete.
- Generated table lines reported as unexecuted in coverage are data-only and carry little semantic value.
How to contribute:
- Add a targeted parity test (Rust or Python) reproducing a divergence.
- Extend the table or adjust logic.
- Run pytest tests/python/test_reference_suite.py and cargo test.
- Update this section if a batch of former gaps is closed.
🚀 Optimized for Speed: Current implementation is ~6.2x faster than Python Unidecode.
Benchmark results (on sample text with 10K iterations):
- Python Unidecode: 77.9 ms
- Rust unidecode-rs: 12.6 ms
- Speedup: 6.2x
Key optimizations:
- Zero-copy for pure ASCII input (via Cow<str>)
- Unrolled byte scanning for ASCII sequences
- Smart capacity pre-allocation (CJK-aware)
- Selective NFKD decomposition (only for mathematical symbols)
- Optimized PyO3 bindings (minimal conversions)
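To make the first two items concrete, here is a minimal sketch of a zero-copy ASCII fast path combined with run-wise copying of ASCII bytes. It is an illustration, not the crate's implementation: `transliterate_cow` and `toy_lookup` are hypothetical names, the scan is a plain loop rather than an unrolled one, and the capacity hint is simply the input length.

```rust
use std::borrow::Cow;

/// Toy table standing in for the generated tables (assumption).
fn toy_lookup(c: char) -> Option<&'static str> {
    match c {
        'é' => Some("e"),
        'à' => Some("a"),
        'ж' => Some("zh"),
        _ => None,
    }
}

/// Returns the input itself when it is pure ASCII, otherwise an owned
/// transliteration. Unmapped characters are dropped in this sketch.
pub fn transliterate_cow(input: &str) -> Cow<'_, str> {
    if input.is_ascii() {
        return Cow::Borrowed(input); // fast path: no allocation, no copy
    }
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < bytes.len() {
        // Copy a whole run of ASCII bytes in one push.
        let start = i;
        while i < bytes.len() && bytes[i].is_ascii() {
            i += 1;
        }
        out.push_str(&input[start..i]);
        if i == bytes.len() {
            break;
        }
        // Decode exactly one non-ASCII character and map it.
        let ch = input[i..].chars().next().expect("remainder is non-empty");
        if let Some(mapped) = toy_lookup(ch) {
            out.push_str(mapped);
        }
        i += ch.len_utf8();
    }
    Cow::Owned(out)
}
```

Callers that only need to read the result can use the `Cow` as a `&str` directly, so pure ASCII input never triggers an allocation.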
For detailed benchmarks:
# Criterion benchmarks (Rust)
cargo bench
# Python vs Rust comparison
python scripts/bench_compare.py
See OPTIMIZATIONS.md for implementation details.
- Fidelity: match Python before adding new rules.
- Safety: no panics for any valid Unicode scalar value.
- Performance: avoid unnecessary copies (ASCII fast path, heuristic pre‑allocation).
- Maintainability: generated code isolated, core logic compact.
cargo test
# (optional) fallback feature using deunicode
cargo test --features fallback-deunicode
Python tests (after building extension):
pytest tests/python
GPL-3.0-or-later. Tables derived from public data of the Python Unidecode project.
- Original Python project Unidecode
- Rust & PyO3 community