unidecode-rs - Unicode → ASCII transliteration faithful to Python

Fast Rust implementation (optional Python bindings via PyO3) targeting bit‑for‑bit equivalence with Python Unidecode. Provides:

Same output as Unidecode for all covered tables
Higher performance: about 6.2x faster than Python Unidecode on the sample benchmark under Performance (see also the perf snapshot in tests)
Golden tests comparing dynamically against the Python version
High coverage on critical paths (bitmap + per‑block dispatch)

Repository layout

src/                # Core library sources + generated tables
benches/            # Criterion benchmarks (Rust)
scripts/            # Developer helper scripts (bench_compare, coverage)
tests/              # Rust integration & golden tests
tests/python/       # Python parity & upstream harness
python/             # Python shim for upstream-compatible API
docs/               # Coverage and performance documentation

Quick summary

Rust usage: unidecode_rs::unidecode("déjà") -> "deja"
Python usage: build extension with maturin develop --features python
Idempotence: unidecode(unidecode(x)) == unidecode(x) (after first pass everything is ASCII)
Golden tests: ensure exact parity with Python

Rust example

use unidecode_rs::unidecode;

fn main() {
    println!("{}", unidecode("PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ")); // PRILIS ZLUTOUCKY KUN
}

Install / build (Rust only)

cargo add unidecode-rs
# or add manually in Cargo.toml then
cargo build

Build the Python extension (development)

Prerequisites: Rust stable, Python ≥3.8, pip.

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip maturin
maturin develop --release --features python
python -c "import unidecode_rs; print(unidecode_rs.unidecode('déjà vu'))"

To build a distributable wheel:

maturin build --release --features python -o dist/
# Wheels are placed in dist/ directory
pip install dist/unidecode_pyo3-*.whl

Or install from PyPI:

pip install unidecode-pyo3

Python API

import unidecode_rs
print(unidecode_rs.unidecode("Příliš žluťoučký kůň"))

Minimal API: single function unidecode(text: str, errors: Optional[str] = None, replace_str: Optional[str] = None) -> str.

Idempotence - what is it?

A function is idempotent if applying it multiple times yields the same result as applying it once. Here:

unidecode(unidecode(s)) == unidecode(s)

After the first transliteration the output is pure ASCII; a second pass does nothing. A dedicated test validates this over multi‑script samples.
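
The property can be checked mechanically. The sketch below is an illustration only: `toy_unidecode` and `is_idempotent_on` are hypothetical names, and the tiny mapping is a stand-in rather than the crate's generated tables, so the block compiles on its own. With the crate available, the same check would presumably be written against `unidecode_rs::unidecode`.

```rust
/// Hypothetical stand-in for a transliterator: maps a handful of accented
/// letters to ASCII and drops any other non-ASCII character.
/// This is NOT the crate's table, only enough to illustrate the property.
fn toy_unidecode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            c if c.is_ascii() => out.push(c),
            'é' | 'è' | 'ê' => out.push('e'),
            'à' | 'â' => out.push('a'),
            'ß' => out.push_str("ss"),
            _ => {} // unmapped: dropped, similar to an "ignore" mode
        }
    }
    out
}

/// Returns true when the first pass yields ASCII and a second pass
/// leaves that output unchanged.
fn is_idempotent_on(sample: &str) -> bool {
    let once = toy_unidecode(sample);
    let twice = toy_unidecode(&once);
    once.is_ascii() && once == twice
}
```

Calling `is_idempotent_on` over a list of multi-script samples gives the shape of the dedicated test mentioned above.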

Golden tests (Python parity)

golden_equivalence tests run the Python Unidecode library in a subprocess and diff outputs across samples (Latin + accents, Cyrillic, Greek, CJK, emoji). Any mismatch fails the test.

Targeted run:

cargo test -- --nocapture golden_equivalence
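
The subprocess-and-diff idea described above can be sketched as follows. This is not the repository's test code: `python_unidecode` and `mismatches` are hypothetical helpers, and the sketch assumes a `python3` executable on PATH with the `unidecode` package installed.

```rust
use std::process::Command;

/// Asks a Python interpreter to transliterate `sample` with Unidecode.
/// Assumes `python3` is on PATH and the `unidecode` package is installed;
/// returns None if the subprocess cannot be run or exits with an error.
fn python_unidecode(sample: &str) -> Option<String> {
    let output = Command::new("python3")
        .args([
            "-c",
            "import sys; from unidecode import unidecode; sys.stdout.write(unidecode(sys.argv[1]))",
            sample,
        ])
        .output()
        .ok()?;
    if !output.status.success() {
        return None;
    }
    String::from_utf8(output.stdout).ok()
}

/// Returns the samples on which the reference and the candidate disagree.
/// A reference result of None (subprocess failure) counts as a mismatch.
fn mismatches<'a>(
    samples: &[&'a str],
    reference: impl Fn(&str) -> Option<String>,
    candidate: impl Fn(&str) -> String,
) -> Vec<&'a str> {
    samples
        .iter()
        .copied()
        .filter(|&s| reference(s).as_deref() != Some(candidate(s).as_str()))
        .collect()
}
```

A golden test in this style would pass `python_unidecode` as the reference and the Rust function as the candidate, then assert that the returned list is empty.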

Coverage & critical paths

Dispatch design:

Presence bitmap per 256‑codepoint block (BLOCK_BITMAPS) for quick negative checks.
Large generated match providing PHF table access per block.
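
A minimal sketch of this two-step dispatch, under stated assumptions: only the name BLOCK_BITMAPS comes from the text above. The `[u64; 4]` layout, the two-block toy table, and the helpers `block_lookup` and `lookup` are hypothetical, and a plain `match` stands in for the generated PHF access.

```rust
/// One 256-bit presence bitmap per block, stored as four u64 words.
/// The layout and contents are assumptions made for this sketch.
const BLOCK_BITMAPS: [[u64; 4]; 2] = [
    // Block 0x00 (U+0000..=U+00FF): only U+00E0 and U+00E9 marked present.
    // Both fall in the fourth word, which covers offsets 192..=255.
    [0, 0, 0, (1u64 << (0xE0 - 192)) | (1u64 << (0xE9 - 192))],
    // Block 0x01 (U+0100..=U+01FF): nothing mapped in this toy table.
    [0, 0, 0, 0],
];

/// Hypothetical per-block table access; the real crate is described as
/// using a large generated `match` over PHF tables instead.
fn block_lookup(block: usize, offset: usize) -> Option<&'static str> {
    match (block, offset) {
        (0, 0xE0) => Some("a"), // à
        (0, 0xE9) => Some("e"), // é
        _ => None,
    }
}

/// Looks up the ASCII replacement for `ch`, if the toy table has one.
fn lookup(ch: char) -> Option<&'static str> {
    let cp = ch as usize;
    let (block, offset) = (cp >> 8, cp & 0xFF);
    // Out-of-range block: early exit.
    let bitmap = BLOCK_BITMAPS.get(block)?;
    // Bit zero: quick negative answer without touching the tables.
    if bitmap[offset / 64] & (1u64 << (offset % 64)) == 0 {
        return None;
    }
    block_lookup(block, offset)
}
```

The point of the bitmap is that most misses cost one shift and one mask, and the larger table is consulted only when a mapping is known to exist.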

Extra tests (lookup_paths.rs + internal tests in lib.rs) exercise:

Bit zero ⇒ lookup returns None (negative path)
Bit one ⇒ lookup returns non‑empty string
Out‑of‑range block ⇒ early exit
ASCII parity / idempotence

Generate a local report with cargo llvm-cov (or a cargo alias, if one is configured). Detailed guidance is in docs/COVERAGE.md.

cargo llvm-cov --html
# Or use the provided script:
./scripts/coverage.sh

Upstream test harness

Beyond Rust & golden tests, a Python harness reuses the original upstream Unidecode test suite to assert behavioral parity.

Main file: tests/python/test_reference_suite.py

Characteristics:

Dynamically loads the upstream base test class (via _reference/upstream_loader.py).
Monkeypatches unidecode.unidecode to point to the Rust implementation (unidecode_rs.unidecode).
Implements full errors= modes (ignore, replace, strict, preserve) for parity.
Overrides surrogate tests with lean variants to avoid warning noise while maintaining assertions.
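
For orientation, Python Unidecode treats a character that has no table entry as follows: ignore drops it, replace substitutes replace_str ("?" by default), preserve keeps the original character, and strict raises an error. The sketch below mirrors that decision in Rust. `ErrorMode` and `handle_unmapped` are hypothetical names for illustration, not the crate's API.

```rust
/// Hypothetical enum mirroring the `errors=` values named above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorMode {
    Ignore,
    Replace,
    Strict,
    Preserve,
}

/// Applies the chosen mode to an unmapped character `ch`, appending to `out`.
/// In Strict mode nothing is appended and the offending character is
/// returned as the error value.
fn handle_unmapped(
    ch: char,
    mode: ErrorMode,
    replace_str: &str,
    out: &mut String,
) -> Result<(), char> {
    match mode {
        ErrorMode::Ignore => {}                          // drop silently
        ErrorMode::Replace => out.push_str(replace_str), // e.g. "?"
        ErrorMode::Preserve => out.push(ch),             // keep the original character
        ErrorMode::Strict => return Err(ch),             // caller reports the failure
    }
    Ok(())
}
```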

Run only this suite:

pytest -q tests/python/test_reference_suite.py

Expected (evolving) report:

14 passed, 2 xfailed, 4 xpassed  # current example

xfail / xpass policy:

A temporary xfail marker is removed once the feature is implemented, so a former xfail that now passes becomes a normal pass.

Parity roadmap:

(Done) Implement errors= modes.
Finalize surrogate handling parity (optional warning replication toggle).
Extend tables to cover remaining mathematical alphanumeric symbols not yet mapped (e.g., script variants currently partial).
Add multi‑corpus benchmarks (Latin, mixed CJK, emoji) for stable metrics.
Provide exhaustive table diff script (block by block) with machine‑readable output.

Current limitations:

Some mathematical script / stylistic letter ranges may still map to empty until table extension is complete.
Generated table lines reported as unexecuted in coverage are data-only and carry little semantic value.

How to contribute:

Add a targeted parity test (Rust or Python) reproducing a divergence.
Extend the table or adjust logic.
Run pytest tests/python/test_reference_suite.py and cargo test.
Update this section if a batch of former gaps is closed.

Performance

🚀 Optimized for Speed: Current implementation is ~6.2x faster than Python Unidecode.

Benchmark results (on sample text with 10K iterations):

Python Unidecode: 77.9 ms
Rust unidecode-rs: 12.6 ms
Speedup: 6.2x

Key optimizations:

Zero-copy for pure ASCII input (via Cow<str>)
Unrolled byte scanning for ASCII sequences
Smart capacity pre-allocation (CJK-aware)
Selective NFKD decomposition (only for mathematical symbols)
Optimized PyO3 bindings (minimal conversions)
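
The first item, the zero-copy ASCII path, can be illustrated with `std::borrow::Cow`. This is a sketch of the general technique, not the crate's code: `transliterate`, its `map_char` parameter, and `toy_map` are hypothetical names.

```rust
use std::borrow::Cow;

/// Returns the input unchanged (borrowed, no allocation) when it is already
/// ASCII; otherwise builds an owned String. `map_char` is a hypothetical
/// stand-in for the real table lookup.
fn transliterate<'a>(input: &'a str, map_char: fn(char) -> &'static str) -> Cow<'a, str> {
    if input.is_ascii() {
        return Cow::Borrowed(input); // zero-copy fast path
    }
    // Slow path: pre-allocate at least the input's byte length.
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        if ch.is_ascii() {
            out.push(ch);
        } else {
            out.push_str(map_char(ch));
        }
    }
    Cow::Owned(out)
}

/// Toy mapping used only for this example; unmapped characters become "".
fn toy_map(ch: char) -> &'static str {
    match ch {
        'é' => "e",
        'à' => "a",
        _ => "",
    }
}
```

Callers that only read the result can use the `Cow` as a `&str` directly, and `into_owned()` allocates only when a `String` is actually required.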

For detailed benchmarks:

# Criterion benchmarks (Rust)
cargo bench

# Python vs Rust comparison
python scripts/bench_compare.py

See OPTIMIZATIONS.md for implementation details.

Philosophy

Fidelity: match Python before adding new rules.
Safety: no panics for any valid Unicode scalar value.
Performance: avoid unnecessary copies (ASCII fast path, heuristic pre‑allocation).
Maintainability: generated code isolated, core logic compact.

Development / tests

cargo test
# (optional) fallback feature using deunicode
cargo test --features fallback-deunicode

Python tests (after building extension):

pytest tests/python

License

GPL-3.0-or-later. Tables derived from public data of the Python Unidecode project.

Acknowledgements

Original Python project Unidecode
Rust & PyO3 community
