Commit b59f98f

use fancy-regex instead of onig as tokenizers regex library (#172)
The version of Oniguruma used in `onig_sys` doesn't build on GCC 15, and the Oniguruma project itself was archived last week, so this PR switches tokenizers to the `fancy-regex` backend. `fancy-regex` also requires enabling the `unstable_wasm` feature until huggingface/tokenizers#1772 lands. That flag has no ill effects, though, since everything WASM-related downstream is behind `target_arch` checks. **tl;dr**: This fixes builds on Linux distros with newer GCC versions, such as Arch Linux and Fedora.
1 parent c89e386 commit b59f98f

2 files changed: +14 -37 lines changed

Cargo.lock

Lines changed: 10 additions & 36 deletions
Some generated files are not rendered by default.

toktrie_hf_tokenizers/Cargo.toml

Lines changed: 4 additions & 1 deletion
@@ -11,5 +11,8 @@ toktrie = { workspace = true }
 serde = { version = "1.0.217", features = ["derive"] }
 serde_json = "1.0.138"
 anyhow = "1.0.95"
-tokenizers = { version = ">=0.20.0, <1.0.0", default-features = false, features = ["onig"] }
+tokenizers = { version = ">=0.20.0, <1.0.0", default-features = false, features = [
+    "unstable_wasm",
+    "fancy-regex",
+] }
 log = "0.4.25"
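
The commit message says WASM-related code downstream sits behind `target_arch` checks, which is why enabling `unstable_wasm` should not affect native builds. As a rough illustration of that gating pattern (the function name `regex_backend_path` is hypothetical and does not come from toktrie or tokenizers), a minimal sketch might look like this:

```rust
// Illustrative sketch only: `regex_backend_path` is a hypothetical name,
// not a function from toktrie or tokenizers.

/// Compiled only when the build targets WebAssembly.
#[cfg(target_arch = "wasm32")]
pub fn regex_backend_path() -> &'static str {
    "wasm32"
}

/// Compiled on every other target (for example x86_64 Linux).
#[cfg(not(target_arch = "wasm32"))]
pub fn regex_backend_path() -> &'static str {
    "native"
}
```

Only one of the two definitions exists in any given build, so on a native target the WASM variant is never compiled, whichever Cargo features are enabled.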
