
feat: Arabic diacritization via text2tashkeel (rawi); drop vendored libtashkeel#152

Open
JarbasAl wants to merge 1 commit into dev from feat/text2tashkeel-diacritizer

@JarbasAl

@JarbasAl JarbasAl commented Jun 7, 2026

Contributor

Replaces the vendored libtashkeel with text2tashkeel for all Arabic tashkeel. The default model is rawi-ensemble, which — unlike libtashkeel's 15-class scheme — also restores hamza and the dagger alef, fixing inconsistently-spelled input:

input        : اكرم محمد ابراهيم
old (libtashkeel): اِكْرَمْ مُحَمَّدْ ابْرَاهِيم      ← vowels only, bare ا kept
new (rawi)       : أَكْرَمَ مُحَمَّدٌ إبْرَاهِيمَ      ← hamzas restored (أ, إ)

Changes

  • Remove phoonnx/thirdparty/tashkeel/ (the vendored libtashkeel + its 4.8 MB ONNX).
  • BasePhonemizer: a lazy text2tashkeel diacritizer (model configurable, default rawi-ensemble); add_diacritics(text, lang, model=None) routes Arabic to it.
  • VoiceConfig / SynthesisConfig: arabic_diacritizer_model (default rawi-ensemble), round-tripped through config so a voice can pick a model in its inference block; voice.py threads it through.
  • The [ar] extra now requires text2tashkeel; a clear ImportError is raised if Arabic is used without it.
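
A minimal sketch of the lazy-loading and routing behaviour described in the list above, not the code in this PR. The text2tashkeel API is not shown here, so the model loader is passed in as a hypothetical `factory` callable. `LazyDiacritizer`, `require_module`, and `Diacritizer` are illustrative names. Only the `add_diacritics(text, lang, model=None)` signature and the `rawi-ensemble` default come from the description.

```python
import importlib
from typing import Callable, Dict, Optional

DEFAULT_MODEL = "rawi-ensemble"  # default model named in this PR

# Hypothetical type: any callable mapping plain text to diacritized text.
Diacritizer = Callable[[str], str]


def require_module(module_name: str, extra: str):
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this language; "
            f"install it with: pip install phoonnx[{extra}]"
        ) from exc


class LazyDiacritizer:
    """Builds each diacritizer model on first use and caches it by name."""

    def __init__(
        self,
        factory: Callable[[str], Diacritizer],  # assumption: wraps the real loader
        default_model: str = DEFAULT_MODEL,
    ) -> None:
        self._factory = factory
        self._default_model = default_model
        self._loaded: Dict[str, Diacritizer] = {}

    def add_diacritics(self, text: str, lang: str, model: Optional[str] = None) -> str:
        # Route only Arabic ("ar", "ar-SA", "ar_EG", ...) to the diacritizer.
        primary = lang.replace("_", "-").split("-")[0].lower()
        if primary != "ar":
            return text
        name = model or self._default_model
        if name not in self._loaded:
            # Nothing is loaded until the first Arabic request for this model.
            self._loaded[name] = self._factory(name)
        return self._loaded[name](text)
```

In a real integration the factory would presumably call something like `require_module(...)` for the text2tashkeel package before building the model, so that non-Arabic users never hit the import.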

Notes

  • Arabic now requires pip install phoonnx[ar] (was zero-extra via the vendored model).
  • taskeen_threshold (a libtashkeel-only knob) is removed.
  • Config + Arabic phonemizer tests pass (46).
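
For the config change listed under Changes, a hedged sketch of how a model name might round-trip through a voice's `inference` block. `SynthesisSettings` is a hypothetical stand-in for the real `VoiceConfig` / `SynthesisConfig` classes, which are not shown here. The key is written as `diacritizer_model`, following the commit message; this description calls it `arabic_diacritizer_model`, so treat the exact name as an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict

DEFAULT_DIACRITIZER_MODEL = "rawi-ensemble"  # default named in this PR


@dataclass
class SynthesisSettings:
    """Hypothetical stand-in for the synthesis-related part of a voice config."""

    diacritizer_model: str = DEFAULT_DIACRITIZER_MODEL

    @classmethod
    def from_dict(cls, config: Dict[str, Any]) -> "SynthesisSettings":
        # A voice may override the model in its "inference" block;
        # a missing block or key falls back to the default.
        inference = config.get("inference", {})
        return cls(
            diacritizer_model=inference.get(
                "diacritizer_model", DEFAULT_DIACRITIZER_MODEL
            )
        )

    def to_dict(self) -> Dict[str, Any]:
        # Write the value back under the same key so load -> save is lossless.
        return {"inference": {"diacritizer_model": self.diacritizer_model}}
```

Loading a config without the key and saving it again writes the default explicitly, which keeps older voice configs working while making the chosen model visible.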

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Contributor

Warning

Review limit reached

@JarbasAl, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 25 minutes and 31 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ced3f825-6a57-4abd-88d5-98edbbae4040

📥 Commits

Reviewing files that changed from the base of the PR and between b2faeef and 966a23d.

📒 Files selected for processing (11)
  • phoonnx/config.py
  • phoonnx/phonemizers/base.py
  • phoonnx/thirdparty/tashkeel/LICENSE
  • phoonnx/thirdparty/tashkeel/SOURCE
  • phoonnx/thirdparty/tashkeel/__init__.py
  • phoonnx/thirdparty/tashkeel/hint_id_map.json
  • phoonnx/thirdparty/tashkeel/input_id_map.json
  • phoonnx/thirdparty/tashkeel/model.onnx
  • phoonnx/thirdparty/tashkeel/target_id_map.json
  • phoonnx/voice.py
  • pyproject.toml

@github-actions

github-actions Bot commented Jun 7, 2026


Checking back in with the latest test results. 📡

I've aggregated the results of the automated checks for this PR below.

📋 Repo Health

Ensuring the codebase isn't suffering from 'technical debt' flu. 🤒

⚠️ Some required files are missing.

Latest Version: 1.15.0a1

phoonnx/version.py — Version file
README.md — README
LICENSE — License file
pyproject.toml — pyproject.toml
⚠️ setup.py — setup.py
CHANGELOG.md — Changelog
phoonnx/version.py has valid version block markers

🏷️ Release Preview

A detailed preview of the next release cycle. 🎬

Current: 1.15.0a1
Next: 1.16.0a1

Signal Value
Label feature
PR title feat: Arabic diacritization via text2tashkeel (rawi); drop vendored libtashkeel
Bump minor

✅ PR title follows conventional commit format.


🚀 Release Channel Compatibility

Predicted next version: 1.16.0a1

Channel Status Note Current Constraint
Stable Not in channel -
Testing Not in channel -
Alpha Not in channel -

🔍 Lint

Everything looks good so far! ✅

ruff: issues found — see job log

📊 Coverage

Ensuring the code doesn't have any blind spots. 🕶️

40.0% total coverage

Files below 80% coverage (36 files)
File Coverage Missing lines
phoonnx/cli.py 0.0% 98
phoonnx/thirdparty/kog2p/__init__.py 0.0% 203
phoonnx/thirdparty/mantoq/unicode_symbol2label.py 0.0% 1
phoonnx/thirdparty/bw2ipa.py 7.5% 86
phoonnx/thirdparty/mantoq/pyarabic/number.py 7.7% 371
phoonnx/thirdparty/mantoq/buck/phonetise_buckwalter.py 10.4% 180
phoonnx/thirdparty/hangul2ipa.py 16.6% 372
phoonnx/phonemizers/en.py 17.5% 104
phoonnx/thirdparty/mantoq/pyarabic/trans.py 18.2% 135
phoonnx/model_manager.py 19.9% 214
phoonnx/voice.py 21.7% 220
phoonnx/thirdparty/zh_num.py 23.1% 83
phoonnx/phonemizers/mul.py 23.9% 236
phoonnx/phonemizers/zh.py 27.0% 92
phoonnx/phonemizers/ko.py 30.4% 32
phoonnx/phonemizers/gl.py 31.1% 42
phoonnx/phonemizers/ar.py 31.2% 44
phoonnx/thirdparty/mantoq/buck/tokenization.py 32.5% 27
phoonnx/thirdparty/phonikud/__init__.py 35.3% 11
phoonnx/phonemizers/ja.py 36.0% 32
phoonnx/phonemizers/fa.py 36.4% 14
phoonnx/phonemizers/pt.py 38.1% 13
phoonnx/thirdparty/mantoq/pyarabic/normalize.py 38.1% 13
phoonnx/phonemizers/base.py 38.2% 76
phoonnx/thirdparty/mantoq/pyarabic/araby.py 39.7% 298
phoonnx/phonemizers/he.py 40.0% 12
phoonnx/phonemizers/vi.py 40.0% 12
phoonnx/thirdparty/mantoq/pyarabic/stack.py 45.5% 6
phoonnx/thirdparty/mantoq/num2words.py 47.6% 11
phoonnx/phonemizers/mwl.py 50.0% 8
phoonnx/tokenizer.py 52.4% 147
phoonnx/thirdparty/mantoq/__init__.py 60.0% 10
phoonnx/thirdparty/mantoq/pyarabic/arabrepr.py 60.0% 6
phoonnx/engines/vocoders/griffinlim.py 61.4% 27
phoonnx/config.py 61.5% 130
phoonnx/engines/optispeech.py 69.6% 24

Full report: download the coverage-report artifact.

🔒 Security (pip-audit)

Ensuring our cookies are secure and fresh. 🍪

✅ No known vulnerabilities found (61 packages scanned).

⚖️ License Check

The license report is filed and ready for review. 📁

❌ License violations detected (43 packages) — review required before merging.

Dependency                          License Name                                            License Type         Misc                                    
phoonnx:1.3.3                       Error                                                   Error                                                        

License Type                        Found                                                  
Error                               1

License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more

Full breakdown — 43 packages
Package Version License URL
build 1.5.0 MIT link
certifi 2026.5.20 Mozilla Public License 2.0 (MPL 2.0) link
charset-normalizer 3.4.7 MIT link
click 8.4.1 BSD-3-Clause link
combo_lock 0.3.1 Apache-2.0 link
dateparser 1.4.0 BSD License link
filelock 3.29.1 MIT link
flatbuffers 25.12.19 Apache Software License link
idna 3.18 BSD-3-Clause link
json-database 0.10.1 MIT link
kthread 0.2.3 MIT License link
langcodes 3.5.1 MIT License link
markdown-it-py 4.2.0 MIT License link
mdurl 0.1.2 MIT License link
memory-tempfile 2.2.3 MIT License link
numpy 2.4.6 BSD-3-Clause AND 0BSD AND MIT AND Zlib AND CC0-1.0 link
onnxruntime 1.26.0 MIT License link
ovos-config 2.1.1 Apache-2.0 link
ovos-date-parser 0.7.0a5 Apache Software License link
ovos-number-parser 0.5.1 Apache Software License link
ovos-utils 0.8.5 Apache-2.0 link
packaging 26.2 Apache-2.0 OR BSD-2-Clause link
pexpect 4.9.0 ISC License (ISCL) link
phoonnx 1.15.0a1 Apache Software License link
protobuf 7.35.0 3-Clause BSD License link
ptyprocess 0.7.0 ISC License (ISCL) link
pyee 13.0.1 MIT License link
Pygments 2.20.0 BSD-2-Clause link
pyproject_hooks 1.2.0 MIT License link
python-dateutil 2.9.0.post0 Apache Software License; BSD License link
pytz 2026.2 MIT License link
PyYAML 6.0.3 MIT License link
quebra-frases 0.3.7 Apache Software License link
regex 2026.5.9 Apache-2.0 AND CNRI-Python link
requests 2.34.2 Apache Software License link
rich 13.9.4 MIT License link
rich-click 1.9.8 MIT License link
six 1.17.0 MIT License link
typing_extensions 4.15.0 PSF-2.0 link
tzlocal 5.3.1 MIT License link
unicode-rbnf 2.4.0 MIT License
urllib3 2.7.0 MIT link
watchdog 6.0.0 Apache Software License link

Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.

🔨 Build Tests

Checking if the code is properly tempered. ⚔️

✅ All versions pass

Python Build Install Tests
3.10
3.11
3.12
3.13
3.14

Delivered by the OVOS Automated Messenger 🕊️

@JarbasAl JarbasAl force-pushed the feat/text2tashkeel-diacritizer branch from a3a39ee to 676ce6d June 7, 2026 15:56
@JarbasAl JarbasAl changed the title from "feat: optional text2tashkeel Arabic diacritizer (restores hamza/dagger-alef)" to "feat: Arabic diacritization via text2tashkeel (rawi); drop vendored libtashkeel" Jun 7, 2026
@github-actions github-actions Bot added feature and removed feature labels Jun 7, 2026
@JarbasAl JarbasAl marked this pull request as ready for review June 7, 2026 15:58
…ibtashkeel

Replace the vendored libtashkeel with text2tashkeel for all Arabic tashkeel. The
default model is rawi-ensemble, which restores hamza and the dagger alef in addition
to the standard marks — so it fixes inconsistently-spelled input that libtashkeel's
15-class scheme left as-is (e.g. bare 'ا' -> 'أ').

- remove phoonnx/thirdparty/tashkeel/ (vendored libtashkeel + its 4.8 MB onnx).
- BasePhonemizer: a lazy text2tashkeel diacritizer; add_diacritics(text, lang,
  model=None) routes Arabic to it.
- config: a generic 'diacritizer_model' key (default rawi-ensemble) on VoiceConfig
  and SynthesisConfig, round-tripped through config and threaded through voice.py —
  named generically so future languages that need a diacritizer model reuse it.
- the [ar] extra requires text2tashkeel; a clear ImportError is raised if missing.

Config and Arabic phonemizer tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@JarbasAl

JarbasAl commented Jun 8, 2026

Contributor Author

Companion tracking issue: #175
