feat: Arabic diacritization via text2tashkeel (rawi); drop vendored libtashkeel #152
JarbasAl wants to merge 1 commit
Automated check results for this PR (delivered by the OVOS Automated Messenger):

- PR title: follows conventional commit format.
- Lint: ruff found issues (see job log).
- Coverage: 40.0% total; 36 files are below 80% coverage.
- Security (pip-audit): no known vulnerabilities found (61 packages scanned).
- License check: license violations detected (43 packages); review required before merging. License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more. Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.
- Build tests: all versions pass.
a3a39ee to 676ce6d
…ibtashkeel

Replace the vendored libtashkeel with text2tashkeel for all Arabic tashkeel. The default model is rawi-ensemble, which restores hamza and the dagger alef in addition to the standard marks, so it fixes inconsistently-spelled input that libtashkeel's 15-class scheme left as-is (e.g. bare 'ا' -> 'أ').

- remove phoonnx/thirdparty/tashkeel/ (vendored libtashkeel + its 4.8 MB onnx).
- BasePhonemizer: a lazy text2tashkeel diacritizer; add_diacritics(text, lang, model=None) routes Arabic to it.
- config: a generic 'diacritizer_model' key (default rawi-ensemble) on VoiceConfig and SynthesisConfig, round-tripped through config and threaded through voice.py; it is named generically so that future languages that need a diacritizer model can reuse it.
- the [ar] extra requires text2tashkeel; a clear ImportError is raised if missing.

Config and Arabic phonemizer tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
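The lazy loading and Arabic-only routing described in the commit message could look roughly like the sketch below. It is not the code from this PR: SketchPhonemizer, import_backend, and fake_loader are hypothetical names, and the real text2tashkeel API is deliberately not called, since its signatures are not shown here. Only the shape add_diacritics(text, lang, model=None), the rawi-ensemble default, and the ImportError behaviour come from the commit message.

```python
import importlib
from typing import Callable, Dict, List, Optional

Diacritizer = Callable[[str], str]


def import_backend(module_name: str = "text2tashkeel"):
    """Import the optional backend; a missing package becomes a clear error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Arabic diacritization requires '{module_name}'. "
            "Install it with: pip install phoonnx[ar]"
        ) from exc


class SketchPhonemizer:
    """Hypothetical stand-in for BasePhonemizer; only the routing is shown."""

    def __init__(self, loader: Callable[[str], Diacritizer],
                 default_model: str = "rawi-ensemble"):
        self._loader = loader                      # builds a diacritizer for a model name
        self._default_model = default_model
        self._cache: Dict[str, Diacritizer] = {}   # stays empty until the first Arabic call

    def add_diacritics(self, text: str, lang: str,
                       model: Optional[str] = None) -> str:
        # Accept "ar", "ar-SA", "ar_EG", ...; everything else passes through.
        if lang.replace("_", "-").split("-")[0].lower() != "ar":
            return text
        name = model or self._default_model
        if name not in self._cache:                # load each model at most once
            self._cache[name] = self._loader(name)
        return self._cache[name](text)


# Stand-in loader for illustration: records which models were requested and
# returns a function that merely tags the text instead of diacritizing it.
loaded: List[str] = []


def fake_loader(name: str) -> Diacritizer:
    loaded.append(name)
    return lambda text: f"[{name}] {text}"
```

In a real integration the loader would call import_backend() and wrap whatever text2tashkeel exposes, so that non-Arabic voices never pay the import or model-loading cost.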
676ce6d to 966a23d
Companion tracking issue: #175
Replaces the vendored libtashkeel with text2tashkeel for all Arabic tashkeel. The default model is rawi-ensemble, which, unlike libtashkeel's 15-class scheme, also restores hamza and the dagger alef, fixing inconsistently-spelled input (e.g. bare 'ا' -> 'أ').

Changes

- Removed phoonnx/thirdparty/tashkeel/ (the vendored libtashkeel + its 4.8 MB ONNX).
- BasePhonemizer: a lazy text2tashkeel diacritizer (model configurable, default rawi-ensemble); add_diacritics(text, lang, model=None) routes Arabic to it.
- VoiceConfig / SynthesisConfig: arabic_diacritizer_model (default rawi-ensemble), round-tripped through config so a voice can pick a model in its inference block; voice.py threads it through.
- The [ar] extra now requires text2tashkeel; a clear ImportError is raised if Arabic is used without it.

Notes

- Arabic now requires pip install phoonnx[ar] (was zero-extra via the vendored model).
- taskeen_threshold (a libtashkeel-only knob) is removed.

🤖 Generated with Claude Code
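As a rough illustration of what "round-tripped through config" could mean, the sketch below reads the model name from a voice's inference block and writes it back out. SketchVoiceConfig and its from_dict/to_dict methods are hypothetical and do not reproduce the real VoiceConfig; the key name follows the description above (the commit message calls it diacritizer_model), and "some-other-model" is a made-up model name used only to show a non-default value.

```python
from dataclasses import dataclass
from typing import Any, Dict

DEFAULT_DIACRITIZER_MODEL = "rawi-ensemble"
KEY = "arabic_diacritizer_model"  # key name as given in the PR description


@dataclass
class SketchVoiceConfig:
    """Hypothetical, heavily reduced stand-in for VoiceConfig."""

    arabic_diacritizer_model: str = DEFAULT_DIACRITIZER_MODEL

    @classmethod
    def from_dict(cls, config: Dict[str, Any]) -> "SketchVoiceConfig":
        # The model is read from the "inference" block; a missing block or
        # missing key falls back to the default.
        inference = config.get("inference", {})
        return cls(inference.get(KEY, DEFAULT_DIACRITIZER_MODEL))

    def to_dict(self) -> Dict[str, Any]:
        # Written back to the same place it was read from.
        return {"inference": {KEY: self.arabic_diacritizer_model}}


# A voice that pins a non-default (made-up) model survives a round trip.
pinned = SketchVoiceConfig.from_dict({"inference": {KEY: "some-other-model"}})
restored = SketchVoiceConfig.from_dict(pinned.to_dict())

# A voice with no inference block gets the default.
fallback = SketchVoiceConfig.from_dict({})
```

Keeping the default in one constant means older voice configs that predate the key load unchanged, while new voices can opt into a different model.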