Adding Common Voice Preprocessing for Farsi #1997
Walkthrough
A new Persian-specific text normalization branch was added.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~7 minutes
Actionable comments posted: 0
🧹 Nitpick comments (1)
egs/commonvoice/ASR/prepare.sh (1)
45-45: Fix typo in comment and improve clarity.
The comment contains a typo and could be more specific about when manual intervention is needed.
-release=cv-corpus-12.0-2022-12-07 ## -> consider changing relaese name or download the file manually and move it to download folder.
+release=cv-corpus-12.0-2022-12-07 ## -> consider changing release name or download the file manually and move it to download folder.
Consider making the comment more specific about the circumstances requiring manual intervention:
-release=cv-corpus-12.0-2022-12-07 ## -> consider changing relaese name or download the file manually and move it to download folder.
+release=cv-corpus-12.0-2022-12-07 ## -> update release name as needed, or download manually if automatic download fails
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- egs/commonvoice/ASR/local/preprocess_commonvoice.py (1 hunks)
- egs/commonvoice/ASR/prepare.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (60)
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
🔇 Additional comments (1)
egs/commonvoice/ASR/local/preprocess_commonvoice.py (1)
55-59: LGTM! Well-implemented Persian text normalization.
The Persian text normalization logic is correctly implemented:
- Character replacements (Arabic yeh/kaf → Persian yeh/kaf) are linguistically accurate
- Unicode range filtering appropriately includes the Arabic block (\u0600-\u06FF) along with both ASCII digits (0-9) and Persian digits (\u06F0-\u06F9)
- Punctuation removal and whitespace normalization are appropriate for ASR preprocessing
- Correctly omits case conversion since Persian doesn't have uppercase/lowercase distinction
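The steps listed above can be sketched roughly as follows. This is a minimal illustration and not the code from the pull request: the function name normalize_persian and the module-level constants are hypothetical, and replacing disallowed characters with a space (rather than deleting them) is an assumption made here so that words separated only by punctuation do not get joined.

```python
import re

# Hypothetical mapping for illustration:
# Arabic yeh (U+064A) -> Persian yeh (U+06CC), Arabic kaf (U+0643) -> Persian kaf (U+06A9).
_ARABIC_TO_PERSIAN = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Anything outside the Arabic block, ASCII digits, and whitespace is disallowed.
# Persian digits (U+06F0-U+06F9) already fall inside the Arabic block.
_DISALLOWED = re.compile(r"[^\u0600-\u06FF0-9\s]")


def normalize_persian(text: str) -> str:
    """Sketch of Persian transcript normalization for ASR preprocessing."""
    # 1. Map Arabic letter variants to their Persian equivalents.
    text = text.translate(_ARABIC_TO_PERSIAN)
    # 2. Replace disallowed characters with a space (assumption: keeps words apart).
    text = _DISALLOWED.sub(" ", text)
    # 3. Collapse runs of whitespace and trim both ends.
    #    No case conversion is applied, since Persian script has no letter case.
    return re.sub(r"\s+", " ", text).strip()
```

One caveat with a pure range filter like this: Arabic-script punctuation such as the Arabic comma (U+060C) and Arabic question mark (U+061F) sits inside U+0600-U+06FF, so it would pass through unless it is removed in a separate step.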
This pull request introduces a new normalization rule for Persian text in the normalize_text function and adds a comment in the prepare.sh script regarding the release file handling. Below is a breakdown of the most important changes:
Text Normalization Enhancements:
- egs/commonvoice/ASR/local/preprocess_commonvoice.py: Added a new normalization rule for Persian (fa) text. This includes replacing Arabic characters with their Persian equivalents, removing unwanted characters, collapsing multiple spaces into one, and stripping leading/trailing spaces.
Script Comments Update:
- egs/commonvoice/ASR/prepare.sh: Added a comment suggesting either changing the release name or manually downloading and moving the file to the download folder for better clarity and handling.
Summary by CodeRabbit