feat: phoneme alignment — tests + docs (rebased onto dev)#125
Conversation
24 hermetic unit tests covering PhonemeAlignment, AudioChunk new fields, phoneme_ids_to_audio return shapes, hop_length scaling, synthesize() integration, and alignment reconstruction edge cases (length mismatch, alignment failure, empty text). docs/alignment.md: usage guide for include_alignments, AudioChunk fields, PhonemeAlignment, hop_length config, export CLI flag, and viseme/karaoke use-case examples. Linked from docs/README.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR implements optional phoneme-level alignment for synthesized audio. It extends the Voice API with alignment metadata, introduces hop_length configuration, adds ONNX export tooling to expose alignment tensors, and includes comprehensive tests and user documentation. ChangesPhoneme Alignment Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Ready for review! The automated tests have finished. ✅I've aggregated the results of the automated checks for this PR below. 🏷️ Release PreviewEnsuring the 'Contributors' list is sorted and complete. 👥 Current:
✅ PR title follows conventional commit format. 🚀 Release Channel Compatibility Predicted next version:
🔒 Security (pip-audit)I've scanned the dependencies for any hidden surprises. 🔍 ✅ No known vulnerabilities found (61 packages scanned). 📊 CoverageA forensic analysis of your test coverage. 🔍 ❌ 31.5% total coverage Files below 80% coverage (37 files)
Full report: download the 🔍 LintI've finished my task! Here's the data you need. 📊 ❌ ruff: issues found — see job log ⚖️ License CheckAuditing the legal lineage of this contribution. 📜 ❌ License violations detected (43 packages) — review required before merging. License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more Full breakdown — 43 packages
Copyright (c) 2022 Phil Ewels Permission is hereby granted, free of charge, to any person obtaining a copy The above copyright notice and this permission notice shall be included in all THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed. 🔨 Build TestsI've finished the digital carpentry on this PR. 🔨 ✅ All versions pass
Providing clarity through automated analysis 🔍 |
Three bugs fixed: 1. AttributeError crash: VoiceConfig has no phoneme_id_map attribute — the code referenced it but it does not exist on the dataclass. Fix: read blank_id/bos_id/eos_id and idx2char from the tokenizer's vocabulary, which is where the mapping actually lives. 2. get(phoneme, []) type error: missing phonemes returned [] (a list), not an int, so the id comparison always failed and alignments silently returned None for any phoneme not found in the map. Fix: the new walk uses idx2char (reverse map) so unknown ids become "?" strings instead of crashing or silently failing. 3. Incomplete blank handling: the old loop tried to re-derive the tokenizer's internal token sequence (bos/blank/eos insertion) manually, getting it wrong for any non-default blank_at_start/blank_at_end configuration. Fix: walk phoneme_ids + phoneme_id_samples directly in lockstep; fold blank/bos durations forward into the next real phoneme and eos backward into the last one. Works for all tokenizer configurations. Also removes the now-unused DEFAULT_PAD/BOS/EOS_TOKEN imports from voice.py. Tests: 10 new unit regression tests + 9 integration tests against a real eu-ES espeak model validating that total aligned samples == audio length. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
phoonnx/voice.py (1)
315-317: ⚡ Quick winType annotations claim
intbut tokenizer properties returnOptional[int].Lines 315-317 annotate
_blank_id,_bos_id, and_eos_idasint, but the tokenizer vocabulary properties can returnNonewhen the corresponding token is not defined (see context snippets showingblank_id(),bos_id(),eos_id()returnOptional[int]).The comparison logic at lines 357, 359 handles
Nonecorrectly (no match when the ID isNone), but the type hints are misleading.📝 Proposed fix
_tok = self.config.tokenizer _vocab = _tok.vocabulary _idx2char: dict = _vocab.idx2char - _blank_id: int = _tok.blank_id - _bos_id: int = _vocab.bos_id - _eos_id: int = _vocab.eos_id + _blank_id: Optional[int] = _tok.blank_id + _bos_id: Optional[int] = _vocab.bos_id + _eos_id: Optional[int] = _vocab.eos_id🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phoonnx/voice.py` around lines 315 - 317, The type hints for _blank_id, _bos_id, and _eos_id are incorrect: they are declared as int but _tok.blank_id, _vocab.bos_id and _vocab.eos_id return Optional[int]; update the annotations to Optional[int] (or None | int) for _blank_id, _bos_id, and _eos_id so the hints match the actual values and existing comparison logic (in the region around the checks that compare these IDs) remains correct.phoonnx/phoneme_ids.py (1)
378-457: ⚡ Quick winConsider moving the
__main__test harness to the test suite.The
__main__block contains valuable validation logic that compares this implementation against external reference implementations (mimic3_phonemes2idsandpiper_phonemes_to_ids). However, it relies on hardcoded file paths and external dependencies that may not be available in all environments.Moving this to
tests/test_phoneme_ids.pyas optional integration tests (skipped when dependencies are missing) would improve maintainability and make the validation reproducible in CI.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phoonnx/phoneme_ids.py` around lines 378 - 457, The inline __main__ test harness (starting at the module's bottom) uses hardcoded paths and external libs (phoneme_ids_path, phoneme_map_path, load_phoneme_ids, EspeakPhonemizer, phonemes_to_ids, mimic3_phonemes2ids, piper_phonemes_to_ids) and should be moved to a new test file; extract the logic into tests/test_phoneme_ids.py as pytest functions that mock or skip when dependencies/files are missing, replace hardcoded paths with temporary fixtures or package resources, import and call load_phoneme_ids/load_phoneme_map and the phonemizer inside tests (or mock EspeakPhonemizer), and remove or reduce the __main__ block to a minimal example or delete it so the module has no heavy side-effect code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@phoonnx/phoneme_ids.py`:
- Line 242: The assignment to bos_id incorrectly uses eos_token; update the
expression so bos_id is derived from bos_token (i.e., bos_id = bos_token if
isinstance(bos_token, int) ...), and keep the existing fallback to look up
token2id[bos_token] when bos_token is a string; modify the line that sets bos_id
(and ensure it parallels the eos_id logic) so BOS and EOS IDs are computed from
their respective inputs.
- Around line 335-341: The parsing in load_phoneme_ids is brittle: change the
logic to split on any whitespace and handle single-token numeric lines as
ID-only (whitespace phoneme) without relying on a literal space; specifically,
use a whitespace split (e.g., line.rstrip().split(None, 1)) to get tokens, then
if len(tokens) == 1 and tokens[0].isdigit() set phoneme_id = int(tokens[0]) and
phoneme_str = " " (or the intended whitespace marker), else if len(tokens) == 2
parse phoneme_id = int(tokens[0]) and phoneme_str = tokens[1]; remove the
phoneme_str.isdigit() swap so numeric phoneme tokens are not incorrectly
flipped. Ensure conversions to int are guarded/validated and refer to
load_phoneme_ids, phoneme_id, and phoneme_str when making the fixes.
---
Nitpick comments:
In `@phoonnx/phoneme_ids.py`:
- Around line 378-457: The inline __main__ test harness (starting at the
module's bottom) uses hardcoded paths and external libs (phoneme_ids_path,
phoneme_map_path, load_phoneme_ids, EspeakPhonemizer, phonemes_to_ids,
mimic3_phonemes2ids, piper_phonemes_to_ids) and should be moved to a new test
file; extract the logic into tests/test_phoneme_ids.py as pytest functions that
mock or skip when dependencies/files are missing, replace hardcoded paths with
temporary fixtures or package resources, import and call
load_phoneme_ids/load_phoneme_map and the phonemizer inside tests (or mock
EspeakPhonemizer), and remove or reduce the __main__ block to a minimal example
or delete it so the module has no heavy side-effect code.
In `@phoonnx/voice.py`:
- Around line 315-317: The type hints for _blank_id, _bos_id, and _eos_id are
incorrect: they are declared as int but _tok.blank_id, _vocab.bos_id and
_vocab.eos_id return Optional[int]; update the annotations to Optional[int] (or
None | int) for _blank_id, _bos_id, and _eos_id so the hints match the actual
values and existing comparison logic (in the region around the checks that
compare these IDs) remains correct.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 712a8b41-f132-45e7-ad83-7f7eefad9aeb
📒 Files selected for processing (7)
docs/README.mddocs/alignment.mdphoonnx/config.pyphoonnx/phoneme_ids.pyphoonnx/voice.pyphoonnx_train/export_onnx.pytests/test_alignment.py
| eos_id = eos_token if isinstance(eos_token, int) \ | ||
| else id_map.get(eos_token, [len(id_map)]) if eos_token \ | ||
| else [len(id_map)] | ||
| bos_id = eos_token if isinstance(bos_token, int) \ |
There was a problem hiding this comment.
Critical typo: bos_id assigned from eos_token instead of bos_token.
Line 242 reads bos_id = eos_token if isinstance(bos_token, int) ... but should use bos_token in the conditional expression. This causes BOS and EOS to have identical IDs when bos_token is passed as an integer or when looked up from the map.
🐛 Proposed fix
- bos_id = eos_token if isinstance(bos_token, int) \
- else id_map.get(bos_token, [len(id_map)]) if bos_token \
+ bos_id = bos_token if isinstance(bos_token, int) \
+ else id_map.get(bos_token, [len(id_map)]) if bos_token \
else [len(id_map)]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| bos_id = eos_token if isinstance(bos_token, int) \ | |
| bos_id = bos_token if isinstance(bos_token, int) \ | |
| else id_map.get(bos_token, [len(id_map)]) if bos_token \ | |
| else [len(id_map)] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@phoonnx/phoneme_ids.py` at line 242, The assignment to bos_id incorrectly
uses eos_token; update the expression so bos_id is derived from bos_token (i.e.,
bos_id = bos_token if isinstance(bos_token, int) ...), and keep the existing
fallback to look up token2id[bos_token] when bos_token is a string; modify the
line that sets bos_id (and ensure it parallels the eos_id logic) so BOS and EOS
IDs are computed from their respective inputs.
| if line.strip().isdigit(): # phoneme is whitespace | ||
| phoneme_str = " " | ||
| phoneme_id = int(line) | ||
| else: | ||
| phoneme_id, phoneme_str = line.split(" ", maxsplit=1) | ||
| if phoneme_str.isdigit(): | ||
| phoneme_id, phoneme_str = phoneme_str, phoneme_id |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Test load_phoneme_ids with various whitespace and format edge cases
python3 << 'EOF'
from io import StringIO
from phoonnx.phoneme_ids import load_phoneme_ids
# Test 1: whitespace phoneme (ID-only line)
test1 = "3\n"
result1 = load_phoneme_ids(StringIO(test1))
print("Test 1 (whitespace phoneme):", result1)
# Test 2: reversed format
test2 = "a 5\n"
result2 = load_phoneme_ids(StringIO(test2))
print("Test 2 (normal format):", result2)
# Test 3: what happens with "5 a" - should reverse?
test3 = "5 a\n"
result3 = load_phoneme_ids(StringIO(test3))
print("Test 3 (reversed format):", result3)
# Test 4: ID with trailing space before newline
test4 = "3 \n"
result4 = load_phoneme_ids(StringIO(test4))
print("Test 4 (ID + trailing space):", result4)
EOFRepository: TigreGotico/phoonnx
Length of output: 202
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the relevant section around the snippet in phoonnx/phoneme_ids.py
sed -n '280,380p' phoonnx/phoneme_ids.py
# Find where whitespace phoneme_str " " is handled/added or skipped
rg -n --fixed-string 'phoneme_str' phoonnx/phoneme_ids.py
rg -n --fixed-string '" "' phoonnx/phoneme_ids.py
rg -n --fixed-string 'isdigit()' phoonnx/phoneme_ids.py
# Show the full load_phoneme_ids function header + body region (best-effort)
python3 - <<'PY'
import inspect
import phoonnx.phoneme_ids as m
print(inspect.getsource(m.load_phoneme_ids))
PYRepository: TigreGotico/phoonnx
Length of output: 3735
🏁 Script executed:
#!/bin/bash
set -euo pipefail
nl -ba phoonnx/phoneme_ids.py | sed -n '300,380p'
python3 <<'PY'
from io import StringIO
from phoonnx.phoneme_ids import load_phoneme_ids
def run(name, s):
try:
out = load_phoneme_ids(StringIO(s))
print(f"{name}: {out}")
except Exception as e:
print(f"{name}: EXC {type(e).__name__}: {e}")
# whitespace-phoneme line variants
run("T1: '3\\n'", "3\n") # expected to be skipped by (" " not in line)
run("T2: '3 \\n'", "3 \n") # passes guard; should map ' ':3
run("T3: '3\\t\\n'", "3\t\n") # no literal space; should be skipped
run("T4: '3 \\n'", "3 \n") # literal spaces present; should map ' ':3
# reversed/regular formats
run("T5: 'a 5\\n'", "a 5\n") # should parse
run("T6: '5 a\\n'", "5 a\n") # should parse
# malformed / potentially crashing
run("T7: '3 a \\n'", "3 a \n") # trailing spaces in phoneme
run("T8: '3 `#x`\\n'", "3 `#x`\n") # int('3 `#x`') would crash if reached
run("T9: 'a 5 extra\\n'", "a 5 extra\n") # phoneme_str='5 extra' => not digit, stored as key
PYRepository: TigreGotico/phoonnx
Length of output: 106
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '300,380p' phoonnx/phoneme_ids.py
python3 <<'PY'
from io import StringIO
from phoonnx.phoneme_ids import load_phoneme_ids
def run(name, s):
try:
out = load_phoneme_ids(StringIO(s))
print(f"{name}: {out}")
except Exception as e:
print(f"{name}: EXC {type(e).__name__}: {e}")
# whitespace-phoneme line variants
run("T1 '3\\n'", "3\n") # no literal space in line => skipped
run("T2 '3 \\n'", "3 \n") # literal space => parses as phoneme ' '
run("T3 '3\\t\\n'", "3\t\n") # tab => no literal space => skipped
run("T4 '3 \\n'", "3 \n") # literal spaces => phoneme key ' '
# reversed/regular formats
run("T5 'a 5\\n'", "a 5\n")
run("T6 '5 a\\n'", "5 a\n")
# malformed / edge cases
run("T7 '3 a \\n'", "3 a \n") # phoneme_str includes spaces => key may include trailing spaces
run("T8 '3 `#x`\\n'", "3 `#x`\n") # currently would attempt int(...) if reached => likely crash
run("T9 'a 5 extra\\n'", "a 5 extra\n") # phoneme_str not just digits => stored key '5 extra'
PYRepository: TigreGotico/phoonnx
Length of output: 2783
Fix whitespace-phoneme ID-only parsing (and numeric-token swapping) in load_phoneme_ids
- The
line.strip().isdigit()“phoneme is whitespace” branch can’t run for digit-only lines like"3\n"because earlier the loader skips any line where" "(literal space) is not present; as a result"3\n"produced{}, while"3 \n"produced{' ': 3}(tabs after the ID are also skipped). - The
if phoneme_str.isdigit(): phoneme_id, phoneme_str = phoneme_str, phoneme_idswap can misparse numeric phoneme tokens: e.g."5 3\n"would be treated as reversed and inverted, even though the spec format isID PHONEME.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@phoonnx/phoneme_ids.py` around lines 335 - 341, The parsing in
load_phoneme_ids is brittle: change the logic to split on any whitespace and
handle single-token numeric lines as ID-only (whitespace phoneme) without
relying on a literal space; specifically, use a whitespace split (e.g.,
line.rstrip().split(None, 1)) to get tokens, then if len(tokens) == 1 and
tokens[0].isdigit() set phoneme_id = int(tokens[0]) and phoneme_str = " " (or
the intended whitespace marker), else if len(tokens) == 2 parse phoneme_id =
int(tokens[0]) and phoneme_str = tokens[1]; remove the phoneme_str.isdigit()
swap so numeric phoneme tokens are not incorrectly flipped. Ensure conversions
to int are guarded/validated and refer to load_phoneme_ids, phoneme_id, and
phoneme_str when making the fixes.
Replaces / supersedes #41 (
fix/alignment).Rebases the 3 phoneme-alignment commits from
fix/alignmentonto currentdev(which already has #124 merged), resolves the conflicts, then adds:
What's in this PR
From #41 (cherry-picked):
PhonemeAlignmentdataclass (phoneme,num_samples)AudioChunkextended withphonemes,phoneme_ids,phoneme_id_samples,phoneme_alignmentsTTSVoice.synthesize(include_alignments=False)— optional alignment reconstructionTTSVoice.phoneme_ids_to_audio(include_alignments=False)— returns(audio, samples)tuple when model has a second outputVoiceConfig.hop_length(default 256) — frame→sample conversionphoonnx_train export-onnx --add-phoneme-alignmentflagNew in this PR:
tests/test_alignment.py— 24 hermetic unit tests (no models, no network):PhonemeAlignmentfields and equalityAudioChunknew fields, optional defaultshop_lengthconstant,VoiceConfigfield default,from_dictparsingphoneme_ids_to_audioreturn shapes (ndarray vs tuple) and hop scalingsynthesizeintegration: fields None when not requested, populated when supporteddocs/alignment.md— usage guide coveringinclude_alignments,AudioChunkfields,PhonemeAlignment,hop_length, export CLI, viseme and karaoke examplesdocs/README.md— linked to new alignment page, added feature bulletLicense note (from #41)
The alignment logic is ported from piper (OHF-Voice/piper1-gpl commit eb6be6b).
@synesthesiam confirmed in the piper1-gpl discussions that piper code not using
espeak-ng is covered by Apache 2 — this PR does not use espeak-ng code.
Summary by CodeRabbit
Release Notes
New Features
Documentation