TigreGotico · JarbasAl · Oct 7, 2025 · May 30, 2026
diff --git a/docs/README.md b/docs/README.md
@@ -8,6 +8,7 @@
 - **Multilingual phonemization** — 30+ phonemizer backends for dozens of languages
 - **Multi-engine support** — load voices from Piper, Mimic3, Coqui, Transformers, and native phoonnx format
 - **Voice manager** — download and cache models from HuggingFace and other sources
+- **Phoneme alignment** — optional per-phoneme timing for visemes, lip-sync, and karaoke
 - **Training pipeline** — preprocess datasets and train new VITS voices (`phoonnx_train`)
 - **OVOS plugin** — drop-in TTS plugin for the OpenVoiceOS / Mycroft ecosystem
 
@@ -43,6 +44,7 @@ with wave.open("output.wav", "wb") as wav_file:
 
 - [Installation](installation.md)
 - [Usage Guide](usage.md)
+- [Phoneme Alignment](alignment.md)
 - [Voice Manager](voice_manager.md)
 - [Phonemizers](phonemizers.md)
 - [Configuration Reference](configuration.md)

diff --git a/docs/alignment.md b/docs/alignment.md
@@ -0,0 +1,155 @@
+# Phoneme Alignment
+
+Phoneme alignment gives you per-phoneme timing: how many audio samples each phoneme
+occupies in the synthesized output. This is the foundation for visemes (lip-sync),
+karaoke-style word highlighting, and subtitle generation.
+
+> **Model support required.** Alignment is an optional second output of the ONNX
+> model. Standard exported models do **not** include it. See
+> [Exporting a model with alignment support](#exporting-a-model-with-alignment-support)
+> below.
+
+---
+
+## Getting alignments from `synthesize()`
+
+Pass `include_alignments=True` to `TTSVoice.synthesize()`:
+
+```python
+from phoonnx.voice import TTSVoice
+
+voice = TTSVoice.load("model-aligned.onnx")
+
+for chunk in voice.synthesize("Hello world.", include_alignments=True):
+    print(f"phonemes : {chunk.phonemes}")
+    print(f"phoneme ids: {chunk.phoneme_ids}")
+
+    if chunk.phoneme_alignments:
+        for align in chunk.phoneme_alignments:
+            duration_ms = align.num_samples / chunk.sample_rate * 1000
+            print(f"  {align.phoneme!r:6s}  {duration_ms:6.1f} ms")
+    else:
+        # Model does not expose alignment output, or reconstruction failed
+        print("  (no alignment available)")
+```
+
+### `AudioChunk` alignment fields
+
+| Field | Type | Description |
+|---|---|---|
+| `phonemes` | `list[str]` | Phoneme tokens for this sentence |
+| `phoneme_ids` | `list[int]` | Integer IDs passed to the ONNX model |
+| `phoneme_id_samples` | `np.ndarray \| None` | Raw sample counts per phoneme ID (from model) |
+| `phoneme_alignments` | `list[PhonemeAlignment] \| None` | Reconstructed per-phoneme timings |
+
+`phoneme_alignments` is `None` when:
+- `include_alignments=False` (default)
+- The model has only one output (does not support alignment)
+- The alignment reconstruction fails (ID sequence mismatch)
+
+`phonemes` and `phoneme_ids` are always populated regardless of `include_alignments`.
+
+### `PhonemeAlignment`
+
+```python
+@dataclass
+class PhonemeAlignment:
+    phoneme: str      # the phoneme token, e.g. "h", "ɛ", "l"
+    num_samples: int  # number of PCM samples occupied by this phoneme
+```
+
+Convert `num_samples` to milliseconds: `num_samples / chunk.sample_rate * 1000`.
+
+---
+
+## Lower-level: `phoneme_ids_to_audio()`
+
+If you work at the phoneme-ID level, you can call `phoneme_ids_to_audio()` directly:
+
+```python
+audio_or_tuple = voice.phoneme_ids_to_audio(phoneme_ids, include_alignments=True)
+
+if isinstance(audio_or_tuple, tuple):
+    audio, phoneme_id_samples = audio_or_tuple
+    # phoneme_id_samples[i] is the sample count for phoneme_ids[i];
+    # phoneme_id_samples itself is None if the model has no alignment output
+else:
+    audio = audio_or_tuple  # include_alignments=False
+```
+
+---
+
+## `hop_length`
+
+The raw model output is a duration in frames; phoonnx converts frames to samples
+using `hop_length` (default **256**, matching the standard VITS vocoder hop size).
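+
+As a worked example of that conversion, here is a minimal sketch. The helper names
+`frames_to_samples` and `samples_to_ms` are hypothetical and not part of phoonnx,
+and the 22050 Hz sample rate is only an assumed value for illustration:
+
+```python
+DEFAULT_HOP_LENGTH = 256  # same default value as in phoonnx/config.py
+
+
+def frames_to_samples(frames: int, hop_length: int = DEFAULT_HOP_LENGTH) -> int:
+    """Convert a duration in model frames to a count of PCM samples (hypothetical helper)."""
+    return frames * hop_length
+
+
+def samples_to_ms(num_samples: int, sample_rate: int) -> float:
+    """Convert a sample count to milliseconds (hypothetical helper)."""
+    return num_samples / sample_rate * 1000
+
+
+samples = frames_to_samples(12)              # 12 frames * 256 = 3072 samples
+duration_ms = samples_to_ms(samples, 22050)  # roughly 139.3 ms at the assumed 22050 Hz
+```
+
+If `hop_length` does not match the value the model was trained with, every
+duration is scaled by the same wrong factor, so timings drift further from the
+audio as offsets accumulate.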
+
+Override it in the voice config JSON:
+
+```json
+{
+  "hop_length": 256
+}
+```
+
+Or via `VoiceConfig`:
+
+```python
+voice.config.hop_length = 256
+```
+
+---
+
+## Exporting a model with alignment support
+
+Standard VITS models expose only the audio tensor. To expose the phoneme-duration
+tensor as a second output, use the `--add-phoneme-alignment` flag when exporting:
+
+```bash
+phoonnx-train export-onnx checkpoint.ckpt -c config.json --add-phoneme-alignment
+```
+
+This modifies the exported `.onnx` graph to surface the `Ceil` node output (phoneme
+durations) as a named model output. The modification is done by
+`add_phoneme_alignment_output()` in `phoonnx_train/export_onnx.py`.
+
+You can also apply it post-hoc to an already-exported model:
+
+```python
+from phoonnx_train.export_onnx import add_phoneme_alignment_output
+from pathlib import Path
+
+add_phoneme_alignment_output(
+    model_path=Path("model.onnx"),
+    output_path=Path("model-aligned.onnx"),  # omit to overwrite in place
+    tensor_name="autodetect",                # or pass the tensor name explicitly
+)
+```
+
+> **Compatibility note.** Adding the alignment output may break third-party
+> frameworks (e.g. Piper) that expect a single output tensor. Keep a separate
+> copy of the model for standard TTS use.
+
+---
+
+## Use cases
+
+### Visemes / lip-sync
+
+Map each phoneme to a viseme index, then schedule face-rig blend-shape changes
+at the sample offset accumulated from `num_samples`:
+
+```python
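+# Partial example mapping; extend it with your own phoneme-to-viseme table
+VISEME = {"p": 0, "b": 0, "m": 0, "f": 1, "v": 1}
+
+offset = 0  # running position in samples
+for align in chunk.phoneme_alignments or []:
+    t = offset / chunk.sample_rate  # start time of this phoneme in seconds
+    viseme = VISEME.get(align.phoneme, -1)  # -1 marks an unmapped phoneme
+    schedule_viseme(t, viseme)  # your own scheduling callback
+    offset += align.num_samples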
+```
+
+### Karaoke / subtitle highlighting
+
+Accumulate sample offsets to get word-start timestamps, then use them to
+synchronise text highlights with audio playback.
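+
+The accumulation could look like the minimal sketch below. It assumes that word
+boundaries appear in the phoneme sequence as a `" "` token, which may not hold
+for every phonemizer. `word_start_times` is a hypothetical helper, and
+`PhonemeAlignment` is redefined here only to keep the example self-contained:
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class PhonemeAlignment:  # mirrors the dataclass documented above
+    phoneme: str
+    num_samples: int
+
+
+def word_start_times(alignments, sample_rate, separator=" "):
+    """Return the start time in seconds of each word (hypothetical helper).
+
+    Assumes `separator` phonemes mark word boundaries.
+    """
+    starts = []
+    offset = 0       # running position in samples
+    in_word = False
+    for align in alignments:
+        if align.phoneme == separator:
+            in_word = False
+        elif not in_word:
+            # first phoneme after a separator: a new word starts here
+            starts.append(offset / sample_rate)
+            in_word = True
+        offset += align.num_samples
+    return starts
+
+
+# Made-up durations for two words, "hi" and "jo", at an assumed 22050 Hz
+demo = [
+    PhonemeAlignment("h", 2205),
+    PhonemeAlignment("i", 4410),
+    PhonemeAlignment(" ", 2205),
+    PhonemeAlignment("j", 2205),
+    PhonemeAlignment("o", 2205),
+]
+starts = word_start_times(demo, sample_rate=22050)
+```
+
+Punctuation and padding tokens are treated as ordinary phonemes in this sketch,
+so you may want to filter them out before mapping the timestamps back to words.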
diff --git a/phoonnx/config.py b/phoonnx/config.py
@@ -10,6 +10,7 @@
 DEFAULT_NOISE_SCALE = 0.667
 DEFAULT_LENGTH_SCALE = 1.0
 DEFAULT_NOISE_W_SCALE = 0.8
+DEFAULT_HOP_LENGTH = 256
 
 
 class Engine(str, Enum):
@@ -121,6 +122,8 @@
     noise_w_scale: float = DEFAULT_NOISE_W_SCALE
     add_diacritics: bool = None # arabic and hebrew
 
+    hop_length: int = DEFAULT_HOP_LENGTH
+
     # tokenization settings
     tokenizer: Optional[TTSTokenizer] = None
     blank_at_start: bool = True
@@ -383,6 +386,7 @@
             length_scale=inference.get("length_scale", DEFAULT_LENGTH_SCALE),
             noise_w_scale=inference.get("noise_w", DEFAULT_NOISE_W_SCALE),
             add_diacritics=diacritics,
+            hop_length=config.get("hop_length", DEFAULT_HOP_LENGTH),
             lang_code=lang_code,
             alphabet=Alphabet(alphabet) if isinstance(alphabet, str) else alphabet,
             engine=Engine(engine) if isinstance(engine, str) else engine,
@@ -433,7 +437,7 @@

 def get_phonemizer(phoneme_type: PhonemeType,
                   alphabet: Alphabet = Alphabet.IPA,
                   model: Optional[str] = None) -> 'Phonemizer':
    """
    Create a phonemizer instance for the specified phonemeization strategy.