2 changes: 2 additions & 0 deletions docs/README.md
@@ -8,6 +8,7 @@
- **Multilingual phonemization** — 30+ phonemizer backends for dozens of languages
- **Multi-engine support** — load voices from Piper, Mimic3, Coqui, Transformers, and native phoonnx format
- **Voice manager** — download and cache models from HuggingFace and other sources
- **Phoneme alignment** — optional per-phoneme timing for visemes, lip-sync, and karaoke
- **Training pipeline** — preprocess datasets and train new VITS voices (`phoonnx_train`)
- **OVOS plugin** — drop-in TTS plugin for the OpenVoiceOS / Mycroft ecosystem

@@ -43,6 +44,7 @@ with wave.open("output.wav", "wb") as wav_file:

- [Installation](installation.md)
- [Usage Guide](usage.md)
- [Phoneme Alignment](alignment.md)
- [Voice Manager](voice_manager.md)
- [Phonemizers](phonemizers.md)
- [Configuration Reference](configuration.md)
155 changes: 155 additions & 0 deletions docs/alignment.md
@@ -0,0 +1,155 @@
# Phoneme Alignment

Phoneme alignment gives you per-phoneme timing: how many audio samples each phoneme
occupies in the synthesized output. This is the foundation for visemes (lip-sync),
karaoke-style word highlighting, and subtitle generation.

> **Model support required.** Alignment is an optional second output of the ONNX
> model. Standard exported models do **not** include it. See
> [Exporting a model with alignment support](#exporting-a-model-with-alignment-support)
> below.

---

## Getting alignments from `synthesize()`

Pass `include_alignments=True` to `TTSVoice.synthesize()`:

```python
from phoonnx.voice import TTSVoice

voice = TTSVoice.load("model-aligned.onnx")

for chunk in voice.synthesize("Hello world.", include_alignments=True):
    print(f"phonemes   : {chunk.phonemes}")
    print(f"phoneme ids: {chunk.phoneme_ids}")

    if chunk.phoneme_alignments:
        for align in chunk.phoneme_alignments:
            duration_ms = align.num_samples / chunk.sample_rate * 1000
            print(f"  {align.phoneme!r:6s} {duration_ms:6.1f} ms")
    else:
        # Model does not expose alignment output, or reconstruction failed
        print("  (no alignment available)")
```

### `AudioChunk` alignment fields

| Field | Type | Description |
|---|---|---|
| `phonemes` | `list[str]` | Phoneme tokens for this sentence |
| `phoneme_ids` | `list[int]` | Integer IDs passed to the ONNX model |
| `phoneme_id_samples` | `np.ndarray \| None` | Raw sample counts per phoneme ID (from model) |
| `phoneme_alignments` | `list[PhonemeAlignment] \| None` | Reconstructed per-phoneme timings |

`phoneme_alignments` is `None` when:
- `include_alignments=False` (default)
- The model has only one output (does not support alignment)
- The alignment reconstruction fails (ID sequence mismatch)

`phonemes` and `phoneme_ids` are always populated regardless of `include_alignments`.

### `PhonemeAlignment`

```python
@dataclass
class PhonemeAlignment:
    phoneme: str       # the phoneme token, e.g. "h", "ɛ", "l"
    num_samples: int   # number of PCM samples occupied by this phoneme
```

Convert `num_samples` to milliseconds: `num_samples / chunk.sample_rate * 1000`.
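
As a minimal sketch of that conversion, the helper below accumulates `num_samples` into start and end times in milliseconds. The `PhonemeAlignment` class here is a local stand-in that mirrors the dataclass above so the snippet runs without phoonnx; `phoneme_spans_ms` and the sample values are hypothetical and not part of the library.

```python
from dataclasses import dataclass


@dataclass
class PhonemeAlignment:
    # Local stand-in mirroring the documented dataclass (assumption: same two fields)
    phoneme: str
    num_samples: int


def phoneme_spans_ms(alignments, sample_rate):
    """Return (phoneme, start_ms, end_ms) for each alignment, in order."""
    spans = []
    offset = 0  # running sample offset from the start of the chunk
    for align in alignments:
        start_ms = offset / sample_rate * 1000
        offset += align.num_samples
        end_ms = offset / sample_rate * 1000
        spans.append((align.phoneme, start_ms, end_ms))
    return spans


# Illustrative values only, not real model output
example = [PhonemeAlignment("h", 2205), PhonemeAlignment("ɛ", 4410)]
spans = phoneme_spans_ms(example, sample_rate=22050)
for phoneme, start_ms, end_ms in spans:
    print(f"{phoneme!r:6s} {start_ms:7.1f} -> {end_ms:7.1f} ms")
```

With a real chunk you would pass `chunk.phoneme_alignments or []` and `chunk.sample_rate` instead of the illustrative values.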

---

## Lower-level: `phoneme_ids_to_audio()`

If you are working at the phoneme-ID level you can call the method directly:

```python
audio_or_tuple = voice.phoneme_ids_to_audio(phoneme_ids, include_alignments=True)

if isinstance(audio_or_tuple, tuple):
    audio, phoneme_id_samples = audio_or_tuple
    # phoneme_id_samples[i] = samples for phoneme_ids[i], or None if unsupported
else:
    audio = audio_or_tuple  # include_alignments=False
```

---

## `hop_length`

The raw model output is a duration in frames; phoonnx converts frames to samples
using `hop_length` (default **256**, matching the standard VITS vocoder hop size).

Override it in the voice config JSON:

```json
{
"hop_length": 256
}
```

Or via `VoiceConfig`:

```python
voice.config.hop_length = 256
```
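
The arithmetic behind this setting can be sketched as follows, assuming the conversion is a plain multiplication of each frame count by `hop_length`. The function name `frames_to_samples`, the frame counts, and the 22050 Hz sample rate are illustrative assumptions; the actual phoonnx implementation may differ in detail.

```python
DEFAULT_HOP_LENGTH = 256  # same default value as DEFAULT_HOP_LENGTH in phoonnx/config.py


def frames_to_samples(frame_durations, hop_length=DEFAULT_HOP_LENGTH):
    """Convert per-phoneme-ID durations in frames to durations in samples.

    Assumption: one frame corresponds to exactly `hop_length` audio samples.
    """
    return [int(frames) * hop_length for frames in frame_durations]


frame_durations = [3, 10, 7]  # hypothetical model output, in frames
sample_counts = frames_to_samples(frame_durations)

sample_rate = 22050  # assumed for illustration
durations_ms = [n / sample_rate * 1000 for n in sample_counts]

for frames, samples, ms in zip(frame_durations, sample_counts, durations_ms):
    print(f"{frames:3d} frames -> {samples:5d} samples ({ms:.1f} ms)")
```

If a model was trained with a different hop size, timings computed with the default would be scaled by the ratio of the two values, which is why the setting is configurable.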

---

## Exporting a model with alignment support

Standard VITS models expose only the audio tensor. To expose the phoneme-duration
tensor as a second output, use the `--add-phoneme-alignment` flag when exporting:

```bash
phoonnx-train export-onnx checkpoint.ckpt -c config.json --add-phoneme-alignment
```

This modifies the exported `.onnx` graph to surface the `Ceil` node output (phoneme
durations) as a named model output. The modification is done by
`add_phoneme_alignment_output()` in `phoonnx_train/export_onnx.py`.

You can also apply it post-hoc to an already-exported model:

```python
from phoonnx_train.export_onnx import add_phoneme_alignment_output
from pathlib import Path

add_phoneme_alignment_output(
    model_path=Path("model.onnx"),
    output_path=Path("model-aligned.onnx"),  # omit to overwrite in place
    tensor_name="autodetect",                # or pass the tensor name explicitly
)
```

> **Compatibility note.** Adding the alignment output may break third-party
> frameworks (e.g. Piper) that expect a single output tensor. Keep a separate
> copy of the model for standard TTS use.

---

## Use cases

### Visemes / lip-sync

Map each phoneme to a viseme index, then schedule face-rig blend-shape changes
at the sample offset accumulated from `num_samples`:

```python
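# Partial example mapping; extend it with your own phoneme-to-viseme table
VISEME = {"p": 0, "b": 0, "m": 0, "f": 1, "v": 1}

offset = 0
for align in chunk.phoneme_alignments or []:
    t = offset / chunk.sample_rate          # start time of this phoneme, in seconds
    viseme = VISEME.get(align.phoneme, -1)  # -1 for phonemes without a mapping
    schedule_viseme(t, viseme)              # your own scheduling callback
    offset += align.num_samples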
```

### Karaoke / subtitle highlighting

Accumulate sample offsets to get word-start timestamps, then use them to
synchronise text highlights with audio playback.
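
A minimal sketch of the accumulation step, assuming that word boundaries appear in the phoneme sequence as a separator token (a single space here). That separator, the `word_start_times` helper, and the sample values are assumptions for illustration; check which boundary tokens your phonemizer actually emits. `PhonemeAlignment` is a local stand-in for the documented dataclass.

```python
from dataclasses import dataclass


@dataclass
class PhonemeAlignment:
    # Local stand-in mirroring the documented dataclass
    phoneme: str
    num_samples: int


def word_start_times(alignments, sample_rate, separator=" "):
    """Return the start time, in seconds, of each word in the chunk.

    Assumption: words are delimited by `separator` tokens in the phoneme stream.
    """
    starts = []
    offset = 0       # running sample offset
    in_word = False  # True while we are inside a word
    for align in alignments:
        if align.phoneme == separator:
            in_word = False
        elif not in_word:
            # First phoneme after a separator (or at the start) opens a new word
            starts.append(offset / sample_rate)
            in_word = True
        offset += align.num_samples
    return starts


# Illustrative values only, not real model output
example = [
    PhonemeAlignment("h", 1000),
    PhonemeAlignment("i", 1000),
    PhonemeAlignment(" ", 500),
    PhonemeAlignment("j", 2000),
    PhonemeAlignment("u", 1000),
]
starts = word_start_times(example, sample_rate=1000)
print(starts)
```

The resulting timestamps can be paired with the words of the input sentence, provided the number of detected words matches the number of words in the text.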
4 changes: 4 additions & 0 deletions phoonnx/config.py
@@ -10,6 +10,7 @@
DEFAULT_NOISE_SCALE = 0.667
DEFAULT_LENGTH_SCALE = 1.0
DEFAULT_NOISE_W_SCALE = 0.8
DEFAULT_HOP_LENGTH = 256


class Engine(str, Enum):
@@ -121,6 +122,8 @@
noise_w_scale: float = DEFAULT_NOISE_W_SCALE
add_diacritics: bool = None # arabic and hebrew

hop_length: int = DEFAULT_HOP_LENGTH

# tokenization settings
tokenizer: Optional[TTSTokenizer] = None
blank_at_start: bool = True
@@ -383,6 +386,7 @@
length_scale=inference.get("length_scale", DEFAULT_LENGTH_SCALE),
noise_w_scale=inference.get("noise_w", DEFAULT_NOISE_W_SCALE),
add_diacritics=diacritics,
hop_length=config.get("hop_length", DEFAULT_HOP_LENGTH),
lang_code=lang_code,
alphabet=Alphabet(alphabet) if isinstance(alphabet, str) else alphabet,
engine=Engine(engine) if isinstance(engine, str) else engine,
@@ -433,7 +437,7 @@

def get_phonemizer(phoneme_type: PhonemeType,
alphabet: Alphabet = Alphabet.IPA,
model: Optional[str] = None) -> 'Phonemizer':

Check failure on line 440 in phoonnx/config.py (GitHub Actions / lint / lint)
ruff (F821): phoonnx/config.py:440:53: F821 Undefined name `Phonemizer`
"""
Create a phonemizer instance for the specified phonemeization strategy.
