fix: prevent sample loss in chunked_decode when decoder returns fewer frames by haosenwang1018 · Pull Request #234 · QwenLM/Qwen3-TTS

haosenwang1018 · 2026-02-24T18:24:28Z

Problem

chunked_decode() drops audio samples at chunk boundaries, causing:

- Audible popping sounds between chunks
- Shorter output audio (e.g. 8s instead of 11s with chunk_size=1)

The bug is in the context removal logic: wav_chunk[..., context_size * total_upsample :] trims from the front, but when the decoder returns fewer samples than codes_chunk_length * total_upsample, this over-trims into actual audio data.

Fixes #223

Fix

Instead of trimming a fixed number of samples from the front, calculate the expected sample count for the current chunk ((end_index - start_index) * total_upsample) and take that many samples from the end. This correctly excludes the left context regardless of the actual decoded length.

# Before (buggy):
wavs.append(wav_chunk[..., context_size * self.total_upsample :])

# After (fixed):
sample_count = min((end_index - start_index) * self.total_upsample, wav_chunk.shape[-1])
wavs.append(wav_chunk[..., -sample_count:])
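
To make the difference between the two slices concrete, here is a minimal, self-contained sketch in plain Python. It is not the repository's code: `toy_decode`, `toy_chunked_decode`, and the `shortfall` parameter are hypothetical stand-ins, and the surrounding loop (including how the left context is chosen) is an assumption about how chunked_decode is structured. The toy decoder emits `upsample` samples per code but drops a few leading samples, which models a decoder that returns fewer samples than codes_chunk_length * total_upsample.

```python
def toy_decode(codes, upsample, shortfall):
    """Hypothetical decoder: `upsample` samples per code, minus `shortfall` leading samples."""
    samples = [code for code in codes for _ in range(upsample)]
    return samples[shortfall:]


def toy_chunked_decode(codes, chunk_size, left_context, upsample, shortfall, take_from_end):
    """Decode chunk by chunk with left context (loop structure is an assumption)."""
    out = []
    start_index = 0
    while start_index < len(codes):
        end_index = min(start_index + chunk_size, len(codes))
        # Use as much left context as is available, up to `left_context` codes.
        context_size = left_context if start_index - left_context > 0 else start_index
        wav_chunk = toy_decode(codes[start_index - context_size:end_index], upsample, shortfall)
        if take_from_end:
            # Fixed behaviour: keep the expected number of samples, counted from the end.
            sample_count = min((end_index - start_index) * upsample, len(wav_chunk))
            out.extend(wav_chunk[-sample_count:])
        else:
            # Buggy behaviour: trim a fixed context length from the front.
            out.extend(wav_chunk[context_size * upsample:])
        start_index = end_index
    return out


codes = list(range(6))
front = toy_chunked_decode(codes, 1, 2, 4, 1, take_from_end=False)
tail = toy_chunked_decode(codes, 1, 2, 4, 1, take_from_end=True)
print(len(front))  # 18: every chunk loses one sample to the front trim
print(len(tail))   # 23: same as decoding all six codes in one pass
```

In this toy setup the front trim loses one sample per chunk, while taking samples from the end reproduces the single-pass output exactly. The real decoder's shortfall may be distributed differently, so treat the numbers as illustrative only.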

Reproduction

Set chunk_size=1 in the tokenizer encode/decode example from the README. Before this fix, the output is significantly shorter with audible pops.

The speaker_encoder weights were explicitly deleted from the state dict before saving checkpoints (lines 150-153). When resuming training from a checkpoint, model.speaker_encoder becomes None, causing a crash on the first forward pass. Keep speaker_encoder in checkpoints so that training can resume correctly. Users who want smaller inference-only models can strip these weights separately. Fixes QwenLM#204 (bug 1)

The extract_mels method crashes with an assertion error when audio is not exactly 24kHz. Since librosa is already imported and used for loading, automatically resample to 24kHz instead of failing. This gives users a clear, working path without requiring manual audio preprocessing. librosa.resample uses high-quality resampling by default. Related to QwenLM#204 (bug 2)

… frames When the decoder returns fewer samples than expected for a chunk, the context removal (trimming context_size * total_upsample from the front) can over-trim into actual audio data. This causes audible pops at chunk boundaries and shorter output files. Fix: instead of trimming from the front, calculate the expected sample count for the current chunk and take that many samples from the end. This correctly excludes the context regardless of the actual decoded length. Fixes QwenLM#223

- QwenLM#234: Fix sample loss in chunked_decode. The context removal logic over-trims audio at chunk boundaries, causing shorter output and audible pops. Take the expected sample count from the end instead of trimming a fixed amount from the front.
- QwenLM#126: Fix tuple item assignment in _normalize_audio_inputs. Mutating a tuple element silently fails, breaking stereo-to-mono conversion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
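
The tuple issue mentioned for QwenLM#126 can be illustrated with a small sketch. The body of _normalize_audio_inputs is not shown here, so `normalize_audio_input` and its input layout (a `(waveform, sample_rate)` tuple whose waveform is a list of channels) are hypothetical. The general point is that tuples are immutable in Python, so a converted waveform has to be returned in a new tuple rather than assigned back into the old one.

```python
def normalize_audio_input(item):
    """Hypothetical helper: downmix a (waveform, sample_rate) tuple to mono.

    `waveform` is assumed to be a list of channels, each a list of floats.
    """
    waveform, sample_rate = item
    if len(waveform) > 1:
        # Average the channels frame by frame to get a single mono channel.
        mono = [sum(frame) / len(frame) for frame in zip(*waveform)]
        waveform = [mono]
    # Build a new tuple; assigning to item[0] would raise TypeError.
    return (waveform, sample_rate)


stereo = ([[1.0, 3.0], [3.0, 5.0]], 24000)
mono_item = normalize_audio_input(stereo)
print(mono_item)  # ([[2.0, 4.0]], 24000)
```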

haosenwang1018 added 3 commits February 24, 2026 18:21

haosenwang1018 wants to merge 3 commits into QwenLM:main from haosenwang1018:fix/chunked-decode-sample-loss
