
fix: prevent sample loss in chunked_decode when decoder returns fewer frames#234

Open
haosenwang1018 wants to merge 3 commits into QwenLM:main from haosenwang1018:fix/chunked-decode-sample-loss

Conversation

@haosenwang1018

Problem

chunked_decode() drops audio samples at chunk boundaries, causing:

  • Audible popping sounds between chunks
  • Shorter output audio (e.g. 8s instead of 11s with chunk_size=1)

The bug is in the context removal logic: wav_chunk[..., context_size * total_upsample :] trims from the front, but when the decoder returns fewer samples than codes_chunk_length * total_upsample, this over-trims into actual audio data.

Fixes #223

Fix

Instead of trimming a fixed number of samples from the front, calculate the expected sample count for the current chunk ((end_index - start_index) * total_upsample) and take that many samples from the end. This correctly excludes the left context regardless of the actual decoded length.

# Before (buggy):
wavs.append(wav_chunk[..., context_size * self.total_upsample :])

# After (fixed):
sample_count = min((end_index - start_index) * self.total_upsample, wav_chunk.shape[-1])
wavs.append(wav_chunk[..., -sample_count:])
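The over-trim can be seen concretely in a standalone numeric sketch. The names total_upsample, context_size, start_index, and end_index mirror the PR; the values are made up for illustration, and a NumPy array stands in for the decoder output:

```python
# Hypothetical demonstration of the over-trim bug (not the actual
# Qwen3-TTS code): simulate a decoder that returns fewer samples than
# codes_chunk_length * total_upsample for a chunk.
import numpy as np

total_upsample = 480               # assumed upsample factor (illustrative)
context_size = 2                   # left-context codes prepended to the chunk
start_index, end_index = 10, 11    # this chunk covers one code frame

# The decoder was fed context_size + (end_index - start_index) = 3 codes,
# but returns fewer than 3 * total_upsample samples (e.g. edge effects).
wav_chunk = np.arange(2 * total_upsample + 100, dtype=np.float32)

# Before (buggy): a fixed front trim eats into real audio.
buggy = wav_chunk[context_size * total_upsample:]
print(len(buggy))    # 100 -- far fewer than the 480 samples this chunk owns

# After (fixed): take the expected sample count from the end.
sample_count = min((end_index - start_index) * total_upsample,
                   wav_chunk.shape[-1])
fixed = wav_chunk[-sample_count:]
print(len(fixed))    # 480 -- the full chunk, context correctly excluded
```

Because the kept samples are counted from the end, any shortfall in the decoder's output is absorbed by the context region rather than by the chunk's own audio.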

Reproduction

Set chunk_size=1 in the tokenizer encode/decode example from the README. Before this fix, the output is significantly shorter with audible pops.

The speaker_encoder weights were explicitly deleted from the state dict
before saving checkpoints (lines 150-153). When resuming training from
a checkpoint, model.speaker_encoder becomes None, causing a crash on
the first forward pass.

Keep speaker_encoder in checkpoints so that training can resume
correctly. Users who want smaller inference-only models can strip
these weights separately.

Fixes QwenLM#204 (bug 1)
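The failure mode above can be sketched with plain dictionaries standing in for a PyTorch state dict; the key names here are illustrative, not the repo's actual parameter names:

```python
# Plain-dict sketch of the checkpoint issue (keys are illustrative).
state_dict = {
    "decoder.weight": [0.1, 0.2],
    "speaker_encoder.proj.weight": [0.3],
    "speaker_encoder.proj.bias": [0.4],
}

# Old behavior: speaker_encoder weights deleted before saving, so a
# model resumed from this checkpoint has no speaker_encoder parameters
# and crashes on the first forward pass.
stripped_ckpt = {k: v for k, v in state_dict.items()
                 if not k.startswith("speaker_encoder.")}

# Fixed behavior: training checkpoints keep everything; stripping is a
# separate, optional step for inference-only exports.
training_ckpt = dict(state_dict)
inference_ckpt = {k: v for k, v in state_dict.items()
                  if not k.startswith("speaker_encoder.")}

print("speaker_encoder.proj.weight" in training_ckpt)    # True
print("speaker_encoder.proj.weight" in inference_ckpt)   # False
```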
The extract_mels method crashes with an assertion error when audio is not
exactly 24kHz. Since librosa is already imported and used for loading,
automatically resample to 24kHz instead of failing.

This gives users a clear, working path without requiring manual audio
preprocessing. librosa.resample uses high-quality resampling by default.

Related to QwenLM#204 (bug 2)
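The commit itself calls librosa.resample with target_sr=24000; as a stand-in, the length math can be sketched with simple linear interpolation. This is illustrative only and not the high-quality resampler librosa uses:

```python
# NumPy stand-in for the resample-to-24kHz step (illustrative; the
# actual commit uses librosa.resample).
import numpy as np

TARGET_SR = 24000

def resample_to_24k(y: np.ndarray, sr: int) -> np.ndarray:
    """Linear-interpolation sketch of resampling mono audio to 24 kHz."""
    if sr == TARGET_SR:
        return y  # already 24 kHz; nothing to do
    n_out = int(round(len(y) * TARGET_SR / sr))
    x_old = np.linspace(0.0, 1.0, num=len(y), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, y)

# One second of 16 kHz audio becomes one second of 24 kHz audio.
y16 = np.zeros(16000, dtype=np.float32)
print(len(resample_to_24k(y16, 16000)))   # 24000
```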
fix: prevent sample loss in chunked_decode when decoder returns fewer frames

When the decoder returns fewer samples than expected for a chunk, the
context removal (trimming context_size * total_upsample from the front)
can over-trim into actual audio data. This causes audible pops at chunk
boundaries and shorter output files.

Fix: instead of trimming from the front, calculate the expected sample
count for the current chunk and take that many samples from the end.
This correctly excludes the context regardless of the actual decoded
length.

Fixes QwenLM#223
guybrush1984 added a commit to guybrush1984/Qwen3-TTS that referenced this pull request Mar 4, 2026
- QwenLM#234: Fix sample loss in chunked_decode — the context removal logic
  over-trims audio at chunk boundaries, causing shorter output and
  audible pops. Take expected sample count from the end instead of
  trimming a fixed amount from the front.

- QwenLM#126: Fix tuple item assignment in _normalize_audio_inputs — mutating
  a tuple element silently fails, breaking stereo-to-mono conversion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
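The tuple pitfall mentioned for QwenLM#126 can be shown in isolation. The function names and the stereo-to-mono body below are illustrative guesses at the pattern, not the repo's actual _normalize_audio_inputs:

```python
# Illustrative sketch of the tuple pitfall (not the repo's actual code).
def to_mono_buggy(audios):
    # Rebinding the loop variable never touches the container, so the
    # mixdown is silently discarded; tuples would also reject item
    # assignment (audios[i] = ...) with a TypeError.
    for wav in audios:
        wav = sum(wav) / len(wav)
    return audios

def to_mono_fixed(audios):
    # Build a new list instead of mutating the (possibly tuple) input.
    return [sum(wav) / len(wav) for wav in audios]

stereo = ((1.0, 3.0), (2.0, 4.0))          # two "stereo" sample pairs
print(to_mono_buggy(stereo))    # ((1.0, 3.0), (2.0, 4.0)) -- unchanged
print(to_mono_fixed(stereo))    # [2.0, 3.0]
```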


Development

Successfully merging this pull request may close these issues.

chunked_decode issue, loss of samples between chunks
