fix: prevent sample loss in chunked_decode when decoder returns fewer frames #234
Open
haosenwang1018 wants to merge 3 commits into QwenLM:main
The speaker_encoder weights were explicitly deleted from the state dict before saving checkpoints (lines 150-153). When resuming training from a checkpoint, model.speaker_encoder becomes None, causing a crash on the first forward pass. Keep speaker_encoder in checkpoints so that training can resume correctly. Users who want smaller inference-only models can strip these weights separately. Fixes QwenLM#204 (bug 1)
The extract_mels method crashes with an assertion error when audio is not exactly 24kHz. Since librosa is already imported and used for loading, automatically resample to 24kHz instead of failing. This gives users a clear, working path without requiring manual audio preprocessing. librosa.resample uses high-quality resampling by default. Related to QwenLM#204 (bug 2)
… frames When the decoder returns fewer samples than expected for a chunk, the context removal (trimming context_size * total_upsample from the front) can over-trim into actual audio data. This causes audible pops at chunk boundaries and shorter output files. Fix: instead of trimming from the front, calculate the expected sample count for the current chunk and take that many samples from the end. This correctly excludes the context regardless of the actual decoded length. Fixes QwenLM#223
guybrush1984 added a commit to guybrush1984/Qwen3-TTS that referenced this pull request (Mar 4, 2026):
- QwenLM#234: Fix sample loss in chunked_decode. The context removal logic over-trims audio at chunk boundaries, causing shorter output and audible pops. Take the expected sample count from the end instead of trimming a fixed amount from the front.
- QwenLM#126: Fix tuple item assignment in _normalize_audio_inputs. Mutating a tuple element silently fails, breaking stereo-to-mono conversion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Problem
chunked_decode() drops audio samples at chunk boundaries, causing audible pops and shorter output (reproducible with chunk_size=1).

The bug is in the context removal logic: wav_chunk[..., context_size * total_upsample :] trims from the front, but when the decoder returns fewer samples than codes_chunk_length * total_upsample, this over-trims into actual audio data.

Fixes #223
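To make the over-trimming concrete, here is a minimal pure-Python sketch of the front trim applied to a single chunk. The numbers (upsample factor, context length, size of the shortfall) are hypothetical and chosen only for illustration; plain lists stand in for the audio tensor.

```python
def trim_front(wav_chunk, context_size, total_upsample):
    """Front trim as described above: drop a fixed-length context prefix."""
    return wav_chunk[context_size * total_upsample:]

# Hypothetical values, not taken from the repository.
total_upsample = 4   # samples produced per code frame
context_size = 2     # left-context frames prepended to the chunk
new_frames = 3       # frames that actually belong to this chunk

expected_total = (context_size + new_frames) * total_upsample  # 20 samples

full = list(range(expected_total))        # decoder returned every sample
short = list(range(expected_total - 3))   # decoder returned 3 samples fewer

kept_full = trim_front(full, context_size, total_upsample)
kept_short = trim_front(short, context_size, total_upsample)

# The chunk should contribute new_frames * total_upsample = 12 samples.
print(len(kept_full))   # 12
print(len(kept_short))  # 9: the fixed front trim removed 3 samples of real audio
```

The trim length is fixed, so any shortfall in the decoded chunk comes out of the audio that should have been kept.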
Fix
Instead of trimming a fixed number of samples from the front, calculate the expected sample count for the current chunk, (end_index - start_index) * total_upsample, and take that many samples from the end. This correctly excludes the left context regardless of the actual decoded length.

Reproduction
Set chunk_size=1 in the tokenizer encode/decode example from the README. Before this fix, the output is significantly shorter with audible pops.
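The difference between the two strategies can be simulated without the real model. The sketch below is not the repository's implementation: fake_decode and chunked_decode_sketch are hypothetical stand-ins, and the assumption that the decoder drops a few leading samples whenever left context is present is made purely so the effect is visible. Each output sample is its own global index, which makes lost samples easy to spot.

```python
def fake_decode(codes, total_upsample, shortfall):
    """Hypothetical decoder: total_upsample samples per code frame,
    with `shortfall` leading samples missing (an assumption for this demo)."""
    samples = [c * total_upsample + k for c in codes for k in range(total_upsample)]
    return samples[shortfall:]

def chunked_decode_sketch(num_frames, chunk_size, left_context,
                          total_upsample, shortfall, take_from_end):
    out = []
    start_index = 0
    while start_index < num_frames:
        end_index = min(start_index + chunk_size, num_frames)
        context_size = min(left_context, start_index)
        codes_chunk = list(range(start_index - context_size, end_index))
        # Assume the shortfall only appears when left context is present.
        wav_chunk = fake_decode(codes_chunk, total_upsample,
                                shortfall if context_size else 0)
        if take_from_end:
            # Keep the expected number of samples, counted from the end.
            expected = (end_index - start_index) * total_upsample
            out.extend(wav_chunk[-expected:])
        else:
            # Drop a fixed-length context prefix from the front.
            out.extend(wav_chunk[context_size * total_upsample:])
        start_index = end_index
    return out

front = chunked_decode_sketch(6, 1, 2, 4, 3, take_from_end=False)
tail = chunked_decode_sketch(6, 1, 2, 4, 3, take_from_end=True)

print(len(front))  # 9 of the 24 expected samples survive
print(len(tail))   # 24
```

With chunk_size=1 almost every chunk carries left context, so the front trim loses samples on nearly every iteration, which is consistent with the much shorter output described in the reproduction steps.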