Streaming inference implementation for Qwen3-TTS that the official repo doesn't provide.
The official team advertises "Extreme Low-Latency Streaming Generation" in the paper and marketing, but the actual streaming code was never released - users are instead pointed to vLLM-Omni, which still does not support online serving.
This fork adds real streaming generation directly to the qwen-tts package.
In addition to real streaming, this fork delivers an ~6x inference speedup over upstream qwen-tts, in both streaming and non-streaming generation.
- `stream_generate_pcm()` - real-time PCM audio streaming
- `stream_generate_voice_clone()` - streaming with voice cloning
See `examples/` for usage.
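A minimal consumer pattern for chunked PCM output is sketched below. The stub generator stands in for `stream_generate_pcm()` - the real call, model loading, and its parameters are not shown here and should be taken from `examples/`; only the chunk-consuming side is illustrated.

```python
import io
import struct
import wave

SAMPLE_RATE = 24000  # assumed output sample rate; check the model config

def fake_stream_generate_pcm(n_chunks=3, chunk_samples=4000):
    """Stub standing in for the fork's stream_generate_pcm():
    yields raw 16-bit mono PCM chunks as they are 'generated'."""
    for _ in range(n_chunks):
        # silence here; a real chunk holds synthesized speech samples
        yield struct.pack(f"<{chunk_samples}h", *([0] * chunk_samples))

def consume_to_wav(pcm_chunks, path_or_buf, sample_rate=SAMPLE_RATE):
    """Write chunks to a WAV file as they arrive; a real consumer could
    instead push each chunk to an audio device or over a websocket."""
    total_samples = 0
    with wave.open(path_or_buf, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit PCM
        wf.setframerate(sample_rate)
        for chunk in pcm_chunks:
            wf.writeframes(chunk)
            total_samples += len(chunk) // 2
    return total_samples

buf = io.BytesIO()
n = consume_to_wav(fake_stream_generate_pcm(), buf)
print(n)  # 12000 samples = 0.5 s at 24 kHz
```

The same loop works for `stream_generate_voice_clone()`, since both yield audio incrementally rather than returning one final waveform.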
Note: torch versions differ between Linux/Windows due to available flash_attn prebuilt wheels.
Linux:

```bash
sudo apt install sox libsox-fmt-all
```

Windows:

```bash
# Download SoX from https://sourceforge.net/projects/sox/ and add it to PATH
```

```bash
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
```

Linux:

```bash
pip install torch==2.9.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.6.8/flash_attn-2.8.3%2Bcu130torch2.9-cp312-cp312-linux_x86_64.whl
```

Windows:

```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu130
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3%2Bcu130torch2.10-cp312-cp312-win_amd64.whl
pip install -U "triton-windows<3.7"
```

```bash
git clone https://github.com/dffdeeq/Qwen3-TTS-streaming.git
cd Qwen3-TTS-streaming
```
```bash
pip install -e .
```

| Parameter | Default | Description |
|---|---|---|
| `emit_every_frames` | 4 | Emit audio every N frames (~0.33s at 12Hz) |
| `decode_window_frames` | 80 | Decoder context window |
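How these defaults translate into chunk latency and decoder context, using the 12Hz frame rate stated in the table (whether the two names are passed as keyword arguments to the streaming functions is an assumption; check the function signatures in this fork):

```python
FRAME_RATE_HZ = 12            # codec frame rate from the table above
emit_every_frames = 4         # default: emit an audio chunk every 4 frames
decode_window_frames = 80     # default: decoder attends to the last 80 frames

chunk_seconds = emit_every_frames / FRAME_RATE_HZ
window_seconds = decode_window_frames / FRAME_RATE_HZ
print(f"{chunk_seconds:.2f}s per emitted chunk")    # ~0.33s
print(f"{window_seconds:.2f}s of decoder context")  # ~6.67s
```

Lowering `emit_every_frames` reduces per-chunk latency at the cost of more decode calls; a larger `decode_window_frames` gives the decoder more context per call but more work per chunk.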
From the official Qwen3-TTS README:

> Now only offline inference is supported. Online serving will be supported later.
This fork provides streaming now, without waiting for vLLM-Omni updates.
Based on QwenLM/Qwen3-TTS