Implementation for the paper "Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation".
An end-to-end retrieval-augmented generation (E2E RAG) speech dialogue system that enables direct speech-to-text generation with retrieval, bypassing traditional ASR pipelines. Built on top of GLM-4-Voice and SONAR for cross-modal, low-latency interaction.
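At a high level, one dialogue turn in this setup can be sketched as follows; the encoder and model handles (`speech_encoder`, `text_encoder`, `glm_voice`) are illustrative placeholders, not the project's actual API:

```python
import numpy as np

def e2e_rag_turn(speech_query, passages, speech_encoder, text_encoder, glm_voice, top_k=3):
    """One illustrative turn: cross-modal retrieval in a shared embedding space,
    then speech generation conditioned on the retrieved text (placeholder API)."""
    q = speech_encoder(speech_query)                    # spoken question -> embedding
    d = np.stack([text_encoder(p) for p in passages])  # candidate passages -> embeddings
    scores = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-8)
    context = [passages[i] for i in np.argsort(scores)[::-1][:top_k]]
    prompt = "Context:\n" + "\n".join(context) + "\n\nAnswer the spoken question."
    return glm_voice.generate(prompt=prompt, audio=speech_query)  # streamed speech response
```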
Developed by Zhipu AI, GLM-4-Voice supports:
- Chinese & English understanding and generation
- Real-time streaming dialogue
- Customizable tone, emotion, and speech rate

However, GLM-4-Voice lacks knowledge retrieval, which limits its performance on complex QA tasks.
Architecture Highlights (a loading sketch follows the list):
- Tokenizer: Whisper encoder + vector quantization
- Decoder: CosyVoice-based streaming audio generator
- GLM-4-Voice-9B: Speech-aware version of GLM-4-9B
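As a rough illustration, the components above can be loaded from the public checkpoints roughly as below. This is a minimal sketch assuming the Hugging Face repos THUDM/glm-4-voice-9b and THUDM/glm-4-voice-tokenizer; the exact classes and arguments used in this project may differ:

```python
from transformers import AutoModel, AutoTokenizer, WhisperFeatureExtractor

# Speech-aware LLM: GLM-4-9B extended with discrete speech tokens in its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-voice-9b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/glm-4-voice-9b", trust_remote_code=True).eval()

# Whisper-based speech tokenizer: extracts features that are then vector-quantized
# into the discrete audio tokens the LLM consumes.
feature_extractor = WhisperFeatureExtractor.from_pretrained("THUDM/glm-4-voice-tokenizer")

# The CosyVoice-based decoder (THUDM/glm-4-voice-decoder, cloned below) turns generated
# speech tokens back into a waveform; it is loaded by the project's decoding code rather
# than by transformers, so it is omitted here.
```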
SONAR by Meta supports:
- Multilingual speech/text input
- Speech-text joint embedding in the same space
- Fine-grained retrieval and alignment
Used in this project for cross-modal retrieval-augmented generation (RAG).
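A minimal retrieval sketch, assuming Meta's `sonar` pip package (published as `sonar-space`) and its documented encoder names; `question.wav` and the passage list are placeholders:

```python
import torch
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

passages = [
    "GLM-4-Voice is a speech-aware version of GLM-4-9B.",
    "SONAR embeds speech and text into the same semantic space.",
]

# Speech queries and text passages land in the same SONAR embedding space.
speech_enc = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")
text_enc = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                        tokenizer="text_sonar_basic_encoder")

query_emb = speech_enc.predict(["question.wav"])               # shape (1, 1024)
doc_embs = text_enc.predict(passages, source_lang="eng_Latn")  # shape (N, 1024)

# Cosine-similarity retrieval of the most relevant passages for the spoken query.
scores = torch.nn.functional.cosine_similarity(query_emb, doc_embs)
top = [passages[i] for i in scores.topk(k=min(3, len(passages))).indices.tolist()]
```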
CLAP (by LAION) is a multimodal contrastive learning model that aligns audio and text by mapping them into a shared semantic space.
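As an illustration of the text-audio alignment CLAP provides, here is a minimal sketch using the transformers CLAP classes; the laion/clap-htsat-unfused checkpoint and the 48 kHz dummy audio are assumptions, not necessarily what this project uses:

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

texts = ["a person asking a question", "background music"]
audio = np.random.randn(48000).astype(np.float32)  # 1 s of dummy audio at 48 kHz

inputs = processor(text=texts, audios=[audio], sampling_rate=48000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Higher values mean the audio clip is semantically closer to that text prompt.
print(out.logits_per_audio.softmax(dim=-1))
```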
Our system supports multiple ASR (Automatic Speech Recognition) backends, switchable via command-line arguments to flexibly balance accuracy, efficiency, and language coverage (a dispatch sketch follows the list):
- Whisper (openai/whisper-large-v3)
- Faster-Whisper (faster-whisper)
- MMS (facebook/mms-1b-all)
- Wav2Vec2 (facebook/wav2vec2-base-960h)
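One way such backend switching might be wired up is sketched below; the `--asr` flag name and the `build_asr` helper are illustrative, not this repository's actual interface:

```python
import argparse

def build_asr(backend: str):
    """Return a callable mapping a wav file path to a transcript for the chosen backend."""
    if backend == "faster-whisper":
        from faster_whisper import WhisperModel
        model = WhisperModel("large-v3")
        return lambda wav: "".join(seg.text for seg in model.transcribe(wav)[0])
    # The remaining backends all run through the transformers ASR pipeline.
    from transformers import pipeline
    name = {
        "whisper": "openai/whisper-large-v3",
        "mms": "facebook/mms-1b-all",
        "wav2vec2": "facebook/wav2vec2-base-960h",
    }[backend]
    asr = pipeline("automatic-speech-recognition", model=name)
    return lambda wav: asr(wav)["text"]

parser = argparse.ArgumentParser()
parser.add_argument("--asr", default="whisper",
                    choices=["whisper", "faster-whisper", "mms", "wav2vec2"])
args = parser.parse_args()
transcribe = build_asr(args.asr)  # transcribe("question.wav") -> text
```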
We also support Qwen2.5-Omni for multimodal experiments:
git clone https://github.com/QwenLM/Qwen2.5-Omni.git
python examples/glm_voice_simple.py --chatbot qwen-omni
-
Clone the Repository and Create the Environment
cd GLM-Voice-RAG
pip install -e .[jupyter,linux]      # Linux
# or
pip install -e .[jupyter,non_linux]  # Windows/macOS
Alternatively, create a conda environment:
conda create -n glm-voice python==3.11
conda activate glm-voice
pip install -r requirements.txt
-
Download Related Checkpoints and Datasets
sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/THUDM/glm-4-voice-decoder
git clone https://github.com/hotpotqa/hotpot.git
git clone https://huggingface.co/datasets/the-bird-F/HotpotQA_RGBzh_speech
git clone https://github.com/chen700564/RGB.git
git clone https://github.com/Chia-Hsuan-Lee/Spoken-SQuAD
git clone https://github.com/facebookresearch/voxpopuli
We provide separate run scripts for different datasets; each can run either E2E RAG or ASR RAG via the --rag flag:
# simple (Your data)
python examples/glm_voice_simple.py --rag e2e
# HotpotQA
python examples/glm_voice_hotpot.py --rag e2e
# RGB
python examples/glm_voice_rgb.py --rag e2e
Additionally, we provide a two-round retrieval augmentation strategy, which can be run with the following command (a conceptual sketch follows):
python examples/double_glm_voice_hotpot.py
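One plausible reading of the two-round strategy is sketched below: retrieve once with the original question, draft an answer, then retrieve again conditioned on that draft before producing the final response. The `retrieve` and `generate` callables are placeholders, not the script's actual API:

```python
def two_round_rag(question_wav, retrieve, generate, top_k=3):
    """Two retrieval rounds: the draft answer from round one refines the query for round two."""
    # Round 1: retrieve with the original spoken question and draft an answer.
    round1_docs = retrieve(question_wav, top_k=top_k)
    draft_answer = generate(question_wav, context=round1_docs)

    # Round 2: retrieve again, now also conditioning on the draft answer,
    # then generate the final spoken response with the enriched context.
    round2_docs = retrieve(question_wav, hint=draft_answer, top_k=top_k)
    return generate(question_wav, context=round1_docs + round2_docs)
```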
This project builds upon evaluation work conducted with GLM-4-Voice. The original codebase can be found at:
- https://github.com/THUDM/GLM-4-Voice
The use of GLM-4 model weights must comply with the model license.
-
The code in this repository is released under the Apache 2.0 license.
If you find this project helpful, feel free to ⭐️ Star and 🔁 Fork it!