Implementation for the paper "Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation".
An end-to-end retrieval-augmented generation (E2E RAG) speech dialogue system that enables direct speech-to-text generation with retrieval, bypassing traditional ASR pipelines. Built on top of GLM-4-Voice and SONAR for cross-modal, low-latency interaction.
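At a high level, one dialogue turn in this setup can be sketched as follows; the encoder and model handles (`speech_encoder`, `text_encoder`, `glm_voice`) are illustrative placeholders, not the project's actual API:

```python
import numpy as np

def e2e_rag_turn(speech_query, passages, speech_encoder, text_encoder, glm_voice, top_k=3):
    """One illustrative turn: cross-modal retrieval in a shared embedding space,
    then speech generation conditioned on the retrieved text (placeholder API)."""
    q = speech_encoder(speech_query)                    # spoken question -> embedding
    d = np.stack([text_encoder(p) for p in passages])  # candidate passages -> embeddings
    scores = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-8)
    context = [passages[i] for i in np.argsort(scores)[::-1][:top_k]]
    prompt = "Context:\n" + "\n".join(context) + "\n\nAnswer the spoken question."
    return glm_voice.generate(prompt=prompt, audio=speech_query)  # streamed speech response
```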
Developed by Zhipu AI, GLM-4-Voice supports:
- Chinese & English understanding and generation
- Real-time streaming dialogue
- Customizable tone, emotion, and speech rate

However, GLM-4-Voice lacks knowledge retrieval, which limits its performance on complex QA tasks.
Architecture Highlights (a loading sketch follows the list):
- Tokenizer: Whisper encoder + vector quantization
- Decoder: CosyVoice-based streaming audio generator
- GLM-4-Voice-9B: Speech-aware version of GLM-4-9B
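As a rough illustration, the components above can be loaded from the public checkpoints roughly as below. This is a minimal sketch assuming the Hugging Face repos THUDM/glm-4-voice-9b and THUDM/glm-4-voice-tokenizer; the exact classes and arguments used in this project may differ:

```python
from transformers import AutoModel, AutoTokenizer, WhisperFeatureExtractor

# Speech-aware LLM: GLM-4-9B extended with discrete speech tokens in its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-voice-9b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/glm-4-voice-9b", trust_remote_code=True).eval()

# Whisper-based speech tokenizer: extracts features that are then vector-quantized
# into the discrete audio tokens the LLM consumes.
feature_extractor = WhisperFeatureExtractor.from_pretrained("THUDM/glm-4-voice-tokenizer")

# The CosyVoice-based decoder (THUDM/glm-4-voice-decoder, cloned below) turns generated
# speech tokens back into a waveform; it is loaded by the project's decoding code rather
# than by transformers, so it is omitted here.
```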
SONAR by Meta supports:
- Multilingual speech/text input
- Speech-text joint embedding in the same space
- Fine-grained retrieval and alignment
Used in this project for cross-modal retrieval-augmented generation (RAG).
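A minimal retrieval sketch, assuming Meta's `sonar` pip package (published as `sonar-space`) and its documented encoder names; `question.wav` and the passage list are placeholders:

```python
import torch
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

passages = [
    "GLM-4-Voice is a speech-aware version of GLM-4-9B.",
    "SONAR embeds speech and text into the same semantic space.",
]

# Speech queries and text passages land in the same SONAR embedding space.
speech_enc = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")
text_enc = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                        tokenizer="text_sonar_basic_encoder")

query_emb = speech_enc.predict(["question.wav"])               # shape (1, 1024)
doc_embs = text_enc.predict(passages, source_lang="eng_Latn")  # shape (N, 1024)

# Cosine-similarity retrieval of the most relevant passages for the spoken query.
scores = torch.nn.functional.cosine_similarity(query_emb, doc_embs)
top = [passages[i] for i in scores.topk(k=min(3, len(passages))).indices.tolist()]
```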
CLAP (by LAION) is a multimodal contrastive learning model that aligns audio and text by mapping them into a shared semantic space.
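As an illustration of the text-audio alignment CLAP provides, here is a minimal sketch using the transformers CLAP classes; the laion/clap-htsat-unfused checkpoint and the 48 kHz dummy audio are assumptions, not necessarily what this project uses:

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

texts = ["a person asking a question", "background music"]
audio = np.random.randn(48000).astype(np.float32)  # 1 s of dummy audio at 48 kHz

inputs = processor(text=texts, audios=[audio], sampling_rate=48000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Higher values mean the audio clip is semantically closer to that text prompt.
print(out.logits_per_audio.softmax(dim=-1))
```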
Our system supports multiple ASR (Automatic Speech Recognition) backends, switchable via command-line arguments to flexibly balance accuracy, efficiency, and language coverage (a dispatch sketch follows the list):
- Whisper (openai/whisper-large-v3)
- Faster-Whisper (faster-whisper)
- MMS (facebook/mms-1b-all)
- Wav2Vec2 (facebook/wav2vec2-base-960h)
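One way such backend switching might be wired up is sketched below; the `--asr` flag name and the `build_asr` helper are illustrative, not this repository's actual interface:

```python
import argparse

def build_asr(backend: str):
    """Return a callable mapping a wav file path to a transcript for the chosen backend."""
    if backend == "faster-whisper":
        from faster_whisper import WhisperModel
        model = WhisperModel("large-v3")
        return lambda wav: "".join(seg.text for seg in model.transcribe(wav)[0])
    # The remaining backends all run through the transformers ASR pipeline.
    from transformers import pipeline
    name = {
        "whisper": "openai/whisper-large-v3",
        "mms": "facebook/mms-1b-all",
        "wav2vec2": "facebook/wav2vec2-base-960h",
    }[backend]
    asr = pipeline("automatic-speech-recognition", model=name)
    return lambda wav: asr(wav)["text"]

parser = argparse.ArgumentParser()
parser.add_argument("--asr", default="whisper",
                    choices=["whisper", "faster-whisper", "mms", "wav2vec2"])
args = parser.parse_args()
transcribe = build_asr(args.asr)  # transcribe("question.wav") -> text
```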
We also support Qwen2.5-Omni for multimodal experiments:
git clone https://github.com/QwenLM/Qwen2.5-Omni.git
python examples/glm_voice_simple.py --chatbot qwen-omni
-
Clone the Repository and Create the Environment
cd GLM-Voice-RAG
pip install -e .[jupyter,linux]      # Linux
# or
pip install -e .[jupyter,non_linux]  # Windows/macOS
Alternatively, create a conda environment:
conda create -n glm-voice python==3.11
conda activate glm-voice
pip install -r requirements.txt
-
Download Related Checkpoints and Datasets
sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/THUDM/glm-4-voice-decoder
git clone https://github.com/hotpotqa/hotpot.git
git clone https://huggingface.co/datasets/the-bird-F/HotpotQA_RGBzh_speech
git clone https://github.com/chen700564/RGB.git
git clone https://github.com/Chia-Hsuan-Lee/Spoken-SQuAD
git clone https://github.com/facebookresearch/voxpopuli
We provide separate run scripts for different datasets; each can run either E2E RAG or ASR RAG via the --rag flag:
# simple (Your data)
python examples/glm_voice_simple.py --rag e2e
# HotpotQA
python examples/glm_voice_hotpot.py --rag e2e
# RGB
python examples/glm_voice_rgb.py --rag e2e
Additionally, we provide a two-round retrieval augmentation strategy, which can be run with the following command (a conceptual sketch follows):
python examples/double_glm_voice_hotpot.py
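One plausible reading of the two-round strategy is sketched below: retrieve once with the original question, draft an answer, then retrieve again conditioned on that draft before producing the final response. The `retrieve` and `generate` callables are placeholders, not the script's actual API:

```python
def two_round_rag(question_wav, retrieve, generate, top_k=3):
    """Two retrieval rounds: the draft answer from round one refines the query for round two."""
    # Round 1: retrieve with the original spoken question and draft an answer.
    round1_docs = retrieve(question_wav, top_k=top_k)
    draft_answer = generate(question_wav, context=round1_docs)

    # Round 2: retrieve again, now also conditioning on the draft answer,
    # then generate the final spoken response with the enriched context.
    round2_docs = retrieve(question_wav, hint=draft_answer, top_k=top_k)
    return generate(question_wav, context=round1_docs + round2_docs)
```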
This project builds upon evaluation work conducted with GLM-4-Voice. The original codebase can be found at:
- https://github.com/THUDM/GLM-4-Voice
The use of GLM-4 model weights must comply with the model license.
-
The code in this repository is released under the Apache 2.0 license.
If you find this project helpful, feel free to ⭐️ Star and 🔁 Fork it!