
A complete cross-modal RAG system for end-to-end speech-to-speech large models, including ASR-based Retrieval and E2E Retrieval.


E2E RAG for GLM-4-Voice: A Case Study

Implementation for the paper "Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation".

👉 Chinese README (中文版说明)

Model Architecture

An end-to-end retrieval-augmented generation (E2E RAG) speech dialogue system that enables direct speech-to-text generation with retrieval, bypassing traditional ASR pipelines. Built on top of GLM-4-Voice and SONAR for cross-modal, low-latency interaction.
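
To make the contrast concrete, here is a minimal sketch of the two retrieval paths. The helpers asr_transcribe, embed_text, embed_speech, and glm_voice_generate are hypothetical placeholders for the actual ASR backend, embedders, and GLM-4-Voice generation; only the cosine-similarity retrieval step is concrete.

# Minimal sketch of ASR-based vs. E2E retrieval; the helpers below are placeholders, not this repo's API.
import numpy as np

def asr_transcribe(audio_path): raise NotImplementedError                # plug in an ASR backend
def embed_text(text): raise NotImplementedError                          # plug in a text embedder
def embed_speech(audio_path): raise NotImplementedError                  # plug in a speech embedder (e.g. SONAR)
def glm_voice_generate(audio_path, context): raise NotImplementedError   # plug in GLM-4-Voice

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Rank documents by cosine similarity to the query embedding.
    sims = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def asr_rag(audio_path, docs, doc_vecs):
    # ASR-based RAG: speech -> transcript -> text embedding -> retrieval.
    context = retrieve(embed_text(asr_transcribe(audio_path)), doc_vecs, docs)
    return glm_voice_generate(audio_path, context)

def e2e_rag(audio_path, docs, doc_vecs):
    # E2E RAG: the speech query is embedded directly, with no ASR step in the loop.
    context = retrieve(embed_speech(audio_path), doc_vecs, docs)
    return glm_voice_generate(audio_path, context)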


✨ Overview of the Base Models

GLM-4-Voice

Developed by Zhipu AI, GLM-4-Voice supports Chinese and English understanding and generation, real-time streaming dialogue, and customizable tone, emotion, and speech rate. However, GLM-4-Voice has no built-in knowledge retrieval, which limits its performance on complex QA tasks.

Architecture Highlights:

  • Tokenizer: Whisper encoder + vector quantization
  • Decoder: CosyVoice-based streaming audio generator
  • GLM-4-Voice-9B: Speech-aware version of GLM-4-9B

SONAR: Cross-Modal Embedding

SONAR by Meta supports:

  • Multilingual speech/text input
  • Speech-text joint embedding in the same space
  • Fine-grained retrieval and alignment

Used in this project for cross-modal retrieval-augmented generation (RAG).
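
As a rough illustration of that usage (not the exact code in this repository), the snippet below embeds a text corpus and a spoken query into SONAR's shared space and retrieves by cosine similarity; the pipeline classes and encoder names follow the upstream SONAR README and may differ across versions.

# Cross-modal retrieval sketch with SONAR; class/encoder names follow the upstream README (verify for your version).
import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline

docs = ["GLM-4-Voice is a speech-to-speech dialogue model.",
        "SONAR embeds text and speech in a single semantic space."]

t2vec = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                     tokenizer="text_sonar_basic_encoder")
s2vec = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")

doc_vecs = t2vec.predict(docs, source_lang="eng_Latn")   # text side of the shared space
query_vec = s2vec.predict(["question.wav"])              # spoken query (placeholder path), no ASR involved

sims = torch.nn.functional.cosine_similarity(query_vec, doc_vecs)
print(docs[int(sims.argmax())])                          # best-matching passage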


Additional: Other speech-text embedding models

CLAP, developed by LAION, is a multimodal contrastive learning model that aligns audio and text by mapping them into a shared semantic space.
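
For reference, a comparable audio-text embedding can be obtained with CLAP through Hugging Face Transformers. This is a hedged sketch, assuming the public laion/clap-htsat-unfused checkpoint (CLAP expects 48 kHz audio); it is not code from this repository.

# CLAP audio/text embedding sketch via Hugging Face Transformers; checkpoint choice is an assumption.
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

text_inputs = processor(text=["a spoken question about history"], return_tensors="pt", padding=True)
text_emb = model.get_text_features(**text_inputs)

waveform = torch.zeros(48000).numpy()    # stand-in for a real 48 kHz waveform
audio_inputs = processor(audios=waveform, sampling_rate=48000, return_tensors="pt")
audio_emb = model.get_audio_features(**audio_inputs)

print(torch.nn.functional.cosine_similarity(text_emb, audio_emb))   # audio-text similarity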


Supported ASR Backends

Our system supports multiple ASR (Automatic Speech Recognition) backends, switchable via command-line arguments, so you can flexibly balance accuracy, efficiency, and language coverage.
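
The switch is an ordinary command-line dispatch, roughly like the sketch below; apart from --chatbot qwen-omni (used in the Qwen-Omni example further down) and --rag e2e, the flag values and backend names here are illustrative rather than the exact ones in examples/.

# Illustrative backend/retrieval switch; only "qwen-omni" and "e2e" appear in this README, the rest are assumptions.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--chatbot", default="glm-voice",
                    choices=["glm-voice", "qwen-omni"],       # hypothetical backend set
                    help="which chat / ASR backend to load")
parser.add_argument("--rag", default="e2e",
                    choices=["e2e", "asr"],                   # the "asr" value is an assumption
                    help="retrieval mode: end-to-end or ASR-based")
args = parser.parse_args()

if args.chatbot == "qwen-omni":
    print("loading the Qwen2.5-Omni backend ...")             # placeholder for the real loader
else:
    print("loading the default GLM-4-Voice backend ...")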


Qwen-Omni ✔️

We support Qwen2.5-Omni for multimodal experiments.

git clone https://github.com/QwenLM/Qwen2.5-Omni.git

python examples/glm_voice_simple.py --chatbot qwen-omni

🛠️ Environment Setup

  1. Clone the Repository and Create the Environment

    git clone https://github.com/the-bird-F/GLM-Voice-RAG.git
    cd GLM-Voice-RAG
    pip install -e .[jupyter,linux]   # Linux 
    # or
    pip install -e .[jupyter,non_linux]  # Windows/macOS 

    Alternatively:

    conda create -n glm-voice python=3.11
    conda activate glm-voice 
    pip install -r requirements.txt
  2. Download Related Checkpoints

    sudo apt install git-lfs
    git lfs install
    git clone https://huggingface.co/THUDM/glm-4-voice-decoder

📚 Dataset

HotpotQA & Speech

git clone https://github.com/hotpotqa/hotpot.git
git clone https://huggingface.co/datasets/the-bird-F/HotpotQA_RGBzh_speech
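
If the Hugging Face dataset is packaged in a datasets-compatible layout (an assumption; otherwise read the cloned files directly), it can also be pulled programmatically:

# Sketch: loading the speech QA data with the `datasets` library (assumes a compatible layout).
from datasets import load_dataset

speech_qa = load_dataset("the-bird-F/HotpotQA_RGBzh_speech")
print(speech_qa)    # inspect the available splits and columns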

RGB & Speech

git clone https://github.com/chen700564/RGB.git
git clone https://huggingface.co/datasets/the-bird-F/HotpotQA_RGBzh_speech

Spoken-SQuAD

git clone https://github.com/Chia-Hsuan-Lee/Spoken-SQuAD

VoxPopuli-QA

git clone https://github.com/facebookresearch/voxpopuli

🚀 Quick Start

We provide a separate entry script for each dataset; each script can run either E2E RAG or ASR-based RAG:

# simple (Your data)
python examples/glm_voice_simple.py --rag e2e

# HotpotQA
python examples/glm_voice_hotpot.py --rag e2e

# RGB
python examples/glm_voice_rgb.py --rag e2e

Additionally, we provide a two-round retrieval-augmented generation strategy, which can be run with the following command:

python examples/double_glm_voice_hotpot.py

🙏 Acknowledgements

This project builds upon evaluation work conducted with GLM-4-Voice. The original codebase can be found at https://github.com/THUDM/GLM-4-Voice.

📄 License

  • The use of GLM-4 model weights must comply with the model license.

  • The code in this repository is released under the Apache 2.0 license.

If you find this project helpful, feel free to ⭐️ Star and 🔁 Fork it!
