LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement

rezzsl/LLaSE-G1

 
 



Introduction

LLaSE-G1 is a unified speech enhancement model capable of handling multiple tasks without extra task prompts, including:

  • Noise Suppression (SE)
  • Target Speaker Extraction (TSE)
  • Packet Loss Concealment (PLC)
  • Acoustic Echo Cancellation (AEC)
  • Speech Separation (SS)

To mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens using X-Codec2, maximizing acoustic preservation. The model surpasses prior task-specific discriminative and generative speech enhancement models, demonstrating scaling effects at test time and emerging capabilities for unseen speech enhancement tasks.

For more details, refer to our paper: LLaSE-G1 Paper (https://arxiv.org/abs/2503.00493)

Demo

You can listen to the enhancement results on our Demo Page.

Installation

Checkpoints are available on Hugging Face.

1. Clone the repository

git clone https://github.com/your-repo/LLaSE-G1.git
cd LLaSE-G1

2. Create a Conda environment and install dependencies

conda create -n llase python=3.10
conda activate llase
pip install -r requirements.txt

3. Download Pretrained Models

LLaSE-G1 requires three additional pretrained models, plus the checkpoint of the middle LM hosted on Hugging Face, to function properly. You can download three of them using the provided shell script:

cd ckpt
bash download.sh

Additionally, download WavLM-Large.pt from this URL and place it at ./ckpt/WavLM-Large.pt.

Alternatively, you can download them manually and place them in the ./ckpt/ directory.

After downloading, the directory tree should look like this:

├── ckpt
│   ├── codec_ckpt
│   │   ├── epoch=4-step=1400000.ckpt
│   │   └── hub
│   │       ├── models--facebook--w2v-bert-2.0
│   │       │   ├── config.json
│   │       │   ├── model.safetensors
│   │       │   └── preprocessor_config.json
│   │       └── version.txt
│   ├── download_ckpt.py
│   ├── download.sh
│   ├── model.pt.tar
│   └── WavLM-Large.pt
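
Before running inference, it can help to confirm that the downloads landed where the tree above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` and the `EXPECTED_FILES` list are hypothetical, and the list assumes the layout shown above with paths taken relative to `./ckpt`.

```python
from pathlib import Path

# Expected files, copied from the directory tree in this README.
# Paths are relative to the ckpt/ directory. (Hypothetical helper list.)
EXPECTED_FILES = [
    "codec_ckpt/epoch=4-step=1400000.ckpt",
    "codec_ckpt/hub/models--facebook--w2v-bert-2.0/config.json",
    "codec_ckpt/hub/models--facebook--w2v-bert-2.0/model.safetensors",
    "codec_ckpt/hub/models--facebook--w2v-bert-2.0/preprocessor_config.json",
    "codec_ckpt/hub/version.txt",
    "model.pt.tar",
    "WavLM-Large.pt",
]


def missing_checkpoints(ckpt_dir="./ckpt"):
    """Return the expected files that are not present under ckpt_dir."""
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root; any path it prints still needs to be downloaded or moved into place.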

Inference

The main inference script is inference.py. The inference process consists of two stages:

  1. Extract the 6th-layer features from WavLM.
  2. Use the language model (LM) to predict speech tokens, and then decode them into audio using X-Codec2.
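
The data flow of the two stages can be sketched schematically as below. This is not the code in inference.py: the function `enhance` and its three callable arguments are hypothetical stand-ins for the WavLM feature extractor, the LM, and the X-Codec2 decoder. The `iterations` argument also assumes that repeated inference feeds the previous output back in as input, which this README does not spell out.

```python
from typing import Callable, Sequence


def enhance(
    waveform: Sequence[float],
    extract_features: Callable,  # stage 1: stand-in for WavLM 6th-layer features
    predict_tokens: Callable,    # stage 2: stand-in for the LM predicting speech tokens
    decode_tokens: Callable,     # stage 2: stand-in for the X-Codec2 decoder
    iterations: int = 1,
):
    """Run the two-stage pipeline, optionally several times in a row."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    audio = waveform
    for _ in range(iterations):
        features = extract_features(audio)  # continuous representations
        tokens = predict_tokens(features)   # discrete speech tokens
        audio = decode_tokens(tokens)       # tokens back to audio
    return audio
```

Passing the stages in as callables keeps the sketch independent of any particular model-loading code; in practice each callable would wrap the corresponding checkpoint from ./ckpt/.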

Running Inference

To run inference, configure the parameters in ./config/test.yml:

| Parameter | Description |
| --- | --- |
| infer_feat_too | Whether to extract WavLM features during inference. |
| inference_time | Number of inference iterations. |
| feat_dir | Directory containing extracted features. |
| wav_dir | Directory of processed audio files. |
| task | Task type: SE (Noise Suppression), TSE (Target Speaker Extraction), PLC (Packet Loss Concealment), AEC (Acoustic Echo Cancellation), SS (Speech Separation). |
| filename | Path to a text file that contains the paths of the audio files you want to process. For example: /home/0.wav |
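
The text file expected by the filename parameter can be generated rather than written by hand. The sketch below is an illustration only: the helper `write_file_list` is hypothetical, and it assumes the file holds one audio path per line, which the example path above suggests but does not confirm.

```python
from pathlib import Path


def write_file_list(audio_dir, list_path, pattern="*.wav"):
    """Write the absolute path of every matching audio file, one per line.

    Assumption: the inference script reads one path per line.
    Returns the list of paths that were written.
    """
    paths = sorted(str(p.resolve()) for p in Path(audio_dir).rglob(pattern))
    Path(list_path).write_text(
        "".join(p + "\n" for p in paths), encoding="utf-8"
    )
    return paths
```

For example, `write_file_list("/data/noisy", "filelist.txt")` would collect every .wav file under the (hypothetical) /data/noisy directory, after which filename in ./config/test.yml can point at filelist.txt.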

Command to run inference:

bash inference.sh
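
A quick sanity check of the configuration can catch typos before a long run. This sketch is not part of the repository. It assumes the parameters from the table have already been loaded into a flat Python dict (for instance with a YAML parser) and that task takes exactly the abbreviations listed there. The names `check_config`, `REQUIRED_KEYS`, and `VALID_TASKS` are hypothetical.

```python
# Parameter names and task abbreviations as listed in this README.
REQUIRED_KEYS = (
    "infer_feat_too",
    "inference_time",
    "feat_dir",
    "wav_dir",
    "task",
    "filename",
)
VALID_TASKS = ("SE", "TSE", "PLC", "AEC", "SS")


def check_config(cfg):
    """Return a list of human-readable problems found in cfg (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in cfg:
            problems.append(f"missing parameter: {key}")
    task = cfg.get("task")
    if task is not None and task not in VALID_TASKS:
        problems.append(f"unknown task: {task!r}")
    return problems
```

An empty return value means every listed parameter is present and the task name is one of the five supported abbreviations; it does not verify that the referenced paths exist.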

Results

Samples processed by LLaSE-G1 can be found on our Demo Page.

Model Checkpoints

Our pretrained model is available on Hugging Face.

Hints

Our approach relies on the LLM's comprehension capabilities to determine the task type autonomously, which can be unstable in certain scenarios. A more stable and robust version will be released in an upcoming update.

Citation

@misc{kang2025llaseg1incentivizinggeneralizationcapability,
      title={LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement}, 
      author={Boyi Kang and Xinfa Zhu and Zihan Zhang and Zhen Ye and Mingshuai Liu and Ziqian Wang and Yike Zhu and Guobin Ma and Jun Chen and Longshuai Xiao and Chao Weng and Wei Xue and Lei Xie},
      year={2025},
      eprint={2503.00493},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2503.00493}, 
}

Contact

For any questions, please contact: [email protected]
