LLaSE-G1 is a unified speech enhancement model capable of handling multiple tasks without extra task prompts, including:
- Noise Suppression (SE)
- Target Speaker Extraction (TSE)
- Packet Loss Concealment (PLC)
- Acoustic Echo Cancellation (AEC)
- Speech Separation (SS)
To mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens using X-Codec2, maximizing acoustic preservation. The model surpasses prior task-specific discriminative and generative speech enhancement models, demonstrating scaling effects at test time and emerging capabilities for unseen speech enhancement tasks.
For more details, refer to our paper: LLaSE-G1 Paper (https://arxiv.org/abs/2503.00493)
You can listen to the enhancement results on our Demo Page.
Checkpoints are available on Hugging Face.
git clone https://github.com/your-repo/LLaSE-G1.git
cd LLaSE-G1
conda create -n llase python=3.10
conda activate llase
pip install -r requirements.txt
LLaSE-G1 requires three additional pre-trained models, plus the checkpoint of the middle LM hosted on Hugging Face, to function properly. You can download three of them using the provided shell script:
cd ckpt
bash download.sh
Additionally, download WavLM-Large.pt from this URL and put it at ./ckpt/WavLM-Large.pt.
Alternatively, you can download them manually and place them in the ./ckpt/ directory.
After downloading, the directory tree should look like this:
├── ckpt
│   ├── codec_ckpt
│   │   ├── epoch=4-step=1400000.ckpt
│   │   └── hub
│   │       ├── models--facebook--w2v-bert-2.0
│   │       │   ├── config.json
│   │       │   ├── model.safetensors
│   │       │   └── preprocessor_config.json
│   │       └── version.txt
│   ├── download_ckpt.py
│   ├── download.sh
│   ├── model.pt.tar
│   └── WavLM-Large.pt
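Before running inference, it can help to confirm that every file in the tree above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the file list is copied from the tree above on the assumption that all of those files are needed.

```python
from pathlib import Path

# Files shown in the directory tree above, relative to the checkpoint directory.
EXPECTED_FILES = [
    "codec_ckpt/epoch=4-step=1400000.ckpt",
    "codec_ckpt/hub/models--facebook--w2v-bert-2.0/config.json",
    "codec_ckpt/hub/models--facebook--w2v-bert-2.0/model.safetensors",
    "codec_ckpt/hub/models--facebook--w2v-bert-2.0/preprocessor_config.json",
    "codec_ckpt/hub/version.txt",
    "model.pt.tar",
    "WavLM-Large.pt",
]


def missing_checkpoints(ckpt_dir="./ckpt"):
    """Return the expected files that are not present under ckpt_dir.

    Hypothetical helper for illustration; it only checks that each file exists.
    """
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root; an empty result means the layout matches the tree.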
The main inference script is inference.py. The inference process consists of two stages:
- Extract the 6th-layer features from WavLM.
- Use the language model (LM) to predict speech tokens, and then decode them into audio using X-Codec2.
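The two stages can be pictured as a simple data flow. The sketch below is illustrative only and does not reproduce inference.py: `enhance` and its three callables are hypothetical stand-ins for the WavLM feature extractor, the LM, and the X-Codec2 decoder. The `passes` argument assumes that repeated inference feeds each pass's output audio back in as the next input, which the source does not spell out.

```python
from typing import Callable


def enhance(
    waveform,
    extract_features: Callable,  # stage 1: stand-in for WavLM 6th-layer features
    predict_tokens: Callable,    # stage 2a: stand-in for the LM predicting speech tokens
    decode_tokens: Callable,     # stage 2b: stand-in for the X-Codec2 decoder
    passes: int = 1,
):
    """Run the two-stage flow `passes` times and return the final audio.

    Assumption: each extra pass re-enhances the audio produced by the
    previous pass. The real script may organise iterations differently.
    """
    if passes < 1:
        raise ValueError("passes must be at least 1")
    audio = waveform
    for _ in range(passes):
        features = extract_features(audio)   # continuous representations
        tokens = predict_tokens(features)    # discrete speech tokens
        audio = decode_tokens(tokens)        # tokens back to a waveform
    return audio


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model weights.
    result = enhance(
        [0.5, -0.25, 0.75],
        extract_features=lambda x: [abs(v) for v in x],
        predict_tokens=lambda f: [int(v * 4) for v in f],
        decode_tokens=lambda t: [v / 4 for v in t],
    )
    print(result)
```

Passing the stages in as callables keeps the sketch independent of any particular model-loading code.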
To run inference, configure the parameters in ./config/test.yml:
Parameter | Description |
---|---|
infer_feat_too | Whether to extract WavLM features during inference. |
inference_time | Number of inference iterations. |
feat_dir | Directory containing extracted features. |
wav_dir | Directory of processed audio files. |
task | Task type: SE (Noise Suppression), TSE (Target Speaker Extraction), PLC (Packet Loss Concealment), AEC (Acoustic Echo Cancellation), SS (Speech Separation). |
filename | Path to a text file that lists the paths of the audio files you want to process, for example: /home/0.wav |
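The text file expected by the filename parameter can be generated rather than written by hand. This is a minimal sketch under one assumption: that the file holds one audio path per line. The helper `write_file_list` and its arguments are hypothetical and not part of the repository.

```python
from pathlib import Path


def write_file_list(audio_dir, list_path, pattern="*.wav"):
    """Write the absolute path of each matching audio file to list_path.

    Assumption: the inference script reads one path per line.
    Returns the number of paths written.
    """
    paths = sorted(p.resolve() for p in Path(audio_dir).glob(pattern))
    Path(list_path).write_text(
        "".join(f"{p}\n" for p in paths), encoding="utf-8"
    )
    return len(paths)
```

For example, `write_file_list("/home", "filelist.txt")` would list every .wav file directly under /home; the resulting file path is what goes into the filename parameter.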
Command to run inference:
bash inference.sh
Samples processed by LLaSE-G1 can be found on our Demo Page.
Our pretrained model is available on Hugging Face.
Our approach relies on the LLM's comprehension capabilities to determine the task type autonomously, which can be unstable in some scenarios. A more stable and robust iteration will be released in an upcoming version.
@misc{kang2025llaseg1incentivizinggeneralizationcapability,
title={LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement},
author={Boyi Kang and Xinfa Zhu and Zihan Zhang and Zhen Ye and Mingshuai Liu and Ziqian Wang and Yike Zhu and Guobin Ma and Jun Chen and Longshuai Xiao and Chao Weng and Wei Xue and Lei Xie},
year={2025},
eprint={2503.00493},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2503.00493},
}
For any questions, please contact: [email protected]