
GigaAM: the family of open-source acoustic models for speech processing






Setup

Requirements

  • Python ≥ 3.10
  • ffmpeg installed and added to your system's PATH
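A quick way to confirm that ffmpeg is resolvable from Python (an optional sanity check using only the standard library):

import shutil

# ffmpeg must be discoverable on PATH for audio decoding
print("ffmpeg found at:", shutil.which("ffmpeg"))  # None means it is not on PATH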

Install the GigaAM Package

# Clone the repository
git clone https://github.com/salute-developers/GigaAM.git
cd GigaAM

# Install the package requirements
pip install -e .

# (optional) Verify the installation:
pip install -e .[tests]
pytest -v tests/test_loading.py -m partial  # or `-m full` to test all models
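
After installation, a minimal import check (optional; it only confirms that the package is importable):

import gigaam

# Shows where the editable install resolved the package from
print("gigaam imported from:", gigaam.__file__)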

GigaAM overview

GigaAM is a Conformer-based foundational model (220-240M parameters) pre-trained on diverse Russian speech data. It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned models for speech recognition and emotion recognition. More information about GigaAM-v1 can be found in our post on Habr. We fine-tuned the GigaAM encoder for ASR using CTC and RNNT decoders. The GigaAM family includes three lines of models:

| Line | Pretrain Method | Pretrain (hours) | ASR (hours) | Available Versions |
| ---- | --------------- | ---------------- | ----------- | ------------------ |
| v1   | Wav2vec 2.0     | 50,000           | 2,000       | v1_ssl, emo, v1_ctc, v1_rnnt |
| v2   | HuBERT–CTC      | 50,000           | 2,000       | v2_ssl, v2_ctc, v2_rnnt |
| v3   | HuBERT–CTC      | 700,000          | 4,000       | v3_ssl, v3_ctc, v3_rnnt, v3_e2e_ctc, v3_e2e_rnnt |

The v3_e2e_ctc and v3_e2e_rnnt models additionally support punctuation and text normalization.
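
The identifiers in the "Available Versions" column are the names accepted by gigaam.load_model (see the Usage section below); a minimal sketch:

import gigaam

# SSL checkpoints expose audio embeddings; CTC/RNNT checkpoints expose ASR
encoder = gigaam.load_model("v3_ssl")      # self-supervised encoder
asr = gigaam.load_model("v3_e2e_rnnt")     # ASR with punctuation and normalization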

Model Performance

GigaAM-v3 training incorporates new internal datasets: call-center recordings, music, speech with atypical characteristics, and voice messages. As a result, the models perform on average 30% better on these new domains while maintaining the same quality as GigaAM-v2 on public benchmarks. In end-to-end ASR comparisons of e2e_ctc and e2e_rnnt against Whisper (judged via an independent LLM-as-a-Judge side-by-side evaluation), GigaAM models win by an average margin of 70:30. Our emotion recognition model, GigaAM-Emo, outperforms existing models by 15% in Macro F1-score.

For detailed results, see here.


Usage

Model inference

Note: the .transcribe function only supports audio up to 25 seconds long. To enable .transcribe_longform, install the additional pyannote.audio dependencies:

Long-form setup instructions
pip install -e .[longform]
# optionally run longform testing
pip install -e .[tests]
HF_TOKEN=<your hf token> pytest -v tests/test_longform.py

import gigaam

# Load test audio
audio_path = gigaam.utils.download_short_audio()
long_audio_path = gigaam.utils.download_long_audio()

# Audio embeddings
model_name = "v3_ssl"       # Options: `v1_ssl`, `v2_ssl`, `v3_ssl`
model = gigaam.load_model(model_name)
embedding, _ = model.embed_audio(audio_path)
print(embedding)

# ASR
model_name = "v3_e2e_rnnt"  # Options: any model version with suffix `_ctc` or `_rnnt`
model = gigaam.load_model(model_name)
transcription = model.transcribe(audio_path)
print(transcription)

# and long-form ASR
import os
os.environ["HF_TOKEN"] = <HF_TOKEN with read access to "pyannote/segmentation-3.0">
utterances = model.transcribe_longform(long_audio_path)
for utt in utterances:
   transcription, (start, end) = utt["transcription"], utt["boundaries"]
   print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")

# Emotion recognition
model = gigaam.load_model("emo")
emotion2prob = model.get_probs(audio_path)
print(", ".join([f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()]))

Loading from Hugging Face

Note: install the requirements used in the example (notably the transformers package).

from transformers import AutoModel

model = AutoModel.from_pretrained("ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True)
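
The revision argument selects the model variant. In the sketch below, the revision name other than e2e_rnnt is an assumption based on the version table above (check the model card for the exact list); moving the model to GPU follows standard PyTorch module semantics:

from transformers import AutoModel
import torch

# "e2e_ctc" is a hypothetical revision name mirroring the GigaAM version suffixes
model = AutoModel.from_pretrained(
    "ai-sage/GigaAM-v3",
    revision="e2e_ctc",
    trust_remote_code=True,
)
model = model.eval()
if torch.cuda.is_available():
    model = model.to("cuda")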

ONNX Export and Inference

Note: GPU support can be enabled with pip install onnxruntime-gpu==1.23.* if applicable.
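
If onnxruntime-gpu is installed, you can confirm that the CUDA execution provider is visible before exporting (an optional check using the standard onnxruntime API):

import onnxruntime as ort

# With onnxruntime-gpu this should include "CUDAExecutionProvider"
print(ort.get_available_providers())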

  1. Export the model to ONNX using the model.to_onnx method:

    onnx_dir = "onnx"
    model_version = "v3_ctc"  # Options: any version
    
    model = gigaam.load_model(model_version)
    model.to_onnx(dir_path=onnx_dir)
  2. Run ONNX inference:

    from gigaam.onnx_utils import load_onnx, infer_onnx
    
    sessions, model_cfg = load_onnx(onnx_dir, model_version)
    result = infer_onnx(audio_path, model_cfg, sessions)
    print(result)  # string for ctc / rnnt, np.ndarray for ssl / emo
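
The same export and inference helpers cover the non-ASR checkpoints too; a short sketch for an SSL encoder (the output directory name is illustrative):

ssl_dir = "onnx_ssl"        # illustrative output directory
ssl_version = "v3_ssl"

gigaam.load_model(ssl_version).to_onnx(dir_path=ssl_dir)
sessions, ssl_cfg = load_onnx(ssl_dir, ssl_version)
features = infer_onnx(audio_path, ssl_cfg, sessions)
print(features.shape)       # np.ndarray of encoder outputs for ssl / emo models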

These and more advanced examples (e.g. custom audio loading and batching) can be found in the Colab notebook.


Citation

If you use GigaAM in your research, please cite our paper:

@inproceedings{kutsakov25_interspeech,
  title     = {{GigaAM: Efficient Self-Supervised Learner for Speech Recognition}},
  author    = {Aleksandr Kutsakov and Alexandr Maximenko and Georgii Gospodinov and Pavel Bogomolov and Fyodor Minkin},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {1213--1217},
  doi       = {10.21437/Interspeech.2025-1616},
  issn      = {2958-1796},
}

Links