ASR EvalKit

A modular, extensible toolkit for evaluating Automatic Speech Recognition (ASR) models. It provides a unified Python API and a Command-Line Interface (CLI) to evaluate HuggingFace, NeMo, and vLLM-backed models across various datasets.

Key Features

  • Unified Provider System: Evaluate transformers, NVIDIA NeMo, or vLLM models seamlessly.
  • Dataset Support: Native support for HuggingFace datasets (with streaming) and local NeMo manifests.
  • Extensible: Effortlessly add new models, metrics, or normalisers via clean abstract base classes.
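As an illustration of the extension mechanism, the sketch below shows what a custom metric built on an abstract base class might look like. The `Metric` base class and `compute` signature here are assumptions for illustration only; the actual base classes and registration mechanism are defined in the package (see the docs/ guides).

```python
from abc import ABC, abstractmethod

# Hypothetical shape of a metric extension point; the real base class
# lives inside asr_evalkit and may differ.
class Metric(ABC):
    @abstractmethod
    def compute(self, references: list[str], hypotheses: list[str]) -> float:
        ...

class CharErrorRate(Metric):
    """Toy character error rate: total edit distance / total reference chars."""

    def compute(self, references, hypotheses):
        total_dist, total_chars = 0, 0
        for ref, hyp in zip(references, hypotheses):
            total_dist += self._levenshtein(ref, hyp)
            total_chars += len(ref)
        return total_dist / max(total_chars, 1)

    @staticmethod
    def _levenshtein(a: str, b: str) -> int:
        # Classic row-by-row dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,          # deletion
                               cur[j - 1] + 1,       # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]
```

A concrete subclass like this would then be passed wherever the toolkit accepts a metric, keeping model code and scoring code decoupled.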

Installation

1. Install uv (recommended)

uv is a fast Python package manager. Install it with:

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone and create a virtual environment

git clone https://github.com/knoveleng/asr-evalkit.git
cd asr-evalkit

# Create a virtual environment named 'asr'
uv venv asr --python 3.12

# Activate it
source asr/bin/activate

3. Install dependencies

# Install the package in editable mode (reads from requirements.txt via setup.py)
uv pip install -e .

# Optional — install NeMo and dev tools
uv pip install -e ".[nemo,dev]"

Note: For vLLM-backed models (seallm_audio, qwen3_asr), install vLLM separately:

uv pip install vllm

Quick Start: Command-Line Interface (CLI)

The easiest way to run evaluations is via the asr-evalkit command.

1. Evaluate a HuggingFace Model on a HuggingFace Dataset

asr-evalkit \
  --evaluator whisper \
  --model openai/whisper-large-v3-turbo \
  --dataset openslr/librispeech_asr \
  --dataset-config clean \
  --dataset-split test \
  --streaming \
  --audio-column audio \
  --text-column text \
  --output-file results.json

2. Evaluate a NeMo Model on a HuggingFace Dataset

asr-evalkit \
  --evaluator canary \
  --model nvidia/canary-1b \
  --dataset openslr/librispeech_asr \
  --dataset-config clean \
  --dataset-split test \
  --streaming \
  --audio-column audio_filepath \
  --text-column text \
  --output-file canary_results.json

Example Scripts

The scripts/ directory contains ready-to-use examples you can copy and adapt:

| Script | Description |
| --- | --- |
| scripts/evaluate_benchmarks.sh | Batch benchmark runner: evaluates one or more models across a configurable list of HuggingFace datasets in a single run. Edit the MODELS and BENCHMARKS arrays at the top to select which models and datasets to run. |
| scripts/run_cli.sh | CLI examples: commented-out one-off asr-evalkit invocations for Whisper, MERaLiON, Qwen3-ASR (vLLM), Parakeet (NeMo), and more. |
| scripts/run.py | Python API example: shows all three provider types (HuggingFace, vLLM, NeMo) and both dataset loaders side by side, with comments. |

Running the batch benchmark script:

# Run with defaults (10 samples per dataset, results → results/benchmarks/)
bash scripts/evaluate_benchmarks.sh

# Override sample count and output directory
MAX_SAMPLES=100 OUTPUT_DIR=results/full bash scripts/evaluate_benchmarks.sh

Tip: evaluate_benchmarks.sh is designed to be your personal run script — it is listed in .gitignore so you can freely customise it (add models, change datasets, tweak flags) without affecting the repo.
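The MAX_SAMPLES and OUTPUT_DIR overrides above rely on standard shell default expansion. A minimal sketch of that pattern (the variable names come from the example above; the script's actual internals are assumed):

```shell
#!/usr/bin/env bash
# Fall back to the defaults when the caller has not exported an override.
MAX_SAMPLES="${MAX_SAMPLES:-10}"
OUTPUT_DIR="${OUTPUT_DIR:-results/benchmarks}"

echo "Evaluating up to ${MAX_SAMPLES} samples per dataset; writing to ${OUTPUT_DIR}/"
```

`${VAR:-default}` uses the exported value when present and the literal default otherwise, which is why prefixing the command with `MAX_SAMPLES=100` changes the run without editing the script.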


Quick Start: Python API

The Python API offers the same capabilities as the CLI and can be embedded directly in your own scripts.

1. HuggingFace Model & Dataset

from asr_evalkit.evaluators.providers import HuggingFaceProvider
from asr_evalkit.config import ModelConfig, NormalizerConfig
from asr_evalkit.datasets import HuggingFaceDataset
from asr_evalkit.runner import Runner

# Configure Provider (Handles weights + precision)
provider = HuggingFaceProvider(
    model_id="openai/whisper-large-v3-turbo",
    device="cuda",
    use_fp16=True
)

# Set up Runner
model_cfg = ModelConfig(evaluator="whisper", provider=provider)
runner = Runner(model=model_cfg)

# Configure Dataset
hf_dataset = HuggingFaceDataset(
    "openslr/librispeech_asr",
    config="clean",
    split="test",
    audio_column="audio",
    text_column="text",
    streaming=True,
    max_samples=10
)

# Run Evaluation
# Note: Default behaviour (lowercase + Unicode NFC + remove punctuation + remove extra whitespace)
results = runner.run(dataset=hf_dataset, normalizer=NormalizerConfig())

print(f"WER: {results['wer']*100:.2f}%")
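The default normalisation noted in the comment above (lowercase + Unicode NFC + remove punctuation + collapse extra whitespace) can be sketched as a standalone function. This is an illustration of the described behaviour, not the package's actual implementation:

```python
import re
import string
import unicodedata

def normalize(text: str) -> str:
    """Sketch of the default text normalisation described above."""
    text = unicodedata.normalize("NFC", text).lower()        # NFC + lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return " ".join(text.split())                            # collapse whitespace
```

Applying the same normalisation to both references and hypotheses before scoring keeps the WER from being inflated by casing or punctuation differences.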

2. NeMo Model on a Local Manifest

from asr_evalkit.evaluators.providers import NeMoProvider
from asr_evalkit.config import ModelConfig
from asr_evalkit.datasets import NeMoDataset
from asr_evalkit.runner import Runner

# Use NeMo provider instead
provider = NeMoProvider(model_id="nvidia/parakeet-tdt-0.6b-v3", device="cuda")
runner = Runner(model=ModelConfig(evaluator="parakeet", provider=provider))

# Load a local NeMo dataset manifest
nemo_dataset = NeMoDataset(
    "path/to/test_manifest.jsonl",
    audio_column="audio_filepath",
    text_column="text",
    max_samples=10
)

results = runner.run(dataset=nemo_dataset)
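A NeMo manifest is a JSON-lines file: one JSON object per line, each carrying at least an audio path and the reference text (a duration field is conventional in NeMo manifests). The paths and utterances below are made-up placeholders:

```python
import json

# Build a minimal NeMo-style manifest: one JSON object per line.
samples = [
    {"audio_filepath": "audio/utt1.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "audio/utt2.wav", "duration": 1.8, "text": "good morning"},
]
with open("test_manifest.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

The key names must match the `audio_column` and `text_column` arguments passed to `NeMoDataset` above.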

Supported Evaluators & Providers

To see all evaluators natively supported by the package, run:

asr-evalkit --list-evaluators

| Evaluator Name | Supported Provider | Target Models |
| --- | --- | --- |
| whisper | HuggingFaceProvider | openai/whisper-* |
| canary | NeMoProvider | nvidia/canary-* |
| parakeet | NeMoProvider | nvidia/parakeet-* |
| qwen_omni | HuggingFaceProvider | Qwen/Qwen2.5-Omni-* |
| qwen3_asr | VLLMProvider | Qwen/Qwen3-ASR-* |
| meralion | HuggingFaceProvider | MERaLiON/MERaLiON-* |
| phi4 | HuggingFaceProvider | microsoft/Phi-4-multimodal-instruct |
| seallm_audio | VLLMProvider | SeaLLMs/SeaLLMs-Audio-* |

Benchmark Results

| Model | aishell1_mandarin (CER) | aishell3_mandarin (CER) | cv_mandarin (CER) | cv_tamil (WER) | fleurs_malay (WER) | fleurs_mandarin (CER) | fleurs_tamil (WER) | librispeech_english (WER) | mesolitica_malaysian (WER) | nsc_singlish (WER) | slr127_tamil (WER) | slr65_tamil (WER) | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| whisper-large-v3-turbo | 9.64 | 16.81 | 17.91 | 74.50 | 8.88 | 10.63 | 66.90 | 3.04 | 28.47 | 32.02 | 69.56 | 58.13 | 33.04 |
| SeaLLMs-Audio-7B | 9.65 | 9.76 | 8.68 | 126.70 | 26.25 | 37.09 | 105.31 | 94.74 | 71.34 | 9.53 | 138.65 | 127.24 | 63.75 |
| Qwen2.5-Omni-3B | 28.25 | 44.55 | 46.36 | 318.36 | 74.69 | 54.74 | 311.67 | 29.21 | 211.40 | 34.79 | 448.82 | 465.58 | 172.37 |
| Qwen2.5-Omni-7B | 7.33 | 22.58 | 14.49 | 252.06 | 43.92 | 16.68 | 326.43 | 13.80 | 158.06 | 22.96 | 303.96 | 239.15 | 118.45 |
| Qwen3-ASR-0.6B | 2.08 | 2.59 | 10.06 | 121.10 | 18.71 | 9.75 | 130.09 | 2.74 | 47.29 | 7.64 | 129.12 | 127.00 | 50.68 |
| Qwen3-ASR-1.7B | 1.52 | 2.08 | 7.50 | 139.96 | 10.87 | 9.33 | 147.23 | 2.31 | 39.00 | 6.22 | 144.49 | 134.63 | 53.76 |
| MERaLiON-2-10B-ASR | 3.09 | 4.07 | 8.83 | 31.78 | 8.55 | 11.99 | 28.68 | 2.54 | 25.90 | 4.62 | 22.42 | 19.29 | 14.31 |
| polyglot-lion-0.6b | 1.93 | 2.32 | 6.16 | 42.16 | 14.45 | 9.19 | 37.68 | 2.67 | 24.33 | 6.09 | 28.14 | 23.07 | 16.52 |
| polyglot-lion-1.7b | 1.45 | 1.86 | 4.91 | 39.19 | 9.98 | 8.00 | 37.28 | 2.10 | 21.51 | 5.28 | 26.83 | 19.75 | 14.85 |
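The AVG column in the table above is the unweighted mean of the twelve per-dataset scores (note that it mixes WER and CER percentages). For example, the whisper-large-v3-turbo average can be reproduced directly:

```python
# Per-dataset scores for whisper-large-v3-turbo, in table order.
scores = [9.64, 16.81, 17.91, 74.50, 8.88, 10.63,
          66.90, 3.04, 28.47, 32.02, 69.56, 58.13]
avg = sum(scores) / len(scores)
print(f"{avg:.2f}")  # 33.04, matching the AVG column
```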

Architecture & Customisation

ASR EvalKit is structured around a strict provider/evaluator separation to maintain clean code and easy API mapping.

If you need to customise the toolkit, please refer to the detailed guides in the docs/ folder.


⚠️ Known Issues

AssertionError: duplicate template name when running outside a virtual environment

If you see this error when using vLLM-backed models (e.g. seallm_audio, qwen3_asr) with the system Python, your torch installation has stale kernel files from a previous upgrade. See Troubleshooting → duplicate template name for the one-line fix.


Citation

If you use ASR EvalKit in your research, please cite:

@software{dang2026asrevalkit,
  author       = {Quy-Anh Dang},
  title        = {ASR EvalKit: A Modular Toolkit for Evaluating Automatic Speech Recognition Models},
  year         = {2026},
  url          = {https://github.com/knoveleng/asr-evalkit},
}
