A modular, extensible toolkit for evaluating Automatic Speech Recognition (ASR) models. It provides a unified Python API and a Command-Line Interface (CLI) to evaluate HuggingFace, NeMo, and vLLM-backed models across various datasets.
- Unified Provider System: Evaluate `transformers`, NVIDIA NeMo, or vLLM models seamlessly.
- Dataset Support: Native support for HuggingFace `datasets` (with streaming) and local NeMo manifests.
- Extensible: Effortlessly add new models, metrics, or normalisers via clean abstract base classes.
uv is a fast Python package manager. Install it with:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

```bash
git clone https://github.com/knoveleng/asr-evalkit.git
cd asr-evalkit

# Create a virtual environment named 'asr'
uv venv asr --python 3.12

# Activate it
source asr/bin/activate

# Install the package in editable mode (reads from requirements.txt via setup.py)
uv pip install -e .

# Optional — install NeMo and dev tools
uv pip install -e ".[nemo,dev]"
```

Note: For vLLM-backed models (`seallm_audio`, `qwen3_asr`), install vLLM separately:

```bash
uv pip install vllm
```
The easiest way to run evaluations is via the `asr-evalkit` command.
```bash
asr-evalkit \
  --evaluator whisper \
  --model openai/whisper-large-v3-turbo \
  --dataset openslr/librispeech_asr \
  --dataset-config clean \
  --dataset-split test \
  --streaming \
  --audio-column audio \
  --text-column text \
  --output-file results.json
```

```bash
asr-evalkit \
  --evaluator canary \
  --model nvidia/canary-1b \
  --dataset openslr/librispeech_asr \
  --dataset-config clean \
  --dataset-split test \
  --streaming \
  --audio-column audio_filepath \
  --text-column text \
  --output-file canary_results.json
```

The `scripts/` directory contains ready-to-use examples you can copy and adapt:
| Script | Description |
|---|---|
| `scripts/evaluate_benchmarks.sh` | Batch benchmark runner — evaluates one or more models across a configurable list of HuggingFace datasets in a single run. Edit the `MODELS` and `BENCHMARKS` arrays at the top to select which models and datasets to run. |
| `scripts/run_cli.sh` | CLI examples — commented-out one-off `asr-evalkit` invocations for Whisper, MERaLiON, Qwen3-ASR (vLLM), Parakeet (NeMo), and more. |
| `scripts/run.py` | Python API example — shows all three provider types (HuggingFace, vLLM, NeMo) and both dataset loaders side-by-side with comments. |
Running the batch benchmark script:
```bash
# Run with defaults (10 samples per dataset, results → results/benchmarks/)
bash scripts/evaluate_benchmarks.sh

# Override sample count and output directory
MAX_SAMPLES=100 OUTPUT_DIR=results/full bash scripts/evaluate_benchmarks.sh
```

Tip: `evaluate_benchmarks.sh` is designed to be your personal run script — it is listed in `.gitignore` so you can freely customise it (add models, change datasets, tweak flags) without affecting the repo.
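For orientation, the array-driven pattern described above can be sketched as below. This is a hypothetical dry-run sketch, not the actual script: the real variable names may differ, and the `dataset:config` encoding, the `--max-samples` flag, and the fixed `whisper` evaluator are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an array-driven benchmark loop (dry run via echo).
MODELS=(
  "openai/whisper-large-v3-turbo"
)
# "dataset:config" pairs, split apart below (assumed encoding)
BENCHMARKS=(
  "openslr/librispeech_asr:clean"
)
MAX_SAMPLES="${MAX_SAMPLES:-10}"
OUTPUT_DIR="${OUTPUT_DIR:-results/benchmarks}"

for model in "${MODELS[@]}"; do
  for bench in "${BENCHMARKS[@]}"; do
    dataset="${bench%%:*}"   # text before the first ':'
    config="${bench##*:}"    # text after the last ':'
    # echo instead of executing, so the sketch is a dry run;
    # a real script would also map each model to its evaluator
    echo "asr-evalkit --evaluator whisper --model ${model} --dataset ${dataset}" \
         "--dataset-config ${config} --max-samples ${MAX_SAMPLES}" \
         "--output-file ${OUTPUT_DIR}/$(basename "${model}").json"
  done
done
```

Overriding `MAX_SAMPLES` or `OUTPUT_DIR` at invocation time then works exactly as shown in the commands above.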
The Python API offers the exact same capabilities as the CLI but allows integration into your own scripts.
```python
from asr_evalkit.evaluators.providers import HuggingFaceProvider
from asr_evalkit.config import ModelConfig, NormalizerConfig
from asr_evalkit.datasets import HuggingFaceDataset
from asr_evalkit.runner import Runner

# Configure the provider (handles weights + precision)
provider = HuggingFaceProvider(
    model_id="openai/whisper-large-v3-turbo",
    device="cuda",
    use_fp16=True
)

# Set up the runner
model_cfg = ModelConfig(evaluator="whisper", provider=provider)
runner = Runner(model=model_cfg)

# Configure the dataset
hf_dataset = HuggingFaceDataset(
    "openslr/librispeech_asr",
    config="clean",
    split="test",
    audio_column="audio",
    text_column="text",
    streaming=True,
    max_samples=10
)

# Run the evaluation
# Note: default behaviour = lowercase + Unicode NFC + remove punctuation + remove extra whitespace
results = runner.run(dataset=hf_dataset, normalizer=NormalizerConfig())
print(f"WER: {results['wer']*100:.2f}%")
```
```python
from asr_evalkit.evaluators.providers import NeMoProvider
from asr_evalkit.config import ModelConfig
from asr_evalkit.datasets import NeMoDataset
from asr_evalkit.runner import Runner

# Use the NeMo provider instead
provider = NeMoProvider(model_id="nvidia/parakeet-tdt-0.6b-v3", device="cuda")
runner = Runner(model=ModelConfig(evaluator="parakeet", provider=provider))

# Load a local NeMo dataset manifest
nemo_dataset = NeMoDataset(
    "path/to/test_manifest.jsonl",
    audio_column="audio_filepath",
    text_column="text",
    max_samples=10
)

results = runner.run(dataset=nemo_dataset)
```

To see a list of all currently supported models natively built into the package, run:
```bash
asr-evalkit --list-evaluators
```

| Evaluator Name | Supported Provider | Target Models |
|---|---|---|
| `whisper` | `HuggingFaceProvider` | `openai/whisper-*` |
| `canary` | `NeMoProvider` | `nvidia/canary-*` |
| `parakeet` | `NeMoProvider` | `nvidia/parakeet-*` |
| `qwen_omni` | `HuggingFaceProvider` | `Qwen/Qwen2.5-Omni-*` |
| `qwen3_asr` | `VLLMProvider` | `Qwen/Qwen3-ASR-*` |
| `meralion` | `HuggingFaceProvider` | `MERaLiON/MERaLiON-*` |
| `phi4` | `HuggingFaceProvider` | `microsoft/Phi-4-multimodal-instruct` |
| `seallm_audio` | `VLLMProvider` | `SeaLLMs/SeaLLMs-Audio-*` |
| Model | aishell1_mandarin (CER) | aishell3_mandarin (CER) | cv_mandarin (CER) | cv_tamil (WER) | fleurs_malay (WER) | fleurs_mandarin (CER) | fleurs_tamil (WER) | librispeech_english (WER) | mesolitica_malaysian (WER) | nsc_singlish (WER) | slr127_tamil (WER) | slr65_tamil (WER) | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| whisper-large-v3-turbo | 9.64 | 16.81 | 17.91 | 74.50 | 8.88 | 10.63 | 66.90 | 3.04 | 28.47 | 32.02 | 69.56 | 58.13 | 33.04 |
| SeaLLMs-Audio-7B | 9.65 | 9.76 | 8.68 | 126.70 | 26.25 | 37.09 | 105.31 | 94.74 | 71.34 | 9.53 | 138.65 | 127.24 | 63.75 |
| Qwen2.5-Omni-3B | 28.25 | 44.55 | 46.36 | 318.36 | 74.69 | 54.74 | 311.67 | 29.21 | 211.40 | 34.79 | 448.82 | 465.58 | 172.37 |
| Qwen2.5-Omni-7B | 7.33 | 22.58 | 14.49 | 252.06 | 43.92 | 16.68 | 326.43 | 13.80 | 158.06 | 22.96 | 303.96 | 239.15 | 118.45 |
| Qwen3-ASR-0.6B | 2.08 | 2.59 | 10.06 | 121.10 | 18.71 | 9.75 | 130.09 | 2.74 | 47.29 | 7.64 | 129.12 | 127.00 | 50.68 |
| Qwen3-ASR-1.7B | 1.52 | 2.08 | 7.50 | 139.96 | 10.87 | 9.33 | 147.23 | 2.31 | 39.00 | 6.22 | 144.49 | 134.63 | 53.76 |
| MERaLiON-2-10B-ASR | 3.09 | 4.07 | 8.83 | 31.78 | 8.55 | 11.99 | 28.68 | 2.54 | 25.90 | 4.62 | 22.42 | 19.29 | 14.31 |
| polyglot-lion-0.6b | 1.93 | 2.32 | 6.16 | 42.16 | 14.45 | 9.19 | 37.68 | 2.67 | 24.33 | 6.09 | 28.14 | 23.07 | 16.52 |
| polyglot-lion-1.7b | 1.45 | 1.86 | 4.91 | 39.19 | 9.98 | 8.00 | 37.28 | 2.10 | 21.51 | 5.28 | 26.83 | 19.75 | 14.85 |
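The comparisons above hinge on consistent text normalisation before scoring. As a minimal, self-contained sketch (not the package's actual implementation), the default behaviour noted in the API example (lowercase + Unicode NFC + punctuation removal + whitespace collapse) together with a plain word-level edit-distance WER might look like:

```python
import re
import string
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, NFC-normalise, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit distances for ref[:0]
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution / match
                            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

ref = normalize("Hello, world!")   # -> "hello world"
hyp = normalize("hello word")
print(f"WER: {wer(ref, hyp)*100:.2f}%")  # one substitution over two words -> 50.00%
```

Because insertions also count as errors, WER/CER can exceed 100%, which explains the values above 100 in some rows of the table.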
ASR EvalKit is structured around a strict provider/evaluator separation to maintain clean code and easy API mapping.
If you need to customise the toolkit, please refer to the detailed guides in the docs/ folder:
- Architecture Overview — Understand how Providers, Evaluators, and Runners interact.
- Adding a Custom Evaluator — Step-by-step guide to adding support for a new model architecture.
- Adding a Custom Provider — How to support a new backend (e.g. ONNX Runtime, TensorRT-LLM).
- Dataset Configurations — Details on HF streaming vs. NeMo manifests.
- Metrics & Normalisers — How to write custom scoring metrics or ground-truth normalisers.
- CLI Reference — Full command-line options.
- Troubleshooting — Common runtime errors and fixes (vLLM, CUDA, torch, peft).
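As a quick reference alongside the Dataset Configurations guide: a NeMo-style manifest is a JSON-lines file with one utterance per line. The snippet below is a hypothetical example using the field names from the API snippets above (`audio_filepath`, `text`); the `duration` field (in seconds) is conventionally included as well, and the file paths are placeholders.

```python
import json

# Build a tiny NeMo-style JSONL manifest: one JSON object per line.
rows = [
    {"audio_filepath": "audio/utt_0001.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "audio/utt_0002.wav", "duration": 1.8, "text": "good morning"},
]

with open("test_manifest.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

A file like this can then be passed to `NeMoDataset` as shown in the Python API example.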
`AssertionError: duplicate template name` when running outside a virtual environment

If you see this error when using vLLM-backed models (e.g. `seallm_audio`, `qwen3_asr`) with the system Python, your torch installation has stale kernel files from a previous upgrade. See Troubleshooting → duplicate template name for the one-line fix.
If you use ASR EvalKit in your research, please cite:
```bibtex
@software{dang2026asrevalkit,
  author = {Quy-Anh Dang},
  title  = {ASR EvalKit: A Modular Toolkit for Evaluating Automatic Speech Recognition Models},
  year   = {2026},
  url    = {https://github.com/knoveleng/asr-evalkit},
}
```