A modular, extensible toolkit for evaluating Automatic Speech Recognition (ASR) models. It provides a unified Python API and a Command-Line Interface (CLI) to evaluate HuggingFace, NeMo, and vLLM-backed models across various datasets.
- Unified Provider System: Evaluate `transformers`, NVIDIA NeMo, or vLLM models seamlessly.
- Dataset Support: Native support for HuggingFace `datasets` (with streaming) and local NeMo manifests.
- Extensible: Effortlessly add new models, metrics, or normalisers via clean abstract base classes.
uv is a fast Python package manager. Install it with:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

```bash
git clone https://github.com/knoveleng/asr-evalkit.git
cd asr-evalkit

# Create a virtual environment named 'asr'
uv venv asr --python 3.12

# Activate it
source asr/bin/activate

# Install the package in editable mode (reads from requirements.txt via setup.py)
uv pip install -e .

# Optional — install NeMo and dev tools
uv pip install -e ".[nemo,dev]"
```

Note: For vLLM-backed models (`seallm_audio`, `qwen3_asr`), install vLLM separately:

```bash
uv pip install vllm
```
The easiest way to run evaluations is via the `asr-evalkit` command.
```bash
asr-evalkit \
  --evaluator whisper \
  --model openai/whisper-large-v3-turbo \
  --dataset openslr/librispeech_asr \
  --dataset-config clean \
  --dataset-split test \
  --streaming \
  --audio-column audio \
  --text-column text \
  --output-file results.json
```

```bash
asr-evalkit \
  --evaluator canary \
  --model nvidia/canary-1b \
  --dataset openslr/librispeech_asr \
  --dataset-config clean \
  --dataset-split test \
  --streaming \
  --audio-column audio_filepath \
  --text-column text \
  --output-file canary_results.json
```

The `scripts/` directory contains ready-to-use examples you can copy and adapt:
| Script | Description |
|---|---|
| `scripts/evaluate_benchmarks.sh` | Batch benchmark runner — evaluates one or more models across a configurable list of HuggingFace datasets in a single run. Edit the `MODELS` and `BENCHMARKS` arrays at the top to select which models and datasets to run. |
| `scripts/run_cli.sh` | CLI examples — commented-out one-off `asr-evalkit` invocations for Whisper, MERaLiON, Qwen3-ASR (vLLM), Parakeet (NeMo), and more. |
| `scripts/run.py` | Python API example — shows all three provider types (HuggingFace, vLLM, NeMo) and both dataset loaders side-by-side with comments. |
Running the batch benchmark script:
```bash
# Run with defaults (10 samples per dataset, results → results/benchmarks/)
bash scripts/evaluate_benchmarks.sh

# Override sample count and output directory
MAX_SAMPLES=100 OUTPUT_DIR=results/full bash scripts/evaluate_benchmarks.sh
```

Tip: `evaluate_benchmarks.sh` is designed to be your personal run script — it is listed in `.gitignore` so you can freely customise it (add models, change datasets, tweak flags) without affecting the repo.
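For orientation, the array-driven pattern described above can be sketched as below. This is a hypothetical dry-run sketch, not the actual script: the real variable names may differ, and the `dataset:config` encoding, the `--max-samples` flag, and the fixed `whisper` evaluator are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an array-driven benchmark loop (dry run via echo).
MODELS=(
  "openai/whisper-large-v3-turbo"
)
# "dataset:config" pairs, split apart below (assumed encoding)
BENCHMARKS=(
  "openslr/librispeech_asr:clean"
)
MAX_SAMPLES="${MAX_SAMPLES:-10}"
OUTPUT_DIR="${OUTPUT_DIR:-results/benchmarks}"

for model in "${MODELS[@]}"; do
  for bench in "${BENCHMARKS[@]}"; do
    dataset="${bench%%:*}"   # text before the first ':'
    config="${bench##*:}"    # text after the last ':'
    # echo instead of executing, so the sketch is a dry run;
    # a real script would also map each model to its evaluator
    echo "asr-evalkit --evaluator whisper --model ${model} --dataset ${dataset}" \
         "--dataset-config ${config} --max-samples ${MAX_SAMPLES}" \
         "--output-file ${OUTPUT_DIR}/$(basename "${model}").json"
  done
done
```

Overriding `MAX_SAMPLES` or `OUTPUT_DIR` at invocation time then works exactly as shown in the commands above.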
The Python API offers the exact same capabilities as the CLI but allows integration into your own scripts.
```python
from asr_evalkit.evaluators.providers import HuggingFaceProvider
from asr_evalkit.config import ModelConfig, NormalizerConfig
from asr_evalkit.datasets import HuggingFaceDataset
from asr_evalkit.runner import Runner

# Configure the provider (handles weights + precision)
provider = HuggingFaceProvider(
    model_id="openai/whisper-large-v3-turbo",
    device="cuda",
    use_fp16=True
)

# Set up the runner
model_cfg = ModelConfig(evaluator="whisper", provider=provider)
runner = Runner(model=model_cfg)

# Configure the dataset
hf_dataset = HuggingFaceDataset(
    "openslr/librispeech_asr",
    config="clean",
    split="test",
    audio_column="audio",
    text_column="text",
    streaming=True,
    max_samples=10
)

# Run the evaluation
# Note: default behaviour = lowercase + Unicode NFC + remove punctuation + remove extra whitespace
results = runner.run(dataset=hf_dataset, normalizer=NormalizerConfig())
print(f"WER: {results['wer']*100:.2f}%")
```
```python
from asr_evalkit.evaluators.providers import NeMoProvider
from asr_evalkit.config import ModelConfig
from asr_evalkit.datasets import NeMoDataset
from asr_evalkit.runner import Runner

# Use the NeMo provider instead
provider = NeMoProvider(model_id="nvidia/parakeet-tdt-0.6b-v3", device="cuda")
runner = Runner(model=ModelConfig(evaluator="parakeet", provider=provider))

# Load a local NeMo dataset manifest
nemo_dataset = NeMoDataset(
    "path/to/test_manifest.jsonl",
    audio_column="audio_filepath",
    text_column="text",
    max_samples=10
)

results = runner.run(dataset=nemo_dataset)
```

To see a list of all currently supported models natively built into the package, run:
```bash
asr-evalkit --list-evaluators
```

| Evaluator Name | Supported Provider | Target Models |
|---|---|---|
| `whisper` | `HuggingFaceProvider` | `openai/whisper-*` |
| `canary` | `NeMoProvider` | `nvidia/canary-*` |
| `parakeet` | `NeMoProvider` | `nvidia/parakeet-*` |
| `qwen_omni` | `HuggingFaceProvider` | `Qwen/Qwen2.5-Omni-*` |
| `qwen3_asr` | `VLLMProvider` | `Qwen/Qwen3-ASR-*` |
| `meralion` | `HuggingFaceProvider` | `MERaLiON/MERaLiON-*` |
| `phi4` | `HuggingFaceProvider` | `microsoft/Phi-4-multimodal-instruct` |
| `seallm_audio` | `VLLMProvider` | `SeaLLMs/SeaLLMs-Audio-*` |
| Model | aishell1_mandarin (CER) | aishell3_mandarin (CER) | cv_mandarin (CER) | cv_tamil (WER) | fleurs_malay (WER) | fleurs_mandarin (CER) | fleurs_tamil (WER) | librispeech_english (WER) | mesolitica_malaysian (WER) | nsc_singlish (WER) | slr127_tamil (WER) | slr65_tamil (WER) | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| whisper-large-v3-turbo | 9.64 | 16.81 | 17.91 | 74.50 | 8.88 | 10.63 | 66.90 | 3.04 | 28.47 | 32.02 | 69.56 | 58.13 | 33.04 |
| SeaLLMs-Audio-7B | 9.65 | 9.76 | 8.68 | 126.70 | 26.25 | 37.09 | 105.31 | 94.74 | 71.34 | 9.53 | 138.65 | 127.24 | 63.75 |
| Qwen2.5-Omni-3B | 28.25 | 44.55 | 46.36 | 318.36 | 74.69 | 54.74 | 311.67 | 29.21 | 211.40 | 34.79 | 448.82 | 465.58 | 172.37 |
| Qwen2.5-Omni-7B | 7.33 | 22.58 | 14.49 | 252.06 | 43.92 | 16.68 | 326.43 | 13.80 | 158.06 | 22.96 | 303.96 | 239.15 | 118.45 |
| Qwen3-ASR-0.6B | 2.08 | 2.59 | 10.06 | 121.10 | 18.71 | 9.75 | 130.09 | 2.74 | 47.29 | 7.64 | 129.12 | 127.00 | 50.68 |
| Qwen3-ASR-1.7B | 1.52 | 2.08 | 7.50 | 139.96 | 10.87 | 9.33 | 147.23 | 2.31 | 39.00 | 6.22 | 144.49 | 134.63 | 53.76 |
| MERaLiON-2-10B-ASR | 3.09 | 4.07 | 8.83 | 31.78 | 8.55 | 11.99 | 28.68 | 2.54 | 25.90 | 4.62 | 22.42 | 19.29 | 14.31 |
| polyglot-lion-0.6b | 1.93 | 2.32 | 6.16 | 42.16 | 14.45 | 9.19 | 37.68 | 2.67 | 24.33 | 6.09 | 28.14 | 23.07 | 16.52 |
| polyglot-lion-1.7b | 1.45 | 1.86 | 4.91 | 39.19 | 9.98 | 8.00 | 37.28 | 2.10 | 21.51 | 5.28 | 26.83 | 19.75 | 14.85 |
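The comparisons above hinge on consistent text normalisation before scoring. As a minimal, self-contained sketch (not the package's actual implementation), the default behaviour noted in the API example (lowercase + Unicode NFC + punctuation removal + whitespace collapse) together with a plain word-level edit-distance WER might look like:

```python
import re
import string
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, NFC-normalise, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit distances for ref[:0]
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution / match
                            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

ref = normalize("Hello, world!")   # -> "hello world"
hyp = normalize("hello word")
print(f"WER: {wer(ref, hyp)*100:.2f}%")  # one substitution over two words -> 50.00%
```

Because insertions also count as errors, WER/CER can exceed 100%, which explains the values above 100 in some rows of the table.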
ASR EvalKit is structured around a strict provider/evaluator separation to maintain clean code and easy API mapping.
If you need to customise the toolkit, please refer to the detailed guides in the docs/ folder:
- Architecture Overview — Understand how Providers, Evaluators, and Runners interact.
- Adding a Custom Evaluator — Step-by-step guide to adding support for a new model architecture.
- Adding a Custom Provider — How to support a new backend (e.g. ONNX Runtime, TensorRT-LLM).
- Dataset Configurations — Details on HF streaming vs. NeMo manifests.
- Metrics & Normalisers — How to write custom scoring metrics or ground-truth normalisers.
- CLI Reference — Full command-line options.
- Troubleshooting — Common runtime errors and fixes (vLLM, CUDA, torch, peft).
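As a quick reference alongside the Dataset Configurations guide: a NeMo-style manifest is a JSON-lines file with one utterance per line. The snippet below is a hypothetical example using the field names from the API snippets above (`audio_filepath`, `text`); the `duration` field (in seconds) is conventionally included as well, and the file paths are placeholders.

```python
import json

# Build a tiny NeMo-style JSONL manifest: one JSON object per line.
rows = [
    {"audio_filepath": "audio/utt_0001.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "audio/utt_0002.wav", "duration": 1.8, "text": "good morning"},
]

with open("test_manifest.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

A file like this can then be passed to `NeMoDataset` as shown in the Python API example.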
`AssertionError: duplicate template name` when running outside a virtual environment

If you see this error when using vLLM-backed models (e.g. `seallm_audio`, `qwen3_asr`) with the system Python, your torch installation has stale kernel files from a previous upgrade. See Troubleshooting → duplicate template name for the one-line fix.
If you use ASR EvalKit in your research, please cite:
```bibtex
@software{dang2026asrevalkit,
  author = {Quy-Anh Dang},
  title  = {ASR EvalKit: A Modular Toolkit for Evaluating Automatic Speech Recognition Models},
  year   = {2026},
  url    = {https://github.com/knoveleng/asr-evalkit},
}
```