feat: add MMBench static evaluation mode (no OpenAI API needed) #1276
Open
Conversation
Add a static evaluation variant for MMBench EN Dev that uses regex/substring MCQ extraction instead of GPT API calls. This enables offline evaluation without API costs.

Changes:
- New `mmbench_en_dev_static` task config
- New `mmbench_aggregate_dev_results_static` aggregation function
- Extended `eval_sub_data` and `eval_result` to support the `static` eval method
- Includes shared `mcq_extract.py` utility for robust answer extraction
Summary
Add a static evaluation variant for MMBench EN Dev that uses regex-based MCQ answer extraction instead of GPT API calls, enabling fully offline evaluation without API costs.
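As a rough sketch of the regex/substring approach described above (the function name, signature, and patterns below are illustrative assumptions, not the actual contents of `mcq_extract.py`):

```python
import re
from typing import Optional

# Hypothetical sketch of regex/substring MCQ extraction -- NOT the PR's
# actual mcq_extract.py. Explicit answer patterns ("Answer: B", "(B) ...",
# a bare letter) are tried first, then a substring match on option text.
def extract_mcq_answer(response: str, options: dict[str, str]) -> Optional[str]:
    letters = "".join(options.keys())  # e.g. "ABCD"
    text = response.strip()

    patterns = [
        rf"answer\s*(?:is|:)?\s*\(?([{letters}])\)?\b",  # "The answer is (B)"
        rf"^\(?([{letters}])\)?[.):\s]",                 # "B. ..." / "(B) ..."
        rf"^\(?([{letters}])\)?$",                       # bare "B"
    ]
    for pat in patterns:
        m = re.search(pat, text, flags=re.IGNORECASE | re.MULTILINE)
        if m:
            return m.group(1).upper()

    # Fall back to matching the option text itself inside the response.
    for letter, option_text in options.items():
        if option_text and option_text.lower() in text.lower():
            return letter
    return None
```

The actual utility reportedly covers 10+ answer formats; a real implementation would also have to handle options quoted inside longer explanations and ambiguous cases where several letters match.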
Changes
- New `mmbench_en_dev_static` task config
- New `mmbench_aggregate_dev_results_static` function in `en_utils.py`
- Extended `eval_sub_data` and `eval_result` in `mmbench_evals.py` to support the `static` eval method
- Shared `_task_utils/mcq_extract.py` utility for robust answer extraction across 10+ answer formats

Usage
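To make the static path concrete, here is a hypothetical, function-level illustration of how an extracted letter would be scored offline (it reuses the `extract_mcq_answer` sketch above; none of these names are the PR's actual API):

```python
# Hypothetical illustration only -- not the PR's actual API. Per the PR
# description, eval_sub_data/eval_result gain a "static" eval method that
# performs this kind of offline matching instead of calling GPT.
options = {
    "A": "a red square",
    "B": "a blue circle",
    "C": "a green star",
    "D": "a yellow triangle",
}
prediction = "I think the answer is (B) a blue circle."
ground_truth = "B"

extracted = extract_mcq_answer(prediction, options)  # -> "B"
print("extracted:", extracted, "correct:", extracted == ground_truth)
```

Presumably, samples where no letter can be extracted are simply scored as incorrect, since this mode has no API fallback to resolve them.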
Note
`mcq_extract.py` is also included in #1272 (physics benchmarks PR). Both PRs can merge independently; the file is identical.

Test plan
- Run `mmbench_en_dev_static` and compare accuracy with the GPT-based eval
- Verify the existing `mmbench_en_dev` task still works unchanged