feat: add MMBench static evaluation mode (no OpenAI API needed) #1276

Open
Luodian wants to merge 1 commit into main from feat/mmbench-static-eval

Luodian commented Mar 26, 2026

Summary

Add a static evaluation variant for MMBench EN Dev that uses regex-based MCQ answer extraction instead of GPT API calls, enabling fully offline evaluation without API costs.

Changes

  • New mmbench_en_dev_static task config
  • New mmbench_aggregate_dev_results_static function in en_utils.py
  • Extended eval_sub_data and eval_result in mmbench_evals.py to support static eval method
  • New shared _task_utils/mcq_extract.py utility for robust answer extraction across 10+ answer formats
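
To illustrate what a static aggregation step might look like, here is a minimal sketch. It is not the code from this PR: `static_accuracy` and its row layout are hypothetical, and it assumes that a question counts as correct only when every answered variant of it (for example, each option rotation) matches the ground truth.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical row layout: (question_id, predicted_letter, gold_letter).
# predicted_letter is None when no option letter could be extracted.
Row = Tuple[str, Optional[str], str]


def static_accuracy(rows: Iterable[Row]) -> float:
    """Return the fraction of questions whose variants were all answered correctly."""
    # Each question starts as correct and is flipped by any wrong variant.
    per_question: Dict[str, bool] = defaultdict(lambda: True)
    for question_id, predicted, gold in rows:
        per_question[question_id] = per_question[question_id] and predicted == gold
    if not per_question:
        return 0.0
    return sum(per_question.values()) / len(per_question)


rows = [
    ("q1", "A", "A"),
    ("q1", "B", "B"),   # second variant of q1, also correct
    ("q2", "C", "D"),   # wrong letter
    ("q3", None, "A"),  # extraction failed, counted as wrong
]
print(static_accuracy(rows))
```

Treating a failed extraction as a wrong answer keeps the score conservative, which matters when comparing against the GPT-based path.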

Usage

# Static eval (no API key needed)
lmms-eval --tasks mmbench_en_dev_static --model <model> ...

# Original GPT-based eval still works
lmms-eval --tasks mmbench_en_dev --model <model> ...

Note

mcq_extract.py is also included in #1272 (physics benchmarks PR). Both PRs can merge independently — the file is identical.

Test plan

  • Run mmbench_en_dev_static and compare accuracy with GPT-based eval
  • Verify original mmbench_en_dev task still works unchanged
  • Test MCQ extraction with various answer formats (A, (B), C., "the answer is D")
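
The answer formats listed in the last item can be exercised with a small extractor such as the sketch below. This is an illustration only, not the contents of `mcq_extract.py`: the function name `extract_mcq_answer`, the pattern list, and the restriction to options A through D are all assumptions.

```python
import re
from typing import Optional

# Hypothetical patterns, tried in order from most to least explicit.
_PATTERNS = [
    # "the answer is D", "Answer: (B)"; \b rejects words such as "apple"
    re.compile(r"answer\s*(?:is|:)\s*\(?([A-D])\b\)?", re.IGNORECASE),
    # "(B)" at the start of the response
    re.compile(r"^\s*\(([A-D])\)"),
    # "C." or "C:" or "C)" at the start of the response
    re.compile(r"^\s*([A-D])\s*[.:)]"),
    # a bare letter and nothing else
    re.compile(r"^\s*([A-D])\s*$"),
]


def extract_mcq_answer(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1).upper()
    return None


for text in ["A", "(B)", "C.", "the answer is D", "I am not sure"]:
    print(repr(text), "->", extract_mcq_answer(text))
```

Ordering the patterns from explicit to bare reduces false positives, for instance a response that begins with the article "A" is not read as option A.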

Add a static evaluation variant for MMBench EN Dev that uses regex/substring
MCQ extraction instead of GPT API calls. This enables offline evaluation
without API costs.

Changes:
- New `mmbench_en_dev_static` task config
- New `mmbench_aggregate_dev_results_static` aggregation function
- Extended `eval_sub_data` and `eval_result` to support `static` eval method
- Includes shared `mcq_extract.py` utility for robust answer extraction