Where Does Redundancy Exist, and Why Evaluate It?
The rapid growth of MLLM benchmarks has inevitably led to significant redundancy among them. It is therefore worth stepping back to critically assess the current state of redundancy and to propose targeted principles for constructing effective MLLM benchmarks. Specifically, we examine redundancy from three perspectives: 1) redundancy among a benchmark's capability dimensions, 2) redundancy in the number of test instances, and 3) cross-benchmark redundancy within specific domains.
We recommend performing redundancy detection after a benchmark has been designed and initially tested on several MLLMs. This step makes the evaluation results both more scientifically reliable and more efficient to obtain.
1) Dimensions Redundancy Check
Compute the benchmark's dimension redundancy, paying particular attention to dimensions with high overall redundancy. Then inspect the redundancy heatmap to identify pairs of dimensions with exceptionally high redundancy and assess whether they measure similar capabilities.
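As a concrete (if simplified) illustration of this check, dimension redundancy can be measured as the pairwise Spearman rank correlation (SRCC) of model scores between dimensions; the resulting symmetric matrix is exactly the data a redundancy heatmap visualizes. The snippet below is a self-contained sketch, not the paper's released code, and the function names are our own:

```python
def spearman(a, b):
    """Spearman rank correlation, with average ranks for ties."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0.0] * len(xs)
        i = 0
        while i < len(xs):
            j = i
            while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2.0 + 1.0  # average rank for the tie group
            i = j + 1
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def dimension_redundancy(scores):
    """scores[m][d] is model m's score on dimension d.
    Returns the symmetric SRCC matrix between dimensions,
    i.e. the data behind the redundancy heatmap."""
    dims = list(zip(*scores))  # one tuple of per-model scores per dimension
    n = len(dims)
    heat = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            heat[i][j] = heat[j][i] = spearman(dims[i], dims[j])
    return heat
```

Off-diagonal entries near 1 flag dimension pairs that rank models nearly identically and are therefore candidates for merging.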
2) Instances Redundancy Check
Compute the benchmark's instance redundancy curve and assess whether a small subset of instances yields results similar to those of the full set. If significant instance redundancy is identified, review and prune the redundant instances.
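A minimal sketch of the instance redundancy curve, assuming a binary correctness matrix (1 if a model answers an instance correctly): for each subset size, repeatedly sample instances and measure the SRCC between the subset-induced model ranking and the full-benchmark ranking. The function name and sampling scheme are illustrative, not taken from the paper's code:

```python
import random

def spearman(a, b):
    """Spearman rank correlation, with average ranks for ties."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0.0] * len(xs)
        i = 0
        while i < len(xs):
            j = i
            while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2.0 + 1.0
            i = j + 1
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def instance_redundancy_curve(correct, sizes, trials=20, seed=0):
    """correct[m][q] is 1 if model m answers instance q correctly, else 0.
    For each subset size, estimate the SRCC between the model ranking
    induced by a random instance subset and the full-benchmark ranking;
    values near 1 at small sizes indicate heavy instance redundancy."""
    rng = random.Random(seed)
    n_inst = len(correct[0])
    full = [sum(row) / n_inst for row in correct]
    curve = []
    for size in sizes:
        agree = []
        for _ in range(trials):
            picked = rng.sample(range(n_inst), size)  # one sample shared by all models
            sub = [sum(row[q] for q in picked) for row in correct]
            agree.append(spearman(sub, full))
        curve.append(sum(agree) / trials)
    return curve
```

If the curve saturates near 1 well before the full benchmark size, the remaining instances add little ranking information.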
3) Cross-benchmark Redundancy Check
If you aim to design a benchmark as a representative for a specific vertical domain, calculate the cross-benchmark redundancy within that domain. Higher redundancy indicates stronger representativeness. On the other hand, if your goal is to identify gaps within a vertical domain, it is better to keep redundancy low to ensure broader coverage.
If you want to test the core capabilities of a vertical domain under limited resources, it is recommended to select the benchmark with the highest cross-benchmark redundancy within the domain.
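The selection rule above can be sketched as follows: score each benchmark in the domain by the mean SRCC between its model ranking and every other benchmark's ranking, then pick the maximizer. This is an illustrative sketch with our own names; it assumes untied overall scores so the classic SRCC formula applies:

```python
def srcc(a, b):
    """SRCC via the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula.
    Assumes no tied scores, which usually holds for overall accuracies."""
    def rank(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0] * len(xs)
        for pos, idx in enumerate(order):
            r[idx] = pos + 1
        return r
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def most_representative(bench_scores):
    """bench_scores: {benchmark_name: [overall score per model]}.
    Returns the benchmark whose model ranking has the highest mean
    SRCC with every other benchmark in the domain."""
    names = list(bench_scores)
    def mean_srcc(name):
        others = [n for n in names if n != name]
        return sum(srcc(bench_scores[name], bench_scores[n]) for n in others) / len(others)
    return max(names, key=mean_srcc)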
- These figures allow for a quick analysis of which dimensions exhibit high correlations.
- Across all dimensions, Bottom-50 redundancy is significantly higher than Top-50 redundancy.
Figure 13: Instance redundancy with Top-50 MLLMs.
Figure 14: Instance redundancy with Bottom-50 MLLMs.
- The majority of current MLLM benchmarks exhibit significant instance redundancy when ranking both Top-50 and Bottom-50 MLLMs, with at least 50% of instances being redundant.
- The R² score can be interpreted as the goodness of fit when predicting the final performance of MLLMs from sampled instances. Compared with merely preserving ranking accuracy, accurately predicting absolute performance typically requires significantly more instances.
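For concreteness, the R² mentioned above can be read as the ordinary coefficient of determination between full-benchmark scores and the scores predicted from a sampled subset. A minimal helper, with illustrative naming rather than the paper's code:

```python
def r2_score(true, pred):
    """Coefficient of determination: how well subset-predicted scores
    (pred) explain the full-benchmark scores (true).  1.0 means the
    subset reproduces absolute performance perfectly."""
    mean = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot
```

Because R² penalizes absolute deviations rather than rank swaps, it typically saturates at larger subset sizes than SRCC does.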
We provide a sample script, benchmark_dimensions_redundancy.py, for quickly running the benchmark dimension redundancy check on MLLM results from OpenCompass.
Use the following command to run the Top-50 redundancy check for MMBench:
python benchmark_dimensions_redundancy.py --input_file opencompass_vlm.json --bench MMBench_TEST_EN_V11 --save_folder results --top_k 50 --vmin -0.1 --vmax 1.0
Use the following command to run the Bottom-50 redundancy check for MMBench:
python benchmark_dimensions_redundancy.py --input_file opencompass_vlm.json --bench MMBench_TEST_EN_V11 --save_folder results --top_k 50 --vmin -0.1 --vmax 1.0
Please contact any of the first authors of this paper for queries.
- Zicheng Zhang,
zzc1998@sjtu.edu.cn, @zzc-1998
If you find our work useful, please cite our paper as:
@misc{zhang2025redundancyprinciplesmllmsbenchmarks,
      title={Redundancy Principles for MLLMs Benchmarks},
      author={Zicheng Zhang and Xiangyu Zhao and Xinyu Fang and Chunyi Li and Xiaohong Liu and Xiongkuo Min and Haodong Duan and Kai Chen and Guangtao Zhai},
      year={2025},
      eprint={2501.13953},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.13953},
}