Where Does Redundancy Exist, and Why Evaluate It?
The rapid growth of MLLM benchmarks has inevitably led to significant redundancy among them. It is therefore worth stepping back to critically assess the current state of redundancy and to propose targeted principles for constructing effective MLLM benchmarks. Specifically, we examine redundancy from three perspectives: 1) redundancy among a benchmark's capability dimensions, 2) redundancy in the number of test instances, and 3) cross-benchmark redundancy within specific domains.
We recommend performing redundancy detection after a benchmark has been designed and initially tested on several MLLMs. This step makes the evaluation results both more scientifically reliable and more efficient to obtain.
1) Dimensions Redundancy Check
Compute the benchmark's dimension redundancy, paying particular attention to dimensions with high overall redundancy. Then inspect the redundancy heatmap to identify pairs of dimensions with exceptionally high redundancy and assess whether they measure similar capabilities.
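As a concrete (if simplified) illustration of this check, dimension redundancy can be measured as the pairwise Spearman rank correlation (SRCC) of model scores between dimensions; the resulting symmetric matrix is exactly the data a redundancy heatmap visualizes. The snippet below is a self-contained sketch, not the paper's released code, and the function names are our own:

```python
def spearman(a, b):
    """Spearman rank correlation, with average ranks for ties."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0.0] * len(xs)
        i = 0
        while i < len(xs):
            j = i
            while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2.0 + 1.0  # average rank for the tie group
            i = j + 1
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def dimension_redundancy(scores):
    """scores[m][d] is model m's score on dimension d.
    Returns the symmetric SRCC matrix between dimensions,
    i.e. the data behind the redundancy heatmap."""
    dims = list(zip(*scores))  # one tuple of per-model scores per dimension
    n = len(dims)
    heat = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            heat[i][j] = heat[j][i] = spearman(dims[i], dims[j])
    return heat
```

Off-diagonal entries near 1 flag dimension pairs that rank models nearly identically and are therefore candidates for merging.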
2) Instances Redundancy Check
Compute the benchmark's instance redundancy curve and assess whether a small subset of instances yields results similar to those of the full set. If significant instance redundancy is identified, review and prune the redundant instances.
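A minimal sketch of the instance redundancy curve, assuming a binary correctness matrix (1 if a model answers an instance correctly): for each subset size, repeatedly sample instances and measure the SRCC between the subset-induced model ranking and the full-benchmark ranking. The function name and sampling scheme are illustrative, not taken from the paper's code:

```python
import random

def spearman(a, b):
    """Spearman rank correlation, with average ranks for ties."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0.0] * len(xs)
        i = 0
        while i < len(xs):
            j = i
            while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2.0 + 1.0
            i = j + 1
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def instance_redundancy_curve(correct, sizes, trials=20, seed=0):
    """correct[m][q] is 1 if model m answers instance q correctly, else 0.
    For each subset size, estimate the SRCC between the model ranking
    induced by a random instance subset and the full-benchmark ranking;
    values near 1 at small sizes indicate heavy instance redundancy."""
    rng = random.Random(seed)
    n_inst = len(correct[0])
    full = [sum(row) / n_inst for row in correct]
    curve = []
    for size in sizes:
        agree = []
        for _ in range(trials):
            picked = rng.sample(range(n_inst), size)  # one sample shared by all models
            sub = [sum(row[q] for q in picked) for row in correct]
            agree.append(spearman(sub, full))
        curve.append(sum(agree) / trials)
    return curve
```

If the curve saturates near 1 well before the full benchmark size, the remaining instances add little ranking information.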
3) Cross-benchmark Redundancy Check
If you aim to design a benchmark as a representative for a specific vertical domain, calculate the cross-benchmark redundancy within that domain. Higher redundancy indicates stronger representativeness. On the other hand, if your goal is to identify gaps within a vertical domain, it is better to keep redundancy low to ensure broader coverage.
If you want to test the core capabilities of a vertical domain under limited resources, it is recommended to select the benchmark with the highest cross-benchmark redundancy within the domain.
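The selection rule above can be sketched as follows: score each benchmark in the domain by the mean SRCC between its model ranking and every other benchmark's ranking, then pick the maximizer. This is an illustrative sketch with our own names; it assumes untied overall scores so the classic SRCC formula applies:

```python
def srcc(a, b):
    """SRCC via the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula.
    Assumes no tied scores, which usually holds for overall accuracies."""
    def rank(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0] * len(xs)
        for pos, idx in enumerate(order):
            r[idx] = pos + 1
        return r
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def most_representative(bench_scores):
    """bench_scores: {benchmark_name: [overall score per model]}.
    Returns the benchmark whose model ranking has the highest mean
    SRCC with every other benchmark in the domain."""
    names = list(bench_scores)
    def mean_srcc(name):
        others = [n for n in names if n != name]
        return sum(srcc(bench_scores[name], bench_scores[n]) for n in others) / len(others)
    return max(names, key=mean_srcc)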
- These figures allow for a quick analysis of which dimensions exhibit high correlations.
- Across all dimensions, Bottom-50 redundancy is significantly higher than Top-50 redundancy.
Figure 13: Instance redundancy with Top-50 MLLMs.
Figure 14: Instance redundancy with Bottom-50 MLLMs.
- The majority of current MLLM benchmarks exhibit significant instance redundancy when ranking both Top-50 and Bottom-50 MLLMs, with at least 50% of instances being redundant.
- The R² score can be interpreted as the goodness of fit when predicting the final performance of MLLMs from sampled instances. Compared with merely preserving ranking accuracy, accurately predicting absolute performance typically requires significantly more instances.
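For concreteness, the R² mentioned above can be read as the ordinary coefficient of determination between full-benchmark scores and the scores predicted from a sampled subset. A minimal helper, with illustrative naming rather than the paper's code:

```python
def r2_score(true, pred):
    """Coefficient of determination: how well subset-predicted scores
    (pred) explain the full-benchmark scores (true).  1.0 means the
    subset reproduces absolute performance perfectly."""
    mean = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot
```

Because R² penalizes absolute deviations rather than rank swaps, it typically saturates at larger subset sizes than SRCC does.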
We provide a sample script, benchmark_dimensions_redundancy.py, for quickly running the benchmark dimension redundancy check on MLLM results from OpenCompass.
Use the following command to run the Top-50 redundancy check for MMBench:
python benchmark_dimensions_redundancy.py --input_file opencompass_vlm.json --bench MMBench_TEST_EN_V11 --save_folder results --top_k 50 --vmin -0.1 --vmax 1.0
Use the following command to run the Bottom-50 redundancy check for MMBench:
python benchmark_dimensions_redundancy.py --input_file opencompass_vlm.json --bench MMBench_TEST_EN_V11 --save_folder results --top_k 50 --vmin -0.1 --vmax 1.0
Please contact any of the first authors of this paper for queries.
- Zicheng Zhang,
zzc1998@sjtu.edu.cn, @zzc-1998
If you find our work useful, please cite our paper as:
@misc{zhang2025redundancyprinciplesmllmsbenchmarks,
      title={Redundancy Principles for MLLMs Benchmarks},
      author={Zicheng Zhang and Xiangyu Zhao and Xinyu Fang and Chunyi Li and Xiaohong Liu and Xiongkuo Min and Haodong Duan and Kai Chen and Guangtao Zhai},
      year={2025},
      eprint={2501.13953},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.13953},
}