aiben-ch/Benchmark-Redundancy

Redundancy Principles for MLLMs Benchmarks

Where Does Redundancy Exist, and Why Evaluate It?

¹Shanghai AI Lab, ²Shanghai Jiao Tong University, ³Zhejiang University
*Equal contribution. #Corresponding authors.
Arxiv | Github

The rapid growth of MLLM benchmarks has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. Specifically, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains.

Redundancy Framework

Dimensions Redundancy

$$\rho(X_i) = \frac{1}{m-1} \sum_{\substack{j=1 \\ j \neq i}}^m \text{CORR}(R_i, R_j),$$

where $$\text{CORR}(R_i, R_j)$$ is the correlation coefficient between the rankings $$R_i$$ and $$R_j$$, the performance rankings of MLLMs on the i-th and j-th dimensions of the benchmark, and $$m$$ is the number of dimensions.
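The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's script: it assumes SRCC as the correlation, and the dimension names and scores in `toy` are hypothetical.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant input (a convention chosen here)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / norm if norm else 0.0

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def dimension_redundancy(scores):
    """scores maps each dimension to per-model scores (same model order).

    Returns rho(X_i): the mean correlation of dimension i with the other m-1 dimensions.
    """
    names = list(scores)
    return {
        i: sum(srcc(scores[i], scores[j]) for j in names if j != i) / (len(names) - 1)
        for i in names
    }

# Hypothetical scores of four MLLMs on three dimensions.
toy = {
    "perception": [0.9, 0.7, 0.5, 0.3],
    "reasoning": [0.8, 0.6, 0.4, 0.2],
    "ocr": [0.2, 0.4, 0.6, 0.8],
}
rho = dimension_redundancy(toy)
print(rho)
```

In this toy case "perception" and "reasoning" rank the models identically, while "ocr" ranks them in reverse, so "ocr" receives the lowest redundancy.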


Instances Redundancy

$$\rho(A\%) = \text{CORR}(R_{A\%}, R_{\text{GT}})$$,

where $$R_{A\%}$$ is the MLLM ranking based on the sampled $$A\%$$ instances, and $$R_{\text{GT}}$$ is the MLLM ranking based on all instances within the MLLM benchmark.
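As a rough sketch of the definition above, the snippet below samples a fraction of the instances, ranks the models on that subset, and correlates the result with the full-benchmark ranking. It assumes SRCC as the correlation and per-instance 0/1 correctness as input; the model names, the data in `toy`, and the `seed` parameter are hypothetical.

```python
import random
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant input (a convention chosen here)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / norm if norm else 0.0

def srcc(x, y):
    return pearson(ranks(x), ranks(y))

def instance_redundancy(correct, fraction, seed=0):
    """correct maps each model to a list of 0/1 results, one per instance.

    Returns CORR(R_A%, R_GT) for one random sample of `fraction` of the instances.
    """
    models = list(correct)
    n = len(correct[models[0]])
    k = max(1, round(n * fraction))
    picked = random.Random(seed).sample(range(n), k)
    full = [sum(correct[m]) / n for m in models]
    sampled = [sum(correct[m][i] for i in picked) / k for m in models]
    return srcc(sampled, full)

# Hypothetical results of three MLLMs on a 20-instance benchmark.
toy = {
    "model_a": [1] * 18 + [0] * 2,
    "model_b": [1] * 12 + [0] * 8,
    "model_c": [1] * 6 + [0] * 14,
}
for fraction in (0.25, 0.5, 1.0):
    print(fraction, instance_redundancy(toy, fraction))
```

Sweeping `fraction` and averaging over several seeds gives one way to trace an instance redundancy curve.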

Cross-Benchmark Redundancy

$$\rho(Y_i) = \frac{1}{l-1} \sum_{\substack{j=1 \\ j \neq i}}^l \text{CORR}(K_i, K_j), $$

where $$\text{CORR}(K_i, K_j)$$ is the correlation coefficient between the rankings $$K_i$$ and $$K_j$$, the performance rankings of MLLMs on the i-th and j-th benchmarks of the specific domain, and $$l$$ is the number of benchmarks in that domain.

Redundancy Principles Recommendations

We recommend performing redundancy detection on the benchmark after it is designed and initially tested on some MLLMs. This step ensures that the evaluation results are both more scientifically reliable and more efficient.

1) Dimensions Redundancy Check

Calculate the dimensional redundancy of the benchmark, placing particular emphasis on dimensions with overall high redundancy. Additionally, analyze the redundancy heatmap to identify pairs of dimensions with exceptionally high redundancy and evaluate whether these dimensions assess similar capabilities.
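The pairwise step of this check can be sketched as a simple scan over an already computed correlation matrix. The snippet is illustrative only: the 0.8 threshold, the dimension names, and the correlation values are assumptions, not values from the paper.

```python
from itertools import combinations

def redundant_pairs(corr, threshold=0.8):
    """corr is a symmetric nested dict: corr[a][b] is the correlation of dimensions a and b.

    Returns (a, b, value) for every pair at or above the threshold, highest first.
    """
    pairs = [
        (a, b, corr[a][b])
        for a, b in combinations(list(corr), 2)
        if corr[a][b] >= threshold
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

# Hypothetical SRCC values between three dimensions.
toy_corr = {
    "coarse_perception": {"coarse_perception": 1.0, "fine_perception": 0.93, "logic_reasoning": 0.41},
    "fine_perception": {"coarse_perception": 0.93, "fine_perception": 1.0, "logic_reasoning": 0.38},
    "logic_reasoning": {"coarse_perception": 0.41, "fine_perception": 0.38, "logic_reasoning": 1.0},
}
print(redundant_pairs(toy_corr))
```

Pairs flagged this way are candidates for manual review; a high correlation alone does not prove that two dimensions assess the same capability.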

2) Instances Redundancy Check

Compute the instance redundancy curve of the benchmark and assess whether a limited subset of the instances can yield results similar to those of the full instances. If significant instance redundancy is identified, it is essential to review and reduce the redundant instances.

3) Cross-benchmark Redundancy Check

If you aim to design a benchmark as a representative for a specific vertical domain, calculate the cross-benchmark redundancy within that domain. Higher redundancy indicates stronger representativeness. On the other hand, if your goal is to identify gaps within a vertical domain, it is better to keep redundancy low to ensure broader coverage.

If you want to test the core capabilities of a vertical domain under limited resources, it is recommended to select the benchmark with the highest cross-benchmark redundancy within the domain.
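The selection described above might look like the following sketch, which computes the cross-benchmark redundancy of each benchmark in a domain and picks the highest. It assumes SRCC as the correlation; the benchmark names and scores are hypothetical.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant input (a convention chosen here)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / norm if norm else 0.0

def srcc(x, y):
    return pearson(ranks(x), ranks(y))

def cross_benchmark_redundancy(scores):
    """scores maps each benchmark to per-model scores (same model order).

    Returns rho(Y_i): the mean correlation of benchmark i with the other l-1 benchmarks.
    """
    names = list(scores)
    return {
        i: sum(srcc(scores[i], scores[j]) for j in names if j != i) / (len(names) - 1)
        for i in names
    }

# Hypothetical scores of four MLLMs on three benchmarks from one domain.
domain = {
    "bench_a": [70, 60, 50, 40],
    "bench_b": [68, 50, 61, 45],
    "bench_c": [55, 65, 45, 50],
}
rho = cross_benchmark_redundancy(domain)
most_representative = max(rho, key=rho.get)
print(rho, most_representative)
```

Here "bench_a" agrees most with the other two benchmarks, so it would be the pick when only one benchmark can be run.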

Redundancy Results

1-A Dimensions Redundancy Heatmaps on MMBench

Figure 1: Top-50 SRCC dimensions redundancy map

Figure 2: Bottom-50 SRCC dimensions redundancy map

Figure 3: Top-50 PLCC dimensions redundancy map

Figure 4: Bottom-50 PLCC dimensions redundancy map

Figure 5: Top-50 R2 dimensions redundancy map

Figure 6: Bottom-50 R2 dimensions redundancy map

1-B Dimensions Redundancy Bar Plots on MMBench

Figure 7: Top-50 SRCC redundancy

Figure 8: Top-50 PLCC redundancy

Figure 9: Top-50 R2 redundancy

Figure 10: Bottom-50 SRCC redundancy

Figure 11: Bottom-50 PLCC redundancy

Figure 12: Bottom-50 R2 redundancy

  1. These figures allow for a quick analysis of which dimensions exhibit high correlations.
  2. Dimensions exhibit significantly higher redundancy when ranking the Bottom-50 MLLMs than when ranking the Top-50 MLLMs.

2 Instances Redundancy Curves

Figure 13: Instances redundancy with Top-50 MLLMs.

Figure 14: Instances redundancy with Bottom-50 MLLMs.

  1. Most current MLLM benchmarks contain significant instance redundancy when ranking both Top-50 and Bottom-50 MLLMs, with at least 50% of instances being redundant.
  2. The R² score can be read as how well the scores on the sampled instances predict the final performance of MLLMs on all instances. Predicting the absolute performance of MLLMs accurately typically requires significantly more instances than preserving the ranking.
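The difference between ranking accuracy and absolute-score accuracy can be seen in a small example. The sketch below assumes the standard coefficient of determination, treating the sampled-subset scores as predictions of the full-benchmark scores; whether the repository computes R² exactly this way is an assumption, and the scores are hypothetical.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant input (a convention chosen here)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / norm if norm else 0.0

def srcc(x, y):
    return pearson(ranks(x), ranks(y))

def r2_score(full, sampled):
    """1 - SS_res / SS_tot, with the sampled scores used as predictions of the full scores."""
    mean_full = sum(full) / len(full)
    ss_res = sum((f - s) ** 2 for f, s in zip(full, sampled))
    ss_tot = sum((f - mean_full) ** 2 for f in full)
    return 1 - ss_res / ss_tot

# Hypothetical accuracies of four MLLMs: the subset underestimates every model by 0.1.
full_scores = [0.9, 0.7, 0.5, 0.3]
sampled_scores = [0.8, 0.6, 0.4, 0.2]

rank_agreement = srcc(full_scores, sampled_scores)
fit = r2_score(full_scores, sampled_scores)
print(rank_agreement, fit)
```

The subset preserves the ranking perfectly, yet its R² is only about 0.8 because every absolute score is off, which matches the observation that R² needs more instances to saturate.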

How to Calculate Benchmark Redundancy

We provide a sample script, benchmark_dimensions_redundancy.py, that runs the benchmark dimensions redundancy check on the MLLM results from OpenCompass.

Use the following command to run the Top-50 redundancy check for MMBench:

python benchmark_dimensions_redundancy.py --input_file opencompass_vlm.json --bench MMBench_TEST_EN_V11 --save_folder results --top_k 50 --vmin -0.1 --vmax 1.0

Use the following command to run the Bottom-50 redundancy check for MMBench:

python benchmark_dimensions_redundancy.py --input_file opencompass_vlm.json --bench MMBench_TEST_EN_V11 --save_folder results --top_k 50 --vmin -0.1 --vmax 1.0

Contact

Please contact any of the first authors of this paper for queries.

  • Zicheng Zhang, zzc1998@sjtu.edu.cn, @zzc-1998

Citation

If you find our work useful, please cite our paper as

@misc{zhang2025redundancyprinciplesmllmsbenchmarks,
      title={Redundancy Principles for MLLMs Benchmarks}, 
      author={Zicheng Zhang and Xiangyu Zhao and Xinyu Fang and Chunyi Li and Xiaohong Liu and Xiongkuo Min and Haodong Duan and Kai Chen and Guangtao Zhai},
      year={2025},
      eprint={2501.13953},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.13953}, 
}

About

[ACL 2025] Redundancy Principles for MLLMs Benchmarks
