Different evaluation results between different VLMEvalKit versions #523

Open
terry-for-github opened this issue Oct 16, 2024 · 5 comments

@terry-for-github

The official LLaVA-v1.5-7B model gets 1362 points on MME Perception under the newest code (69b7b5e), but it turns out to be 1497 under commit 027e38c. I made sure the LLaVA code was identical between these two experiments, so why does this happen?
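For context on why the totals can diverge this much: MME scores each subtask as per-question accuracy plus an "accuracy+" term (both questions about an image answered correctly), and the Perception total is the sum over the ten perception subtasks (max 2000). Below is a minimal sketch of that aggregation, assuming predictions are already normalized to yes/no; the function and field names are illustrative, not VLMEvalKit's actual code.

```python
from collections import defaultdict

def mme_subtask_score(records):
    """One MME subtask score = accuracy + accuracy+ (max 200).

    `records`: list of dicts with keys 'image_id', 'prediction', 'answer',
    where prediction/answer are already normalized to 'yes' or 'no'.
    """
    per_image = defaultdict(list)
    for r in records:
        per_image[r["image_id"]].append(r["prediction"] == r["answer"])

    total_q = sum(len(v) for v in per_image.values())
    correct_q = sum(sum(v) for v in per_image.values())
    acc = 100.0 * correct_q / total_q  # per-question accuracy
    acc_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)  # both questions of an image correct
    return acc + acc_plus

def mme_perception_total(per_subtask_records):
    """Perception total = sum over the 10 perception subtasks (max 2000)."""
    return sum(mme_subtask_score(recs) for recs in per_subtask_records.values())
```

So if the two commits map free-form model outputs to yes/no differently, a handful of subtasks can easily shift the total by 100+ points.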

@terry-for-github
Author

I've checked it once again. The problem still exists.

@lyklly

lyklly commented Nov 30, 2024

I also noticed this problem. However, when I checked the code in 69b7b5e, I found that it only modified three parts:

  1. Various log records were changed, e.g. warnings switched to logging
  2. Single quotes were changed to double quotes
  3. Some minor code issues were fixed

Have you solved this problem?

@terry-for-github
Author

> I also noticed this problem. However, when I checked the code in 69b7b5e, I found that it only modified three parts:
>   1. Various log records were changed, e.g. warnings switched to logging
>   2. Single quotes were changed to double quotes
>   3. Some minor code issues were fixed
>   Have you solved this problem?

No. I just use the 027e38c version directly, lol.
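To narrow down which of those changes matters, one option is to diff the per-question predictions the two runs produced. A rough sketch assuming both runs saved their predictions as .xlsx files with "index", "question" and "prediction" columns; the paths and column names below are assumptions, so adjust them to whatever your runs actually wrote.

```python
import pandas as pd

# Hypothetical paths to the prediction files from the two runs -- adjust to your setup.
OLD = "outputs_027e38c/llava_v1.5_7b_MME.xlsx"
NEW = "outputs_69b7b5e/llava_v1.5_7b_MME.xlsx"

old = pd.read_excel(OLD).set_index("index")
new = pd.read_excel(NEW).set_index("index")

# Join on the question index and keep only rows whose predictions differ.
merged = old.join(new, lsuffix="_old", rsuffix="_new")
changed = merged[
    merged["prediction_old"].astype(str).str.strip().str.lower()
    != merged["prediction_new"].astype(str).str.strip().str.lower()
]

print(f"{len(changed)} / {len(merged)} predictions differ")
print(changed[["question_old", "prediction_old", "prediction_new"]].head(20))
```

If the raw model outputs are identical and only the parsed answers differ, the regression is in the prompting or answer-extraction code rather than in the model itself.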

@lyzhongcrd

lyzhongcrd commented Dec 3, 2024

I can confirm the issue on commit 9f21ee8.
"perception","reasoning","OCR","artwork","celebrity","code_reasoning","color","commonsense_reasoning","count","existence","landmark","numerical_calculation","position","posters","scene","text_translation" "1342.2684073629453","301.42857142857144","130.0","112.0","125.88235294117646","62.5","151.66666666666669","106.42857142857143","93.33333333333333","185.0","138.5","40.0","115.0","140.1360544217687","150.75","92.5"
The counting score is 93.3, while it is 155 in a reference experiment (haotian-liu/LLaVA#927). The counting score is extremely low because LLaVA-1.5 tends to report the actual count instead of answering yes or no.

The same phenomenon is observed on MME-RealWorld-Lite, the counting score is 0.

This might be related to the system prompt or the model's instruction-following capability, but I am not sure how to address it.
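A strict yes/no matcher would explain that: an answer like "3" contains neither "yes" nor "no" and gets scored as wrong, whereas a run whose prompt keeps the model on-format (e.g. ending with "Please answer yes or no.") does not hit this. A hypothetical extractor, just to illustrate the failure mode (not VLMEvalKit's actual matching code):

```python
def extract_yes_no(raw: str):
    """Map a free-form model output to 'yes'/'no'; return None if neither is found."""
    text = raw.strip().lower()
    if text.startswith("yes") or " yes" in f" {text}":
        return "yes"
    if text.startswith("no") or " no" in f" {text}":
        return "no"
    return None  # e.g. the model answered with a bare count -> marked incorrect

# LLaVA-1.5 answering the literal count instead of yes/no:
print(extract_yes_no("There are 3 apples in the image."))  # None -> counted as wrong
print(extract_yes_no("Yes, there are three apples."))      # 'yes'
```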

Can @kennymckormick maybe have a look? Thanks!

@terry-for-github
Author

> I can confirm the issue on commit 9f21ee8. [...] The counting score is 93.3, while it is 155 in a reference experiment (haotian-liu/LLaVA#927). [...] This might be related to the system prompt or the model's instruction-following capability, but I am not sure how to address it.
>
> Can @kennymckormick maybe have a look? Thanks!

Good job! Thanks for the testing. I'm trying to figure out the problem, too.
