Different evaluation results between different VLMEvalKit versions #523

Open
terry-for-github opened this issue Oct 16, 2024 · 5 comments

@terry-for-github

The official LLaVA-v1.5-7B model gets 1362 points on MME Perception under the newest code (69b7b5e), but it turns out to be 1497 under commit 027e38c. I made sure the LLaVA code was identical between these two experiments, so why does this happen?
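For context on why the totals can diverge this much: MME scores each subtask as per-question accuracy plus an "accuracy+" term (both questions about an image answered correctly), and the Perception total is the sum over the ten perception subtasks (max 2000). Below is a minimal sketch of that aggregation, assuming predictions are already normalized to yes/no; the function and field names are illustrative, not VLMEvalKit's actual code.

```python
from collections import defaultdict

def mme_subtask_score(records):
    """One MME subtask score = accuracy + accuracy+ (max 200).

    `records`: list of dicts with keys 'image_id', 'prediction', 'answer',
    where prediction/answer are already normalized to 'yes' or 'no'.
    """
    per_image = defaultdict(list)
    for r in records:
        per_image[r["image_id"]].append(r["prediction"] == r["answer"])

    total_q = sum(len(v) for v in per_image.values())
    correct_q = sum(sum(v) for v in per_image.values())
    acc = 100.0 * correct_q / total_q  # per-question accuracy
    acc_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)  # both questions of an image correct
    return acc + acc_plus

def mme_perception_total(per_subtask_records):
    """Perception total = sum over the 10 perception subtasks (max 2000)."""
    return sum(mme_subtask_score(recs) for recs in per_subtask_records.values())
```

So if the two commits map free-form model outputs to yes/no differently, a handful of subtasks can easily shift the total by 100+ points.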

@terry-for-github
Author

I've checked it once again. The problem still exists.

@lyklly

lyklly commented Nov 30, 2024

I also noticed this problem. However, when I checked the code in 69b7b5e, I found that it only modified three parts:

  1. Various log records were changed, e.g. warnings switched to logging
  2. Single quotes were changed to double quotes
  3. Some minor code issues were fixed

Have you solved this problem?

@terry-for-github
Author

> I also noticed this problem. However, when I checked the code in 69b7b5e, I found that it only modified three parts:
>   1. Various log records were changed, e.g. warnings switched to logging
>   2. Single quotes were changed to double quotes
>   3. Some minor code issues were fixed
>   Have you solved this problem?

No. I just use the 027e38c version directly, lol.
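To narrow down which of those changes matters, one option is to diff the per-question predictions the two runs produced. A rough sketch assuming both runs saved their predictions as .xlsx files with "index", "question" and "prediction" columns; the paths and column names below are assumptions, so adjust them to whatever your runs actually wrote.

```python
import pandas as pd

# Hypothetical paths to the prediction files from the two runs -- adjust to your setup.
OLD = "outputs_027e38c/llava_v1.5_7b_MME.xlsx"
NEW = "outputs_69b7b5e/llava_v1.5_7b_MME.xlsx"

old = pd.read_excel(OLD).set_index("index")
new = pd.read_excel(NEW).set_index("index")

# Join on the question index and keep only rows whose predictions differ.
merged = old.join(new, lsuffix="_old", rsuffix="_new")
changed = merged[
    merged["prediction_old"].astype(str).str.strip().str.lower()
    != merged["prediction_new"].astype(str).str.strip().str.lower()
]

print(f"{len(changed)} / {len(merged)} predictions differ")
print(changed[["question_old", "prediction_old", "prediction_new"]].head(20))
```

If the raw model outputs are identical and only the parsed answers differ, the regression is in the prompting or answer-extraction code rather than in the model itself.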

@lyzhongcrd

lyzhongcrd commented Dec 3, 2024

I can confirm the issue on commit 9f21ee8.
"perception","reasoning","OCR","artwork","celebrity","code_reasoning","color","commonsense_reasoning","count","existence","landmark","numerical_calculation","position","posters","scene","text_translation" "1342.2684073629453","301.42857142857144","130.0","112.0","125.88235294117646","62.5","151.66666666666669","106.42857142857143","93.33333333333333","185.0","138.5","40.0","115.0","140.1360544217687","150.75","92.5"
The counting score is 93.3, while it is 155 in a reference experiment (haotian-liu/LLaVA#927). The counting score is extremely low because LLaVA-1.5 tends to report the actual count instead of answering yes or no.

The same phenomenon is observed on MME-RealWorld-Lite, the counting score is 0.

This might be related to the system prompt or the model's instruction-following capability, but I am not sure how to address it.
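A strict yes/no matcher would explain that: an answer like "3" contains neither "yes" nor "no" and gets scored as wrong, whereas a run whose prompt keeps the model on-format (e.g. ending with "Please answer yes or no.") does not hit this. A hypothetical extractor, just to illustrate the failure mode (not VLMEvalKit's actual matching code):

```python
def extract_yes_no(raw: str):
    """Map a free-form model output to 'yes'/'no'; return None if neither is found."""
    text = raw.strip().lower()
    if text.startswith("yes") or " yes" in f" {text}":
        return "yes"
    if text.startswith("no") or " no" in f" {text}":
        return "no"
    return None  # e.g. the model answered with a bare count -> marked incorrect

# LLaVA-1.5 answering the literal count instead of yes/no:
print(extract_yes_no("There are 3 apples in the image."))  # None -> counted as wrong
print(extract_yes_no("Yes, there are three apples."))      # 'yes'
```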

Can @kennymckormick maybe have a look? Thanks!

@terry-for-github
Author

> I can confirm the issue on commit 9f21ee8. [...] The counting score is 93.3, while it is 155 in a reference experiment (haotian-liu/LLaVA#927). [...] This might be related to the system prompt or the model's instruction-following capability, but I am not sure how to address it.
>
> Can @kennymckormick maybe have a look? Thanks!

Good job! Thanks for the testing. I'm trying to figure out the problem, too.
