Mathvision bug fixes, Reproduce Qwen2.5VL results #660
Description:
This PR fixes a bug in MathVision evaluation and includes improvements to reproduce the MathVision evaluation for Qwen2.5-VL.
The current MathVision evaluation code breaks because of two issues:
1. The default value of "until" is "\n\n", which truncates the model's response at the first blank line.
   Reference
2. The mathvision_doc_to_text(doc) function needs to accept lmms_eval_specific_kwargs as an argument.
   Reference
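A minimal sketch of the second fix, for illustration only. It assumes that lmms_eval_specific_kwargs is a dict that may carry "pre_prompt" and "post_prompt" entries and that each doc has a "question" field. Those keys and the prompt assembly are assumptions, not the exact code changed in this PR.

```python
def mathvision_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Build the prompt text for one MathVision sample.

    Accepting lmms_eval_specific_kwargs (defaulting to None) lets the
    function be called with or without task-specific prompt settings.
    """
    # Assumed layout: optional "pre_prompt" / "post_prompt" strings.
    kwargs = lmms_eval_specific_kwargs or {}
    pre_prompt = kwargs.get("pre_prompt", "")
    post_prompt = kwargs.get("post_prompt", "")
    # Assumed field name: "question" holds the problem statement.
    return f"{pre_prompt}{doc['question']}{post_prompt}"


# Hypothetical sample and kwargs, for illustration only.
sample = {"question": "What is the area of the triangle?"}
prompt = mathvision_doc_to_text(
    sample, {"post_prompt": "\nAnswer with \\boxed{}."}
)
```

Defaulting the extra argument to None keeps the function callable with a single doc argument as well.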
Reproducing MathVision Results on Qwen2.5-3B-Instruct:
The current implementation produces a MathVision Standard Eval score of 0.46, which is incorrect. The answer parser only sees a truncated response (see Bug 1 above), so the final answer is usually missing.
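The effect of a "\n\n" stop sequence can be sketched as follows. The helper truncate_at_stop and the sample response are hypothetical and only imitate how a generation backend cuts output at the first stop sequence. They are not taken from the evaluation code.

```python
def truncate_at_stop(text, until):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in until:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Hypothetical multi-paragraph response: reasoning first, answer last.
response = (
    "First, compute the area of the triangle.\n\n"
    "The area is 6, so the answer is \\boxed{6}."
)

truncated = truncate_at_stop(response, ["\n\n"])  # stops at the first blank line
untouched = truncate_at_stop(response, [])        # no stop sequence configured

# The boxed answer survives only when "\n\n" is not a stop sequence.
has_answer_truncated = "\\boxed" in truncated
has_answer_untouched = "\\boxed" in untouched
```

Because step-by-step answers typically separate reasoning paragraphs with blank lines, the final answer rarely appears before the first "\n\n".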
With the changes introduced in this PR and using a max_token_len of 2048, the official result of 21.68 for the Qwen2.5-3B-Instruct model can be reproduced.