
[Task] Add new benchmark: CAPability #656


Open · lntzm wants to merge 3 commits into main

Conversation


@lntzm lntzm commented Apr 27, 2025

This pull request adds a new visual caption benchmark: CAPability.

CAPability evaluates LMMs on roughly 11K human-annotated image/video-annotation pairs. Instead of annotating ground-truth captions directly, we design 13 dimensions, annotate the relevant information for each dimension, and evaluate captions along these 13 dimensions, reporting precision, recall, and f1-score as metrics. See this README for a detailed introduction. Accordingly, this PR adds 1 default template yaml, 13 config yamls (one per dimension), and 1 final grouped config yaml. The utils.py and prompt.py support the inference and evaluation process.
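As a rough illustration of how the three metrics relate, here is a minimal sketch computing per-dimension scores from hypothetical hit counts; the variable and function names are illustrative and this is not the actual scoring code in utils.py:

```python
# Sketch only: per-dimension precision/recall/F1 from hypothetical hit counts.
# `correct_mentions`, `total_mentions`, `covered_annotations`, `total_annotations`
# are illustrative names, not fields of the real CAPability evaluation output.

def dimension_scores(correct_mentions: int, total_mentions: int,
                     covered_annotations: int, total_annotations: int) -> dict:
    precision = correct_mentions / total_mentions if total_mentions else 0.0
    recall = covered_annotations / total_annotations if total_annotations else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(dimension_scores(8, 10, 6, 9))  # precision 0.80, recall ≈0.67, f1 ≈0.73
```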

Check these links if you are interested in CAPability: ArXiv, Project Page, Huggingface Dataset, and Github Repo.

This is an example of an evaluation pipeline run:
[screenshot: example]

This PR does not break any existing functionality in lmms-eval.

@lntzm changed the title from "Add new benchmark: CAPability" to "[Task] Add new benchmark: CAPability" on Apr 30, 2025
Comment on lines 25 to 26
OPENAI_API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
Collaborator

For consistency, it would be better to also support loading from an Azure endpoint. You can check other tasks such as llava-in-the-wild for reference.
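For example, something along these lines, following the OpenAI/Azure switch pattern used in other lmms-eval tasks (a sketch only; the exact environment variable names and defaults used in llava-in-the-wild may differ):

```python
import os

# Sketch of dual OpenAI/Azure support, mirroring the pattern in other lmms-eval tasks.
# The environment variable names below are illustrative and should be aligned with the
# convention already used in the repo (e.g. llava-in-the-wild's utils.py).
API_TYPE = os.getenv("API_TYPE", "openai")

if API_TYPE == "openai":
    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
elif API_TYPE == "azure":
    API_URL = os.getenv("AZURE_ENDPOINT", "YOUR_AZURE_ENDPOINT")
    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
    headers = {"api-key": API_KEY, "Content-Type": "application/json"}
```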

Comment on lines 91 to 95
# Delete any stale evaluation results, since lmms-eval does not support auto-resuming inference;
# this ensures the evaluation is re-run whenever the inference is re-run.
eval_save_path = os.path.join(os.path.dirname(save_path), f"../evaluation/{task}.jsonl")
if os.path.exists(eval_save_path):
    os.remove(eval_save_path)
Collaborator

Should the user also pay attention to this hardcoded eval save path?

Author

This hardcoded path saves the evaluation result when the first metric (precision) is computed; the record is then read directly when computing the second and third metrics (recall and f1-score), instead of re-running the GPT evaluation. I am not sure whether there is a better way to achieve this.

The file is removed to handle unexpected interruptions. Since lmms-eval does not support resuming evaluation, the user has to re-run both inference and evaluation after an interruption, so the previously saved evaluation records must be removed to avoid loading stale results. What would you expect in order to draw users' attention to this hardcoded eval save path? Maybe add a config option to control it?
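A minimal sketch of this save-then-reuse flow, with hypothetical function and argument names rather than the actual utils.py implementation:

```python
import json
import os

def load_or_evaluate(eval_save_path: str, samples: list, run_gpt_eval) -> list:
    """Sketch: run the GPT evaluation once (for precision) and cache the records as JSONL,
    then reuse the cached records when computing recall and f1-score. Hypothetical helper."""
    if os.path.exists(eval_save_path):
        # Second/third metric: reuse the cached per-sample judgements.
        with open(eval_save_path, "r") as f:
            return [json.loads(line) for line in f]

    # First metric: call the GPT judge and persist each record as one JSONL line.
    os.makedirs(os.path.dirname(eval_save_path), exist_ok=True)
    records = [run_gpt_eval(sample) for sample in samples]
    with open(eval_save_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return records
```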

Collaborator

No problem with hardcoding this for now. For the moment, you can add a warning log here to notify the user about this behavior. I may add resume support for evaluation as an upgrade feature later, if possible.
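Something like the following could slot into the snippet above where eval_save_path is defined (a sketch assuming loguru's logger, which other lmms-eval task utils commonly import as eval_logger):

```python
import os
from loguru import logger as eval_logger

# Warn the user that a stale evaluation record at the hardcoded path is being discarded,
# since lmms-eval cannot resume an interrupted inference/evaluation run.
if os.path.exists(eval_save_path):
    eval_logger.warning(
        f"Removing previous evaluation record at {eval_save_path}; "
        "lmms-eval does not support resuming, so evaluation will be re-run from scratch."
    )
    os.remove(eval_save_path)
```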

Author

Got it~

Collaborator

@kcz358 kcz358 left a comment


Hi, thank you for your contribution. Most of the code LGTM; please check the comments above to see whether they should be addressed. Otherwise, I will merge this PR. Thanks!
