[Task] Add new benchmark: CAPability #656
Conversation
lmms_eval/tasks/capability/utils.py
OPENAI_API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
For consistency, it would be better to also support loading from an Azure endpoint. You can check other tasks such as llava-in-the-wild for reference.
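For reference, a minimal sketch of the kind of endpoint switch used by other lmms-eval tasks; the exact environment-variable names and defaults in llava-in-the-wild may differ, so treat these as assumptions:

```python
import os

# Select the backend via an environment variable; "openai" is assumed as the default.
API_TYPE = os.getenv("API_TYPE", "openai")

if API_TYPE == "openai":
    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
elif API_TYPE == "azure":
    # Azure OpenAI deployments expose their own endpoint and authenticate with an "api-key" header.
    API_URL = os.getenv("AZURE_ENDPOINT", "YOUR_AZURE_ENDPOINT")
    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
    headers = {
        "api-key": API_KEY,
        "Content-Type": "application/json",
    }
```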
lmms_eval/tasks/capability/utils.py
# Delete stale evaluation results, as lmms-eval does not support auto-resuming inference;
# this ensures evaluation is re-run whenever inference is re-run.
eval_save_path = os.path.join(os.path.dirname(save_path), f"../evaluation/{task}.jsonl")
if os.path.exists(eval_save_path):
    os.remove(eval_save_path)
Should the user also pay attention to this hardcoded eval save path?
This hardcoded save path is used to save the evaluation result when running the first metric (precision), and then to read the record directly when running the second and third metrics (recall and f1-score), rather than re-running the GPT evaluation. I am not sure whether there is a better way to achieve this.
The removal of this path handles the case of an unexpected interruption. Since lmms-eval does not support resuming evaluation, the user has to re-run both inference and evaluation after an interruption, and the previously saved evaluation records should then be removed to avoid loading them. What should be done to make users pay attention to this hardcoded eval save path? Maybe add a config option to control it?
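A minimal sketch of the save-then-reuse flow described above; the helper name and record fields here are hypothetical, not the ones in the submitted `utils.py`:

```python
import json
import os


def load_or_evaluate(eval_save_path, doc_id, evaluate_fn):
    """Reuse a judgment cached by an earlier metric pass (precision) if present;
    otherwise call the GPT evaluator and append the result so that the later
    recall / f1-score passes can read it instead of re-querying the API."""
    if os.path.exists(eval_save_path):
        with open(eval_save_path, "r", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["doc_id"] == doc_id:
                    return record
    record = evaluate_fn(doc_id)
    with open(eval_save_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```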
No problem with hardcoding this for now. I think you can add a warning log here to notify the user about this. I may add resume evaluation as an upgrade in a future release if possible.
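A possible shape for that warning, assuming the loguru-based `eval_logger` that other lmms-eval tasks import; the wrapper function name is hypothetical, and `save_path` / `task` are the same variables as in the snippet above:

```python
import os

from loguru import logger as eval_logger


def clear_stale_eval_records(save_path: str, task: str) -> None:
    """Warn the user, then remove evaluation records left over from an interrupted run."""
    eval_save_path = os.path.join(os.path.dirname(save_path), f"../evaluation/{task}.jsonl")
    if os.path.exists(eval_save_path):
        eval_logger.warning(
            f"Removing previously saved evaluation records at {eval_save_path}: "
            "lmms-eval does not support resuming evaluation, so re-running inference "
            "will re-run the GPT evaluation from scratch."
        )
        os.remove(eval_save_path)
```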
Got it~
Hi, thank you for your contribution. Most of the code LGTM; you may check some of the comments to see whether they should be addressed. Otherwise, I will merge this PR. Thanks!
This pull request adds a new visual captioning benchmark: CAPability.
CAPability evaluates LMMs on around 11K human-annotated image/video-annotation pairs. Instead of directly annotating ground-truth captions, we design 13 dimensions, annotate the information for each dimension, and evaluate along these 13 dimensions, reporting precision, recall, and f1-score as metrics. See this README for a detailed introduction. Accordingly, we create 1 default template yaml, 13 config yamls (one per dimension), and 1 final grouped config yaml. The `utils.py` and `prompt.py` support the inference and evaluation process. Check these links if you are interested in CAPability: ArXiv, Project Page, Huggingface Dataset, and Github Repo.
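As a rough illustration of how the three reported numbers relate per dimension (the precise definitions of the correct/predicted/annotated counts follow the CAPability README, not this sketch):

```python
def aggregate_metrics(num_correct: int, num_predicted: int, num_annotated: int) -> dict:
    """Combine per-dimension counts into the three reported metrics."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_annotated if num_annotated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```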
This is an example of an evaluation pipeline:

This PR does not break any existing functionality in lmms-eval.