[Task] Add new benchmark: CAPability #656
Conversation
lmms_eval/tasks/capability/utils.py
OPENAI_API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
For consistency, it would be better to also support loading from an Azure endpoint. You can check other tasks such as llava-in-the-wild for reference.
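For reference, a minimal sketch of the kind of endpoint switch used by other lmms-eval tasks; the exact environment-variable names and defaults in llava-in-the-wild may differ, so treat these as assumptions:

```python
import os

# Select the backend via an environment variable; "openai" is assumed as the default.
API_TYPE = os.getenv("API_TYPE", "openai")

if API_TYPE == "openai":
    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
elif API_TYPE == "azure":
    # Azure OpenAI deployments expose their own endpoint and authenticate with an "api-key" header.
    API_URL = os.getenv("AZURE_ENDPOINT", "YOUR_AZURE_ENDPOINT")
    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
    headers = {
        "api-key": API_KEY,
        "Content-Type": "application/json",
    }
```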
lmms_eval/tasks/capability/utils.py
# Delete stale evaluation results, as lmms-eval does not support auto-resuming inference;
# this ensures evaluation is re-run whenever inference is re-run.
eval_save_path = os.path.join(os.path.dirname(save_path), f"../evaluation/{task}.jsonl")
if os.path.exists(eval_save_path):
    os.remove(eval_save_path)
Should the user also pay attention to this hardcoded eval save path?
This hardcoded save path is used to save the evaluation result when running the first metric (precision), and then to read the record directly when running the second and third metrics (recall and f1-score), rather than re-running the GPT evaluation. I am not sure whether there is a better way to achieve this.
The removal of this path handles the case of an unexpected interruption. Since lmms-eval does not support resuming evaluation, the user has to re-run both inference and evaluation after an interruption, and the previously saved evaluation records should then be removed to avoid loading them. What should be done to make users pay attention to this hardcoded eval save path? Maybe add a config option to control it?
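A minimal sketch of the save-then-reuse flow described above; the helper name and record fields here are hypothetical, not the ones in the submitted `utils.py`:

```python
import json
import os


def load_or_evaluate(eval_save_path, doc_id, evaluate_fn):
    """Reuse a judgment cached by an earlier metric pass (precision) if present;
    otherwise call the GPT evaluator and append the result so that the later
    recall / f1-score passes can read it instead of re-querying the API."""
    if os.path.exists(eval_save_path):
        with open(eval_save_path, "r", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["doc_id"] == doc_id:
                    return record
    record = evaluate_fn(doc_id)
    with open(eval_save_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```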
No problem with hardcoding this for now. I think you can add a warning log here to notify the user about this. I may add resume evaluation as an upgrade in a future release if possible.
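A possible shape for that warning, assuming the loguru-based `eval_logger` that other lmms-eval tasks import; the wrapper function name is hypothetical, and `save_path` / `task` are the same variables as in the snippet above:

```python
import os

from loguru import logger as eval_logger


def clear_stale_eval_records(save_path: str, task: str) -> None:
    """Warn the user, then remove evaluation records left over from an interrupted run."""
    eval_save_path = os.path.join(os.path.dirname(save_path), f"../evaluation/{task}.jsonl")
    if os.path.exists(eval_save_path):
        eval_logger.warning(
            f"Removing previously saved evaluation records at {eval_save_path}: "
            "lmms-eval does not support resuming evaluation, so re-running inference "
            "will re-run the GPT evaluation from scratch."
        )
        os.remove(eval_save_path)
```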
Got it~
Hi, thank you for your contribution. Most of the code LGTM; you may check some of the comments to see whether they should be addressed. Otherwise, I will merge this PR. Thanks!
This pull request adds a new visual captioning benchmark: CAPability.
CAPability evaluates LMMs on around 11K human-annotated image/video-annotation pairs. Instead of directly annotating ground-truth captions, we design 13 dimensions, annotate the information for each dimension, and evaluate along these 13 dimensions, reporting precision, recall, and f1-score as metrics. See this README for a detailed introduction. Accordingly, we create 1 default template yaml, 13 config yamls (one per dimension), and 1 final grouped config yaml. The `utils.py` and `prompt.py` support the inference and evaluation process. Check these links if you are interested in CAPability: ArXiv, Project Page, Huggingface Dataset, and Github Repo.
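As a rough illustration of how the three reported numbers relate per dimension (the precise definitions of the correct/predicted/annotated counts follow the CAPability README, not this sketch):

```python
def aggregate_metrics(num_correct: int, num_predicted: int, num_annotated: int) -> dict:
    """Combine per-dimension counts into the three reported metrics."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_annotated if num_annotated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```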
This is an example of an evaluation pipeline:

This PR does not break any existing functionality in lmms-eval.