Llama3-as-a-judge issues #17
I am running into some trouble with the Llama-3-as-a-judge pipeline. Here are two issues I've encountered:
- If Llama does not provide a valid score, the index here goes out of bounds and the script enters an infinite loop because of the `while True` try/except logic. Did you encounter this? For now I am simply setting the score to 0 in these cases.
- More importantly, many of my Llama scores appear to be incorrect. The model seems especially prone to copying the example score of "7" shown in this line, which results in a large number of falsely high evaluation scores. Could you share a few sample input–score pairs from your baseline runs so I can debug this? For instance, here is one evaluation that seems clearly wrong:
{
"pred_insight": "The \"Dell Latitude 7490\" stands out as the only configuration item with variability in declined amounts, exhibiting a standard deviation of 2,404.49, confirming its unique pattern among the analyzed data.",
"gt_insight": "No Correlation Between the Number of Expense Reports Submitted and Rejection Rates",
"score": 0.7857845326994692
},
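For the first issue, here is a minimal sketch of how I am working around the infinite loop: replace the `while True` retry with a bounded loop and fall back to a default score. The helper names (`parse_judge_score`, `query_llm`) are my own, not from the repo, and the regex-based parsing is an assumption about what the judge's reply looks like.

```python
import re

def parse_judge_score(text):
    """Extract the first number from the judge's reply, or None if absent.
    Hypothetical helper; the repo's actual parsing may differ."""
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def judge_with_retries(prompt, query_llm, max_retries=3, default=0.0):
    """Bounded retry instead of `while True`: re-query the judge a few
    times, then give up and return `default` (0.0 here) rather than loop."""
    for _ in range(max_retries):
        score = parse_judge_score(query_llm(prompt))
        if score is not None:
            return score
    return default
```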
I believe the only modification I made to the pipeline is adding a "chat_template" field to the tokenizer_config.json of Meta-Llama-3-70B. Without this field, vLLM raises:
[serving_chat.py:251] ValueError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
I simply copied the chat_template from https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.
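For reference, this is roughly how I apply that edit. The helper and the file path are placeholders for my local setup; the template string itself is the one copied from the Llama-3.3-70B-Instruct repo.

```python
import json
from pathlib import Path

def add_chat_template(cfg_path, template):
    """Insert a "chat_template" field into tokenizer_config.json if it is
    missing. Hypothetical helper mirroring the manual edit described above."""
    path = Path(cfg_path)
    cfg = json.loads(path.read_text())
    cfg.setdefault("chat_template", template)  # keep any existing template
    path.write_text(json.dumps(cfg, indent=2))
    return cfg
```

An alternative I have not fully verified: vLLM's OpenAI-compatible server accepts a `--chat-template` argument at launch, which would avoid editing the model files at all.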