Llama3-as-a-judge issues #17
I am running into some trouble with the Llama-3-as-a-judge pipeline. Here are two issues I've encountered:
- If Llama does not provide a valid score, the index here goes out of bounds and the script enters an infinite loop because of the `while True` try/except logic. Did you encounter this? For now I am simply setting the score to 0 in these cases.
- More importantly, many of my Llama scores appear to be incorrect. The model seems especially prone to copying the example score of "7" shown in this line, which results in a large number of falsely high evaluation scores. Could you share a few sample input–score pairs from your baseline runs so I can debug this? For instance, here is one evaluation that seems clearly wrong:
{
"pred_insight": "The \"Dell Latitude 7490\" stands out as the only configuration item with variability in declined amounts, exhibiting a standard deviation of 2,404.49, confirming its unique pattern among the analyzed data.",
"gt_insight": "No Correlation Between the Number of Expense Reports Submitted and Rejection Rates",
"score": 0.7857845326994692
},
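For the first issue, here is a minimal sketch of how I am working around the infinite loop: replace the `while True` retry with a bounded loop and fall back to a default score. The helper names (`parse_judge_score`, `query_llm`) are my own, not from the repo, and the regex-based parsing is an assumption about what the judge's reply looks like.

```python
import re

def parse_judge_score(text):
    """Extract the first number from the judge's reply, or None if absent.
    Hypothetical helper; the repo's actual parsing may differ."""
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def judge_with_retries(prompt, query_llm, max_retries=3, default=0.0):
    """Bounded retry instead of `while True`: re-query the judge a few
    times, then give up and return `default` (0.0 here) rather than loop."""
    for _ in range(max_retries):
        score = parse_judge_score(query_llm(prompt))
        if score is not None:
            return score
    return default
```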
I believe the only modification I made to the pipeline is adding a "chat_template" field to the tokenizer_config.json of Meta-Llama-3-70B. Without this field, vLLM raises:
[serving_chat.py:251] ValueError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
I simply copied the chat_template from https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.
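For reference, this is roughly how I apply that edit. The helper and the file path are placeholders for my local setup; the template string itself is the one copied from the Llama-3.3-70B-Instruct repo.

```python
import json
from pathlib import Path

def add_chat_template(cfg_path, template):
    """Insert a "chat_template" field into tokenizer_config.json if it is
    missing. Hypothetical helper mirroring the manual edit described above."""
    path = Path(cfg_path)
    cfg = json.loads(path.read_text())
    cfg.setdefault("chat_template", template)  # keep any existing template
    path.write_text(json.dumps(cfg, indent=2))
    return cfg
```

An alternative I have not fully verified: vLLM's OpenAI-compatible server accepts a `--chat-template` argument at launch, which would avoid editing the model files at all.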