Add evidence field to judge criteria results by rajdeepmahal24 · Pull Request #215 · langwatch/scenario

rajdeepmahal24 · 2026-01-17T21:15:27Z

Summary

- Add `evidence` to the judge tool schema so each criterion includes supporting quotes/values
- Return the evidence mapping in both the Python `ScenarioResult` and the JS `JudgeResult`
- Update the judge prompt to request evidence per criterion
- Add a Python unit test for evidence parsing
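
To make the summary above concrete, here is a minimal sketch of one way a per-criterion `evidence` field could sit in a tool schema and be collected into a mapping. It is not the schema from this PR: `build_finish_test_schema`, `extract_evidence`, `_criterion_key`, and the nested `verdict`/`evidence` layout are all hypothetical names and shapes chosen for illustration.

```python
import re
from typing import Any, Dict, List


def _criterion_key(text: str) -> str:
    """Hypothetical sanitizer: turn criterion text into a property name."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def build_finish_test_schema(criteria: List[str]) -> Dict[str, Any]:
    """Build an illustrative JSON schema where every criterion needs evidence."""
    per_criterion = {
        "type": "object",
        "properties": {
            "verdict": {"type": "string", "enum": ["true", "false", "inconclusive"]},
            "evidence": {
                "type": "string",
                "description": "Quote or value from the conversation supporting the verdict",
            },
        },
        # Assumption: evidence is required alongside the verdict.
        "required": ["verdict", "evidence"],
    }
    keys = [_criterion_key(c) for c in criteria]
    return {
        "type": "object",
        "properties": {
            "criteria": {
                "type": "object",
                "properties": {key: per_criterion for key in keys},
                "required": keys,
            }
        },
        "required": ["criteria"],
    }


def extract_evidence(tool_args: Dict[str, Any]) -> Dict[str, str]:
    """Collect a criterion -> evidence mapping from parsed tool call arguments."""
    return {
        key: value.get("evidence", "")
        for key, value in tool_args["criteria"].items()
    }
```

With a mapping like this on the result object, a caller can print the quoted evidence next to each failed criterion instead of re-reading the whole transcript.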

Why

When criteria fail, users need concrete evidence to decide whether a failure is legitimate or the criteria should be adjusted. This change makes the evidence first-class in the judge results.

Testing

python -m pytest python/tests/test_judge_agent.py

Notes

Evidence is required in the finish_test tool schema (non‑breaking for callers who use the JudgeAgent directly; it only affects the LLM judge output).

Fixes langwatch#161

## Problem

JudgeAgent intermittently fails with `AttributeError: 'str' object has no attribute 'values'` when the LLM returns the `criteria` field as a JSON string instead of a dictionary object. This occurs at lines 439 and 444 when the code calls `criteria.values()` without verifying that `criteria` is actually a dict.

## Root Cause

When the LLM is uncertain about the schema format (particularly with complex dynamic schemas using sanitized criterion text as property names), it sometimes serializes the nested `criteria` object as a JSON string rather than a proper dict.

## Solution

Add defensive parsing after extracting criteria from tool call arguments:

1. Check if `criteria` is a string
2. If yes, attempt to parse it with `json.loads()`
3. If parsing fails, log a warning and use empty dict as fallback
4. Additionally verify `criteria` is a dict before calling `.values()`

This ensures the code gracefully handles both formats:

- Direct dict: `{"criterion_1": "true", "criterion_2": "false"}`
- JSON string: `"{\"criterion_1\": \"true\", \"criterion_2\": \"false\"}"`

## Testing

- Verified Python syntax with `python -m py_compile`
- Fix includes detailed logging for debugging
- Graceful fallback prevents test failures

## Impact

- Low risk: Only adds defensive parsing with fallback
- Fixes intermittent failures reported in issue langwatch#161
- No changes to normal execution path when criteria is already a dict
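
The four steps in the Solution can be sketched as a small helper. This is a minimal illustration, not the code from the commit: the function name `coerce_criteria` and the log messages are assumptions.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def coerce_criteria(raw: Any) -> Dict[str, Any]:
    """Return criteria as a dict, tolerating a JSON-string payload.

    Hypothetical helper: falls back to an empty dict when the value
    cannot be interpreted as a dict.
    """
    # Steps 1-2: a string may be a serialized JSON object, so try to parse it.
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            # Step 3: parsing failed, so warn and fall back.
            logger.warning("criteria is a non-JSON string; using empty dict")
            return {}
    # Step 4: only a dict is safe to call .values() on.
    if not isinstance(raw, dict):
        logger.warning("criteria has type %s, not dict; using empty dict", type(raw).__name__)
        return {}
    return raw
```

Usage: `coerce_criteria(args.get("criteria", {})).values()` would then be safe for both the direct-dict and JSON-string formats, where `args` stands for the parsed tool call arguments.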

rajdeepmahal24 added 2 commits December 13, 2025 11:43

Add evidence to judge criteria results

3e8e1ed

rajdeepmahal24 wants to merge 2 commits into langwatch:main from rajdeepmahal24:judge-evidence
