Add evidence field to judge criteria results #215
Open
rajdeepmahal24 wants to merge 2 commits into langwatch:main from
Fixes langwatch#161

## Problem

JudgeAgent intermittently fails with `AttributeError: 'str' object has no attribute 'values'` when the LLM returns the `criteria` field as a JSON string instead of a dictionary object. This occurs at lines 439 and 444 when the code calls `criteria.values()` without verifying that `criteria` is actually a dict.

## Root Cause

When the LLM is uncertain about the schema format (particularly with complex dynamic schemas using sanitized criterion text as property names), it sometimes serializes the nested `criteria` object as a JSON string rather than a proper dict.

## Solution

Add defensive parsing after extracting criteria from tool call arguments:

1. Check if `criteria` is a string
2. If yes, attempt to parse it with `json.loads()`
3. If parsing fails, log a warning and use empty dict as fallback
4. Additionally verify `criteria` is a dict before calling `.values()`

This ensures the code gracefully handles both formats:

- Direct dict: `{"criterion_1": "true", "criterion_2": "false"}`
- JSON string: `"{\"criterion_1\": \"true\", \"criterion_2\": \"false\"}"`

## Testing

- Verified Python syntax with `python -m py_compile`
- Fix includes detailed logging for debugging
- Graceful fallback prevents test failures

## Impact

- Low risk: Only adds defensive parsing with fallback
- Fixes intermittent failures reported in issue langwatch#161
- No changes to normal execution path when criteria is already a dict
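The defensive parsing steps listed under Solution could look roughly like the sketch below. This is not the code from the PR diff: the helper name `normalize_criteria` and the logger setup are hypothetical, chosen only to illustrate the string check, the `json.loads()` attempt, the empty-dict fallback, and the final dict check.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def normalize_criteria(criteria: Any) -> dict:
    """Return criteria as a dict (hypothetical helper, not the PR's code).

    Accepts either a dict or a JSON string encoding a dict; anything
    else falls back to an empty dict so that .values() is always safe.
    """
    # Steps 1-2: if the LLM returned a JSON string, try to parse it.
    if isinstance(criteria, str):
        try:
            criteria = json.loads(criteria)
        except json.JSONDecodeError:
            # Step 3: log a warning and fall back to an empty dict.
            logger.warning("Could not parse criteria as JSON: %r", criteria)
            return {}
    # Step 4: make sure we really have a dict before callers use .values().
    if not isinstance(criteria, dict):
        logger.warning(
            "Expected criteria to be a dict, got %s", type(criteria).__name__
        )
        return {}
    return criteria
```

With a helper like this, both the direct dict and the JSON-string form yield the same dict, while unparseable or non-dict input yields `{}` instead of raising `AttributeError`.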
Summary
Why
When criteria fail, users need concrete evidence to decide whether a failure is legitimate or the criteria should be adjusted. This change makes the evidence first-class in the judge results.
Testing
Notes