Why
"Wrong tool args" is one of the top regression modes for tool-calling agents — the right tool gets picked but with a bad payload, and unless you're validating, the diff just looks like an output change.
Pydantic AI already declares typed tool schemas. We can use those schemas at diff time to flag "the tool was called with arguments that don't validate" as a first-class regression class, separate from TOOLS_CHANGED.
What
Extend the Pydantic AI adapter (evalview/adapters/pydantic_ai_adapter.py) and the tool-call evaluator to surface schema-validation failures as a distinct reason code.
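A minimal sketch of the check, assuming Pydantic v2 and a plain BaseModel per tool. ReasonCode here is a stand-in for the real enum in evalview/core/types.py, and TOOL_ARGS_INVALID, GetWeatherArgs, TOOL_SCHEMAS, and check_tool_args are hypothetical names used only for illustration; how the adapter actually exposes schemas is left open.

```python
from enum import Enum
from typing import Any, Dict, List, Optional, Tuple, Type

from pydantic import BaseModel, ValidationError


class ReasonCode(str, Enum):
    # Stand-in for the enum in evalview/core/types.py; values are illustrative.
    TOOLS_CHANGED = "tools_changed"
    TOOL_ARGS_INVALID = "tool_args_invalid"  # hypothetical new reason code


class GetWeatherArgs(BaseModel):
    # Hypothetical argument schema for an example tool.
    city: str
    days: int = 1


# Hypothetical registry mapping tool names to their argument models.
TOOL_SCHEMAS: Dict[str, Type[BaseModel]] = {"get_weather": GetWeatherArgs}


def check_tool_args(
    tool_name: str, args: Dict[str, Any]
) -> Optional[Tuple[ReasonCode, List[str]]]:
    """Return (reason code, messages) if args fail validation, else None."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        # No schema known for this tool, so there is nothing to validate.
        return None
    try:
        schema.model_validate(args)
    except ValidationError as exc:
        # One "field.path: message" string per validation error.
        messages = [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
        return ReasonCode.TOOL_ARGS_INVALID, messages
    return None
```

Returning None for an unknown tool keeps this check from overlapping with TOOLS_CHANGED, which already covers the case where the tool itself differs.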
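Acceptance criteria
- ReasonCode (or extension of an existing one) for "tool args failed schema validation"
- [...] TOOLS_CHANGED and REGRESSION?)
- tests/evaluators/ covering: valid args pass, invalid args flagged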
Hints
- evalview/evaluators/ has the orchestrator and per-eval modules.
- evalview/core/types.py is where ReasonCode lives.
- Keep it Pydantic-AI-specific for now; we can generalize to a ToolSchemaProvider ABC in a later PR if other adapters want to opt in.
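For reference, the later generalization could look roughly like the sketch below. Only the ToolSchemaProvider name comes from this issue; the tool_args_schema method, StaticSchemaProvider, and supports_schema_validation are hypothetical, and nothing here is meant for this PR.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class ToolSchemaProvider(ABC):
    """Hypothetical interface for adapters that can describe tool argument schemas."""

    @abstractmethod
    def tool_args_schema(self, tool_name: str) -> Optional[Dict[str, Any]]:
        """Return a JSON-Schema-style dict for the tool's arguments, or None if unknown."""


class StaticSchemaProvider(ToolSchemaProvider):
    """Toy implementation backed by a fixed mapping, for illustration only."""

    def __init__(self, schemas: Dict[str, Dict[str, Any]]) -> None:
        self._schemas = schemas

    def tool_args_schema(self, tool_name: str) -> Optional[Dict[str, Any]]:
        return self._schemas.get(tool_name)


def supports_schema_validation(adapter: object) -> bool:
    # Adapters opt in simply by implementing the interface.
    return isinstance(adapter, ToolSchemaProvider)
```

With this shape, the evaluator would only run argument validation for adapters that opt in, and every other adapter would keep its current behavior.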
Size
~2-3 hours.