When using the pydantic-ai integration with ConfidentInstrumentationSettings, the integration does not capture actual_output, tools_called, or expected_tools from the agent execution trace. All three are None in the resulting test case, which causes ToolCorrectnessMetric to crash with:
deepeval.errors.MissingTestCaseParamsError: 'tools_called' and 'expected_tools' cannot be None for the 'Tool Correctness' metric
Even for metrics that don't require tools (e.g., AnswerRelevancyMetric), actual_output is None, so scoring fails.
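To make the failure mode concrete, here is a minimal stdlib-only sketch of the kind of None check that the error message implies. It is an assumption about the shape of the check, not deepeval's actual implementation: `SketchTestCase` and `missing_params` are hypothetical names introduced only for illustration, and the field values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTestCase:
    """Hypothetical stand-in for a test case; not deepeval's LLMTestCase."""
    input: str
    actual_output: Optional[str] = None
    tools_called: Optional[list] = None
    expected_tools: Optional[list] = None


def missing_params(test_case, required):
    """Return the names in `required` whose value on `test_case` is None."""
    return [name for name in required if getattr(test_case, name) is None]


# What the integration appears to produce: only `input` is populated.
observed = SketchTestCase(input="What does NDA stand for?")

# What a tool-aware metric would need (illustrative values only).
# An empty list is still "present"; only None counts as missing here.
wanted = SketchTestCase(
    input="What does NDA stand for?",
    actual_output="Non-disclosure agreement.",
    tools_called=[],
    expected_tools=["some_tool"],
)

required = ["actual_output", "tools_called", "expected_tools"]
print(missing_params(observed, required))  # all three names are reported
print(missing_params(wanted, required))    # nothing is reported
```

Under this reading, the agent calling no tools (an empty list) and the integration never recording tool calls (None) are different states, and only the second should raise a missing-parameter error.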
Versions:
deepeval==3.8.6
pydantic-ai==1.31.0
Python 3.12
Minimal reproduction:
import asyncio

from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai.instrumentator import ConfidentInstrumentationSettings
from deepeval.metrics import ToolCorrectnessMetric, AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import ToolCall

metrics = [ToolCorrectnessMetric(), AnswerRelevancyMetric()]

agent = Agent(
    "gpt-4.1-mini",
    instructions="You are a helpful assistant.",
    instrument=ConfidentInstrumentationSettings(
        is_test_mode=True,
        agent_metrics=metrics,
    ),
)

dataset = EvaluationDataset(
    goldens=[
        Golden(
            input="What does NDA stand for?",
            expected_tools=[ToolCall(name="some_tool")],
        ),
    ]
)


async def run_agent(input_text: str):
    result = await agent.run(input_text)
    return result.output


for golden in dataset.evals_iterator(metrics=metrics):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)