Description
Question
Hi team,
I've been working with `pydantic_ai` for some time now, integrating it into an agentic architecture that uses both Groq (LLaMA 3.3 70B Versatile) and OpenAI providers.

Here's the repository where the full setup is implemented:
https://github.com/Rikhil-Nell/Multi-Agentic-RAG

Also, here is a deployed Streamlit link: Streamlit App
(Please note: I'm a student with very limited API credits, so please be mindful if testing.)
The agent is fairly minimal right now; it uses two tools (a simplified registration sketch follows the list):

- A RAG vector search tool (`retrieve_relevant_documentation`)
- A dictionary lookup tool (`call_dictionary`, using the Merriam-Webster API)
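For context, the tools are wired up with `pydantic_ai` decorators roughly like the sketch below. This is a simplified illustration rather than the exact repo code: the tool bodies are stubbed, and the choice of `@agent.tool_plain` is my shorthand here, not necessarily what the repo uses.

```python
from pydantic_ai import Agent

# Simplified sketch of the agent setup; the real tool bodies live in the repo.
agent = Agent(
    "groq:llama-3.3-70b-versatile",
    system_prompt=(
        "Be concise, reply professionally. Use call_dictionary when asked to "
        "define a word. Use retrieve_relevant_documentation for queries out of scope."
    ),
)


@agent.tool_plain
async def call_dictionary(word: str) -> str:
    """Look up a word via the Merriam-Webster API (stubbed here)."""
    return f"definition of {word!r} from Merriam-Webster"


@agent.tool_plain
async def retrieve_relevant_documentation(query: str) -> str:
    """Vector search over the RAG documentation store (stubbed here)."""
    return f"top documents matching {query!r}"
```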
The Problem
Tool usage is highly inconsistent. At times, everything works perfectly: the agent recognizes intent, calls the right tool, parses arguments, and returns the result correctly.

However, during certain stretches (seemingly at random), the model stops making actual tool calls and instead returns placeholder-style outputs like:

`<function=call_dictionary({"word": "suburbs"})</function>`
This happens despite:

- The tool functions being cleanly defined using `@tool` decorators
- Well-structured prompts that explicitly direct the model to use tools when needed
- Confirmed success of the same codebase and logic at other times
A sample system prompt looks like this:

> Be concise, reply professionally. Use the tool `call_dictionary` when asked to define a word. Use `retrieve_relevant_documentation` for queries out of scope. Never respond with the tool call string; only invoke tools directly and wait for the result. Never fabricate an answer. Always begin with RAG if unsure.
I've verified that:
- The code path is not skipping function execution
- There are no exceptions thrown during successful runs
- This issue does not seem to stem from the tool logic itself
It feels like either:

- The model isn't following the system prompt consistently, or
- Something is flaky in the tool-call orchestration layer, whether in `pydantic_ai` or the Groq integration
Questions / Help Needed

- Is this a known limitation when using Groq models through `pydantic_ai`?
- Are there internal retry mechanisms or validation steps I can hook into?
- Are tool calls non-deterministic across inference providers?
- How can I debug cases where the model outputs a tool-call string instead of invoking the function? (A debug harness I plan to use is sketched below.)
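For the last question, this is the kind of minimal debug harness I plan to run when the failure shows up, based on my reading of the pydantic_ai docs. The `capture_run_messages` helper and the exact result attributes are taken from the documentation and may differ slightly on 0.1.10:

```python
from pydantic_ai import Agent, capture_run_messages

# A bare agent is enough to show the pattern; in practice this is the
# tool-equipped agent from the repo.
agent = Agent("groq:llama-3.3-70b-versatile")


def debug_run(prompt: str) -> None:
    """Run the agent once and dump every raw message exchanged with the model."""
    with capture_run_messages() as messages:
        try:
            result = agent.run_sync(prompt)
            print("final output:", result.data)  # newer releases expose result.output
        finally:
            # Check whether the model produced a structured tool-call part or
            # only a literal "<function=...>" string inside a plain text part.
            for message in messages:
                print(message)


if __name__ == "__main__":
    debug_run("Define the word 'suburbs'.")
```

If there is also a validator hook where raising `ModelRetry` on a response containing a literal `<function=` string would force the model to try again, a pointer to it would be very helpful.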
I'm not sure if this is a problem with the model, the inference provider, `pydantic_ai`, or my orchestration, but I'd greatly appreciate any pointers or support.
Yes, I used AI to help write this up; I'm too frustrated right now to write a good issue on my own, so I apologize. I will provide any and all information required for diagnosis if it means getting my code working.
Additional Context
`pydantic_ai` version: 0.1.10
Python version: 3.13