Description
Question
Hi team,
I've been working with `pydantic_ai` for some time now, integrating it into an agentic architecture that uses both Groq (LLaMA 3.3 70B Versatile) and OpenAI providers.

Here's the repository where the full setup is implemented:
https://github.com/Rikhil-Nell/Multi-Agentic-RAG

Also, here is a deployed Streamlit link: Streamlit App
(Please note: I'm a student with very limited API credits, so please be mindful if testing.)
The agent is fairly minimal right now; it uses two tools (a simplified registration sketch follows the list):

- A RAG vector search tool (`retrieve_relevant_documentation`)
- A dictionary lookup tool (`call_dictionary`, using the Merriam-Webster API)
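For context, the tools are wired up with `pydantic_ai` decorators roughly like the sketch below. This is a simplified illustration rather than the exact repo code: the tool bodies are stubbed, and the choice of `@agent.tool_plain` is my shorthand here, not necessarily what the repo uses.

```python
from pydantic_ai import Agent

# Simplified sketch of the agent setup; the real tool bodies live in the repo.
agent = Agent(
    "groq:llama-3.3-70b-versatile",
    system_prompt=(
        "Be concise, reply professionally. Use call_dictionary when asked to "
        "define a word. Use retrieve_relevant_documentation for queries out of scope."
    ),
)


@agent.tool_plain
async def call_dictionary(word: str) -> str:
    """Look up a word via the Merriam-Webster API (stubbed here)."""
    return f"definition of {word!r} from Merriam-Webster"


@agent.tool_plain
async def retrieve_relevant_documentation(query: str) -> str:
    """Vector search over the RAG documentation store (stubbed here)."""
    return f"top documents matching {query!r}"
```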
The Problem
Tool usage is highly inconsistent. At times, everything works perfectly: the agent recognizes intent, calls the right tool, parses arguments, and returns the result correctly.

However, during certain stretches (seemingly at random), the model stops making actual tool calls and instead returns placeholder-style outputs like:

`<function=call_dictionary({"word": "suburbs"})</function>`
This happens despite:

- The tool functions being cleanly defined using `@tool` decorators
- Well-structured prompts that explicitly direct the model to use tools when needed
- Confirmed success of the same codebase and logic at other times
A sample system prompt looks like this:

> Be concise, reply professionally. Use the tool `call_dictionary` when asked to define a word. Use `retrieve_relevant_documentation` for queries out of scope. Never respond with the tool call string; only invoke tools directly and wait for the result. Never fabricate an answer. Always begin with RAG if unsure.
I've verified that:
- The code path is not skipping function execution
- There are no exceptions thrown during successful runs
- This issue does not seem to stem from the tool logic itself
It feels like either:

- The model isn't following the system prompt consistently, or
- Something is flaky in the tool-call orchestration layer, whether in `pydantic_ai` or the Groq integration
Questions / Help Needed

- Is this a known limitation when using Groq models through `pydantic_ai`?
- Are there internal retry mechanisms or validation steps I can hook into?
- Are tool calls non-deterministic across inference providers?
- How can I debug cases where the model outputs a tool-call string instead of invoking the function? (A debug harness I plan to use is sketched below.)
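For the last question, this is the kind of minimal debug harness I plan to run when the failure shows up, based on my reading of the pydantic_ai docs. The `capture_run_messages` helper and the exact result attributes are taken from the documentation and may differ slightly on 0.1.10:

```python
from pydantic_ai import Agent, capture_run_messages

# A bare agent is enough to show the pattern; in practice this is the
# tool-equipped agent from the repo.
agent = Agent("groq:llama-3.3-70b-versatile")


def debug_run(prompt: str) -> None:
    """Run the agent once and dump every raw message exchanged with the model."""
    with capture_run_messages() as messages:
        try:
            result = agent.run_sync(prompt)
            print("final output:", result.data)  # newer releases expose result.output
        finally:
            # Check whether the model produced a structured tool-call part or
            # only a literal "<function=...>" string inside a plain text part.
            for message in messages:
                print(message)


if __name__ == "__main__":
    debug_run("Define the word 'suburbs'.")
```

If there is also a validator hook where raising `ModelRetry` on a response containing a literal `<function=` string would force the model to try again, a pointer to it would be very helpful.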
I'm not sure if this is a problem with the model, the inference provider, `pydantic_ai`, or my orchestration, but I'd greatly appreciate any pointers or support.
Yes, I used AI to help write this up; I'm too frustrated right now to write a good issue on my own, so I apologize. I will provide any and all information required for diagnosis if it means getting my code working.
Additional Context
`pydantic_ai` version: 0.1.10
Python version: 3.13