fix: overhaul GLM-4.7-Flash streaming tool calls and add GLM4 reasoning parser#246
b2ornot2b wants to merge 4 commits into waybarrios:main
Conversation
The function was missing a return statement after successfully loading the model with mlx_lm.load(), causing None to be returned and unpacking to fail with TypeError.
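Grounded in the description above, a hypothetical sketch of the bug shape (the function and loader names here are illustrative stand-ins, not the actual mlx_lm wrapper):

```python
# Sketch of the bug: the success path lacked a return, so the caller's
# `model, tokenizer = load_model(...)` unpacked None and raised TypeError.
def load_model_buggy(loader):
    try:
        model, tokenizer = loader()
        # BUG: no return here -> function falls through and returns None
    except ValueError:
        return None, None  # fallback path for non-standard tokenizers

def load_model_fixed(loader):
    try:
        model, tokenizer = loader()
        return model, tokenizer  # early return on success
    except ValueError:
        return None, None  # fallback path for non-standard tokenizers

result_buggy = load_model_buggy(lambda: ("model", "tok"))
result_fixed = load_model_fixed(lambda: ("model", "tok"))
```

With the return in place, the `except ValueError` fallback still runs on failure; only the success path changes.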
Thump604 left a comment
Thanks for fixing GLM-4.7 streaming tool calls. The state-machine approach is the right call, and the server.py restructuring addresses a real problem (tool parsing unreachable when reasoning parser is active). A few things to address before this is ready:
- Dead code in `glm47_tool_parser.py`: After the `return None` at line 705 (end of the while loop), there are ~200 lines of unreachable code (a second state-machine implementation). This includes a recursive call and references to `self.current_arguments`, which is never initialized. Looks like an earlier draft that was left in. Please remove it.
- `server.py` tool parsing input: The tool parser receives `delta_text` as input (line ~2062), but when a reasoning parser is active, `delta_text` still contains raw text with `<think>`/`</think>` tags. The tool parser should receive the `content` portion (after reasoning extraction), not the raw delta. Otherwise the tool parser may misparse reasoning tags.
- `__all__` in `reasoning/__init__.py`: `Qwen3ReasoningParser` and `DeepSeekR1ReasoningParser` are added to `__all__` but not imported at module level. This will break `from vllm_mlx.reasoning import Qwen3ReasoningParser`. Either add the imports or remove them from `__all__`.
- `GLM4ReasoningParser` is a no-op: The `extract_reasoning` override is identical to the base class (`BaseThinkingReasoningParser`) implementation. If the only purpose is to register under the name `"glm4"`, you could just do `register_parser("glm4", BaseThinkingReasoningParser)` or create a minimal subclass without overriding anything. Not a blocker, but simplifies things.
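The `__all__` point can be demonstrated in isolation. This is a synthetic sketch using a stand-in module object, not vllm_mlx itself:

```python
import types

# A name listed in __all__ but never bound at module level breaks
# `from mod import name`, which resolves via attribute lookup.
broken = types.ModuleType("reasoning_broken")
broken.__all__ = ["Qwen3ReasoningParser"]  # listed, but never imported

try:
    # Equivalent of: from reasoning_broken import Qwen3ReasoningParser
    getattr(broken, "Qwen3ReasoningParser")
    ok = True
except AttributeError:
    ok = False

# Simulating the fix: a top-level import binds the name on the module.
broken.Qwen3ReasoningParser = type("Qwen3ReasoningParser", (), {})
fixed_ok = hasattr(broken, "Qwen3ReasoningParser")
```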
waybarrios left a comment:

@b2ornot2b hey, are you still working on this? There are a few things that need to be addressed before we can merge (dead code after line 705, `__all__` exports, GLM4 parser simplification, etc. — see @Thump604's review above). If you need help or don't have time to finish it, let me know — I'm happy to pick up the remaining changes and get this merged.
Would appreciate it if you could pick up the remaining changes. Am travelling, and will not be able to get back to this or my test setup.
janhilgard left a comment
Agreeing with Thump604's review. The state-machine rewrite of GLM-4.7 streaming is the right approach, and the server.py restructuring (making tool parsing reachable when reasoning parser is active) fixes a real bug. But this PR needs cleanup before merge.
Echoing Thump604's points and adding specifics:
1. Dead code after line 705 is the biggest issue. After `return None` at the end of the while loop, there are ~180 lines of a second, unreachable state-machine implementation. This includes:
   - A recursive `self.extract_tool_calls_streaming(...)` call (line 733) -- recursive state machines are fragile
   - References to `self.current_arguments` (line 728), which is never initialized in `__init__` or `reset()`
   - Duplicate logic for every state that already exists in the while loop above

   This is clearly a leftover draft. Since @b2ornot2b has asked @waybarrios to pick this up, removing it should be straightforward.
2. `__all__` exports are broken. `Qwen3ReasoningParser` and `DeepSeekR1ReasoningParser` are added to `__all__` in `reasoning/__init__.py` but are not imported at module level (they are only imported inside `_register_builtin_parsers()`). This means:

   ```python
   from vllm_mlx.reasoning import Qwen3ReasoningParser  # ImportError
   ```

   Either add top-level imports or remove them from `__all__`.
3. `GLM4ReasoningParser.extract_reasoning` is identical to the base class. The four cases (both tags, only end tag, only start tag, no tags) are exactly what `BaseThinkingReasoningParser` already does. The subclass only needs to define `start_token` and `end_token`. The override can be deleted entirely.
4. `server.py` tool parser receives raw `delta_text`. When a reasoning parser is active, `delta_text` still contains `<think>`/`</think>` markup. The tool parser should operate on the post-reasoning content, not the raw delta. Otherwise a reasoning block containing `<tool_call>` (e.g., the model thinking about tool calling) could trigger a false positive in the tool parser.
5. `tokenizer.py` change is unrelated. The `return model, tokenizer` added after line 54 (before the `except ValueError`) means the fallback path for non-standard tokenizers is now unreachable for the success case. This looks correct (early return on success, fallback on error), but it is a separate fix from the GLM-4.7 streaming work and should ideally be in its own commit/PR for clean history.
6. Branch has merge conflicts with current main.
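On point 3, the minimal subclass could look roughly like this. The base class below is a simplified stand-in written for this sketch, not vllm_mlx's actual `BaseThinkingReasoningParser`:

```python
# Stand-in for BaseThinkingReasoningParser: the subclass only declares its
# tokens; no extract_reasoning override is needed.
class BaseThinkingReasoningParser:
    start_token = "<think>"
    end_token = "</think>"

    def extract_reasoning(self, text):
        """Split text into (reasoning, content) around the think tags."""
        start, end = self.start_token, self.end_token
        if end in text:
            head, _, content = text.partition(end)
            reasoning = head.split(start, 1)[-1] if start in head else head
            return reasoning, content  # both tags, or only end tag
        if start in text:
            return text.split(start, 1)[-1], ""  # only start tag: all reasoning
        return None, text  # no tags: all content

class GLM4ReasoningParser(BaseThinkingReasoningParser):
    # Declaring the tokens is sufficient; the base implementation already
    # covers the four cases (both tags, only end, only start, no tags).
    start_token = "<think>"
    end_token = "</think>"

parser = GLM4ReasoningParser()
reasoning, content = parser.extract_reasoning("<think>plan steps</think>answer")
```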
The core contribution (state-machine streaming parser + server.py restructuring) is valuable. Looking forward to seeing the cleanup.
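The `delta_text` routing concern raised in the reviews (the tool parser should see post-reasoning content, not the raw delta) can be sketched as follows. This is a simplified stand-in, not the PR's code; the toy splitter assumes think tags arrive unsplit across chunks, which the PR's state machine would handle properly:

```python
# Ordering discussed in review: extract reasoning first, then feed only the
# post-reasoning content to the tool parser, so a <think> block mentioning
# <tool_call> can never trigger a false tool-call match.
def split_reasoning(delta, in_think):
    """Simplified splitter: returns (reasoning, content, in_think)."""
    out_r, out_c = "", ""
    while delta:
        if in_think:
            end = delta.find("</think>")
            if end == -1:
                return out_r + delta, out_c, True
            out_r += delta[:end]
            delta = delta[end + len("</think>"):]
            in_think = False
        else:
            start = delta.find("<think>")
            if start == -1:
                return out_r, out_c + delta, False
            out_c += delta[:start]
            delta = delta[start + len("<think>"):]
            in_think = True
    return out_r, out_c, in_think

tool_parser_saw = []
in_think = False
for delta in ["<think>maybe call <tool_call>?", "</think>", "hello <tool", "_call>"]:
    reasoning, content, in_think = split_reasoning(delta, in_think)
    if content:  # only post-reasoning content reaches the tool parser
        tool_parser_saw.append(content)
stream_to_tools = "".join(tool_parser_saw)
```

The `<tool_call>` inside the reasoning block never reaches the tool parser, while the real one after `</think>` does.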
It looks like this was at least partially addressed in other PRs, including PR #295. Is there anything left we need to do with this?
* Add glm4 test from b2ornot2b. Merged tests from #246
* Format changed test files with black.
Based on my review of other PRs, I believe all of this has been taken care of now. I manually copied the two missing changes from this PR to PR #440, which was merged to main just now. The rest was covered in other PRs.
Hi @b2ornot2b -- this GLM-4 PR has been open since early April with review feedback and merge conflicts. Are you still working on it? If you plan to continue, a rebase against current main and addressing the review comments would move it forward. Otherwise we can close it and you're welcome to refile when ready. Will check back in two weeks.
Summary
This PR fixes the broken streaming tool-calling implementation for GLM-4.7-Flash and introduces a dedicated `GLM4ReasoningParser` to handle the model's reasoning/thinking behavior.

The Problem
The previous implementation failed in streaming mode due to:
- `server.py` had a greedy `continue` that skipped tool parsing whenever the reasoning parser returned `None` (e.g., during transitions).
- String matching (e.g., `"</tool_call>" in delta_text`) failed when BPE tokens arrived in separate chunks.

The Solution
- `server.py`: Removed the greedy `continue` and implemented strict fall-through logic. The tool parser is now always evaluated, ensuring transitions from reasoning to tool calling are captured.
- `glm47_tool_parser.py`: Replaced string matching with an index-driven state machine. It tracks `last_parsed_index` and maintains explicit states (`PARSING_NAME`, `PARSING_ARGUMENTS`), allowing it to handle fragmented tokens and stream arguments incrementally.

Key Changes
- `vllm_mlx/server.py`: Modified streaming loop for reliable tool call detection.
- `vllm_mlx/tool_parsers/glm47_tool_parser.py`: Complete rewrite using a robust state machine.
- `vllm_mlx/reasoning/glm4_parser.py`: New parser for GLM-4 reasoning.
- `vllm_mlx/reasoning/__init__.py`: Registered the new parser.
- `tests/test_tool_parsers.py` & `tests/test_reasoning_parser.py`: Added unit tests for the new logic.

Verification
Verified with `mlx-community/GLM-4.7-Flash-bf16`.
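As an illustration of the index-driven approach described under "The Solution": a simplified sketch that buffers deltas, keeps a cursor (`last_parsed_index`), and advances through explicit states. The `<tool_call>NAME|ARGS</tool_call>` wire format and all names here are hypothetical stand-ins, not GLM-4.7's actual format or the PR's code:

```python
# Index-driven streaming parser sketch: token fragmentation cannot break
# matching because we always re-scan the buffer from the cursor onward.
IDLE, PARSING_NAME, PARSING_ARGUMENTS = range(3)

class StreamingToolParser:
    def __init__(self):
        self.buffer = ""
        self.last_parsed_index = 0
        self.state = IDLE
        self.events = []  # ("name", ...) and incremental ("args", ...) events

    def feed(self, delta):
        self.buffer += delta
        while True:
            rest = self.buffer[self.last_parsed_index:]
            if self.state == IDLE:
                start = rest.find("<tool_call>")
                if start == -1:
                    return  # tag may be split across chunks; wait for more
                self.last_parsed_index += start + len("<tool_call>")
                self.state = PARSING_NAME
            elif self.state == PARSING_NAME:
                sep = rest.find("|")
                if sep == -1:
                    return
                self.events.append(("name", rest[:sep]))
                self.last_parsed_index += sep + 1
                self.state = PARSING_ARGUMENTS
            else:  # PARSING_ARGUMENTS: stream args up to the closing tag
                end = rest.find("</tool_call>")
                if end == -1:
                    # hold back the last len(tag)-1 chars in case they begin
                    # the closing tag; emit the rest incrementally
                    safe = len(rest) - (len("</tool_call>") - 1)
                    if safe > 0:
                        self.events.append(("args", rest[:safe]))
                        self.last_parsed_index += safe
                    return
                if end > 0:
                    self.events.append(("args", rest[:end]))
                self.last_parsed_index += end + len("</tool_call>")
                self.state = IDLE

parser = StreamingToolParser()
for chunk in ['<tool', '_call>get_wea', 'ther|{"city": ', '"Paris"}</tool', '_call>']:
    parser.feed(chunk)
name = next(v for k, v in parser.events if k == "name")
args = "".join(v for k, v in parser.events if k == "args")
```

Even with the tags and name split mid-token across chunks, the cursor-based re-scan reassembles the full name and streams the argument JSON incrementally.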