docs: clarify MLLM MTP guard semantics by Thump604 · Pull Request #491 · waybarrios/vllm-mlx

Thump604 · 2026-05-02T05:12:43Z

Closes #489.

This is a narrow follow-up for the PR #397 review notes. I am not changing MLLM MTP behavior in this PR; I am making the current safe envelope explicit and tightening regression coverage around the existing guard.

Changes:

document that batched MLLM MTP is currently a conservative greedy-only optimization
document that request-local processors stay active by default and retirement-to-MTP resume remains explicit opt-in via VLLM_MLX_ENABLE_THINKING_RETIREMENT_RESUME=1
document MLLM prefill_step_size fallback behavior and worker stream rebinding scope
expand the MLLM MTP guard test so temperature, top_p, top_k, and min_p each independently fall back to the normal scheduler path
clarify bind_generation_streams() as a worker-entry thread-ownership guard, not a token-loop optimization

Validation:

AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_mllm_continuous_batching.py -q -k 'MTPGuards or BatchedMLLMConfigWiring'
AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py tests/test_engine_core_thread_streams.py -q -k 'stream or serialized or bind_generation_streams or prelock'
AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m compileall -q vllm_mlx/mllm_batch_generator.py vllm_mlx/mlx_streams.py
black --target-version py312 --check vllm_mlx/mlx_streams.py tests/test_mllm_continuous_batching.py
uvx ruff check vllm_mlx/mlx_streams.py tests/test_mllm_continuous_batching.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check

janhilgard

Clean docs-and-tests-only PR. No behavioral changes, just documentation and better regression coverage.

Docs — the continuous-batching guide section accurately describes the current MLLM MTP envelope (greedy-only, single-request batch, no logits processors), the opt-in retirement-to-MTP handoff, prefill step size semantics, and stream rebinding scope. The bind_generation_streams() docstring clarification ("ownership fix, not a batching optimization; call at worker-entry boundaries") is valuable given the ongoing #478/#479 worker-thread discussion.

Tests — parametrizing the non-greedy test across all four sampling dimensions (temperature, top_p, top_k, min_p individually) is a solid improvement over the old test that set all four simultaneously. This ensures each guard is independently verified — a regression that only breaks the top_k check would have been invisible before.

Single commit, CI green. LGTM.

docs: clarify MLLM MTP guard semantics

4bc5dd1

janhilgard approved these changes May 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: clarify MLLM MTP guard semantics#491

docs: clarify MLLM MTP guard semantics#491
Thump604 wants to merge 1 commit intowaybarrios:mainfrom
Thump604:fix/issue-489-mtp-review-notes

Thump604 commented May 2, 2026

Uh oh!

janhilgard left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Thump604 commented May 2, 2026

Uh oh!

janhilgard left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants