docs: clarify MLLM MTP guard semantics #491
janhilgard left a comment:
Clean docs-and-tests-only PR. No behavioral changes, just documentation and better regression coverage.
Docs — the continuous-batching guide section accurately describes the current MLLM MTP envelope (greedy-only, single-request batch, no logits processors), the opt-in retirement-to-MTP handoff, prefill step size semantics, and stream rebinding scope. The bind_generation_streams() docstring clarification ("ownership fix, not a batching optimization; call at worker-entry boundaries") is valuable given the ongoing #478/#479 worker-thread discussion.
Tests — parametrizing the non-greedy test across all four sampling dimensions (temperature, top_p, top_k, min_p individually) is a solid improvement over the old test that set all four simultaneously. This ensures each guard is independently verified — a regression that only breaks the top_k check would have been invisible before.
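The one-dimension-at-a-time idea can be sketched like this. All names here (`SamplingParams` fields, `supports_mtp`) are illustrative assumptions, not the real guard; the point is that each case flips exactly one sampling knob, so a regression in any single check is caught independently.

```python
from dataclasses import dataclass

@dataclass
class SamplingParams:
    # Hypothetical defaults representing the greedy-only MTP envelope.
    temperature: float = 0.0
    top_p: float = 1.0
    top_k: int = -1
    min_p: float = 0.0

def supports_mtp(p: SamplingParams) -> bool:
    # Stand-in guard: any non-default sampling knob forces the fallback path.
    return (p.temperature == 0.0 and p.top_p == 1.0
            and p.top_k == -1 and p.min_p == 0.0)

# One case per dimension, mirroring the parametrized test: only a single
# knob deviates from greedy in each case.
non_greedy_cases = [
    SamplingParams(temperature=0.7),
    SamplingParams(top_p=0.9),
    SamplingParams(top_k=40),
    SamplingParams(min_p=0.05),
]
for case in non_greedy_cases:
    assert not supports_mtp(case)  # each dimension independently trips the guard
assert supports_mtp(SamplingParams())  # pure greedy stays MTP-eligible
```

Had the old test set all four knobs at once, a guard that ignored, say, `top_k` would still pass whenever any of the other three checks fired.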
Single commit, CI green. LGTM.
Closes #489.
This is a narrow follow-up to the PR #397 review notes. It does not change MLLM MTP behavior; it makes the current safe envelope explicit and tightens regression coverage around the existing guard.
Changes:
- Document the `VLLM_MLX_ENABLE_THINKING_RETIREMENT_RESUME=1` opt-in, the `prefill_step_size` fallback behavior, and worker stream rebinding scope
- Parametrize the non-greedy regression test so that `temperature`, `top_p`, `top_k`, and `min_p` are each independently verified to fall back to the normal scheduler path
- Clarify the `bind_generation_streams()` docstring: it is a worker-entry thread-ownership guard, not a token-loop optimization

Validation:
```
AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_mllm_continuous_batching.py -q -k 'MTPGuards or BatchedMLLMConfigWiring'
AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py tests/test_engine_core_thread_streams.py -q -k 'stream or serialized or bind_generation_streams or prelock'
AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m compileall -q vllm_mlx/mllm_batch_generator.py vllm_mlx/mlx_streams.py
black --target-version py312 --check vllm_mlx/mlx_streams.py tests/test_mllm_continuous_batching.py
uvx ruff check vllm_mlx/mlx_streams.py tests/test_mllm_continuous_batching.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
```