Skip to content

docs: clarify MLLM MTP guard semantics#491

Open
Thump604 wants to merge 1 commit intowaybarrios:mainfrom
Thump604:fix/issue-489-mtp-review-notes
Open

docs: clarify MLLM MTP guard semantics#491
Thump604 wants to merge 1 commit intowaybarrios:mainfrom
Thump604:fix/issue-489-mtp-review-notes

Conversation

@Thump604
Copy link
Copy Markdown
Collaborator

@Thump604 Thump604 commented May 2, 2026

Closes #489.

This is a narrow follow-up for the PR #397 review notes. I am not changing MLLM MTP behavior in this PR; I am making the current safe envelope explicit and tightening regression coverage around the existing guard.

Changes:

  • document that batched MLLM MTP is currently a conservative greedy-only optimization
  • document that request-local processors stay active by default and retirement-to-MTP resume remains explicit opt-in via VLLM_MLX_ENABLE_THINKING_RETIREMENT_RESUME=1
  • document MLLM prefill_step_size fallback behavior and worker stream rebinding scope
  • expand the MLLM MTP guard test so temperature, top_p, top_k, and min_p each independently fall back to the normal scheduler path
  • clarify bind_generation_streams() as a worker-entry thread-ownership guard, not a token-loop optimization

Validation:

  • AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_mllm_continuous_batching.py -q -k 'MTPGuards or BatchedMLLMConfigWiring'
  • AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py tests/test_engine_core_thread_streams.py -q -k 'stream or serialized or bind_generation_streams or prelock'
  • AI_RUNTIME_BYPASS_SAFETY_GATE=1 /opt/ai-runtime/venv-live/bin/python -m compileall -q vllm_mlx/mllm_batch_generator.py vllm_mlx/mlx_streams.py
  • black --target-version py312 --check vllm_mlx/mlx_streams.py tests/test_mllm_continuous_batching.py
  • uvx ruff check vllm_mlx/mlx_streams.py tests/test_mllm_continuous_batching.py --select E,F,W --ignore E402,E501,E731,F811,F841
  • git diff --check

Copy link
Copy Markdown
Collaborator

@janhilgard janhilgard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean docs-and-tests-only PR. No behavioral changes, just documentation and better regression coverage.

Docs — the continuous-batching guide section accurately describes the current MLLM MTP envelope (greedy-only, single-request batch, no logits processors), the opt-in retirement-to-MTP handoff, prefill step size semantics, and stream rebinding scope. The bind_generation_streams() docstring clarification ("ownership fix, not a batching optimization; call at worker-entry boundaries") is valuable given the ongoing #478/#479 worker-thread discussion.

Tests — parametrizing the non-greedy test across all four sampling dimensions (temperature, top_p, top_k, min_p individually) is a solid improvement over the old test that set all four simultaneously. This ensures each guard is independently verified — a regression that only breaks the top_k check would have been invisible before.

Single commit, CI green. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Follow up PR #397 MTP guard and stream-binding review notes

2 participants