
Expose effective MLLM MTP draft stats #473

Open

Thump604 wants to merge 3 commits into waybarrios:main from Thump604:fix/issue-471-mllm-mtp-observability

Conversation

@Thump604
Collaborator

Summary

This makes the MLLM MTP path observable and less misleading for #471.

The current MLLM path can inject MTP weights and print that MTP is enabled with the requested draft count, while the actual generator path is constrained to one effective draft token per verify step and bypasses MTP for prefill, concurrent batches, non-greedy sampling, and active logits processors. That makes a --mtp-num-draft-tokens 5 run look enabled even when it cannot provide the expected speedup.

Changes:

  • rename the startup line from draft_tokens to requested_draft_tokens
  • print the current MLLM effective-draft limitation at startup
  • expose /v1/status.mtp for MLLM with requested/effective draft count, attempts, accepted, rejected, errors, acceptance rate, and bypass reasons (an example payload shape follows this list)
  • add regression assertions around MLLM MTP stats and bypass behavior
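
For illustration, here is a hypothetical shape of the new mtp section of /v1/status (values are made up; key names follow the fields listed above, and at this stage the payload still carries the static bypass_reasons map rather than counters):

"mtp": {
    "enabled": true,
    "requested_draft_tokens": 5,
    "effective_draft_tokens": 1,
    "attempted": 120,
    "accepted": 96,
    "rejected": 24,
    "errors": 0,
    "acceptance_rate": 0.8,
    "bypass_reasons": {...},
}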

This PR does not claim to make MLLM MTP faster. It makes the runtime honest enough to measure whether the path is doing useful speculative work before a follow-up performance change.

Refs #471.

Validation

  • python -m py_compile vllm_mlx/mllm_batch_generator.py vllm_mlx/mllm_scheduler.py vllm_mlx/engine/batched.py vllm_mlx/server.py vllm_mlx/cli.py tests/test_mllm_continuous_batching.py
  • git diff --check

I did not run the MLX-importing unit test file locally because a resident qualification model is active on this machine. CI should exercise the added test assertions.

Collaborator

@janhilgard janhilgard left a comment

Clean, small observability addition. A few notes:

bypass_reasons is static documentation, not runtime telemetry

bypass_reasons is a hardcoded dict of all possible bypass conditions — it's the same regardless of which ones are actually firing. This is useful as inline documentation, but for diagnosing #471-style problems it would be far more actionable as dynamic counters:

"bypass_counts": {
    "prefill": 0,
    "no_active_batch": 0,
    "concurrent_batch": 0,
    "non_greedy_sampling": 142,  # <-- this tells you WHY attempted=0
    "logits_processors": 0,
}

Without this, an operator sees attempted=0 and still has to read source to figure out which bypass is hitting. Not a blocker, but would make this much more useful.

Note on #471 root cause

Separate from this PR, but worth mentioning since it references #471: the benchmark in that issue uses 4-bit quantized MTP weights (scripts/add_mtp_weights_qwen35.py with default quantization → 314 MB). From production experience, quantized MTP weights give 0% acceptance rate — the quantization error is baked into the draft logits, making every draft token wrong.

The fix is the --no-quantize flag, which extracts native BF16 weights from the original HF model (e.g., Qwen/Qwen3.6-27B). See vllm-project/vllm#36331 for background. With BF16 MTP weights, acceptance rates of 78-85% are expected on Qwen3.5/3.6 models.

Minor

  • server.py main() also has an MTP startup log (print(f"MTP: enabled, ...")) at ~line 5412 — it wasn't updated to match the cli.py rename from draft_tokens to requested_draft_tokens. Worth keeping them consistent.

CI 9/9 green. No merge conflicts.

@Thump604
Collaborator Author

Thanks for the review.

bypass_reasons as dynamic counters -- agreed, that's more useful. I'll change bypass_reasons from a static dict to runtime counters that increment per bypass condition. Operators will see which bypass is actually firing without reading source.

server.py startup log consistency -- will update the server.py main() MTP log to use requested_draft_tokens to match cli.py.

#471 quantized MTP weights -- that's a good catch. Worth commenting on #471 directly so the reporter knows BF16 MTP weights are needed for real acceptance rates. I'll add a note there.

black lint failure -- mllm_batch_generator.py needs a formatting fix. Will push that with the bypass_counts and startup log fixes.

Owner

@waybarrios waybarrios left a comment

@Thump604 thanks for the ping. Please address the points below before this is ready to merge.

In vllm_mlx/server.py the /v1/status handler now looks like this:

"cache": stats.get("memory_aware_cache")
or stats.get("paged_cache")
or stats.get("prefix_cache"),
"mtp": stats.get("mtp"),
"requests": stats.get("requests", []),

When MTP isn't active, stats.get("mtp") returns None and the endpoint emits "mtp": null. Any external client that does response["mtp"]["enabled"] will hit a TypeError. Other fields like cache already use the or pattern to collapse to something non-null; this one was left dangling. Please either return a sentinel dict like {"enabled": False} or omit the key with if stats.get("mtp") is not None.
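
A minimal sketch of the sentinel option, reusing the snippet above (illustrative only; either approach from the previous paragraph works):

"cache": stats.get("memory_aware_cache")
or stats.get("paged_cache")
or stats.get("prefix_cache"),
# collapse to a stable dict instead of leaking null
"mtp": stats.get("mtp") or {"enabled": False},
"requests": stats.get("requests", []),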

In vllm_mlx/mllm_batch_generator.py _get_mtp_stats computes the acceptance rate by reading two fields from the shared dict without a lock:

def _get_mtp_stats() -> Dict[str, Any]:
    verified = _mtp_stats["accepted"] + _mtp_stats["rejected"]
    acceptance_rate = _mtp_stats["accepted"] / verified if verified > 0 else 0.0

The engine thread is incrementing accepted/rejected while the HTTP handler reads. Under the GIL each individual read and write stays consistent, but the handler reads the fields more than once and never gets an atomic snapshot, so every once in a while you'll see acceptance_rate > 1.0 or transient under-estimates. Cosmetic today, but the moment this moves to a worker in another process it stops being cosmetic. Please snapshot both reads under a lock, or at least pull them into local variables before the computation.
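
A sketch of the snapshot-under-lock variant (assuming a module-level threading.Lock that the increment sites also take; names are illustrative):

def _get_mtp_stats() -> Dict[str, Any]:
    with _mtp_stats_lock:  # same lock the engine thread holds while incrementing
        accepted = _mtp_stats["accepted"]
        rejected = _mtp_stats["rejected"]
    verified = accepted + rejected
    acceptance_rate = accepted / verified if verified > 0 else 0.0
    ...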

Also, bypass_reasons ended up as a static dict with hardcoded strings:

"bypass_reasons": {
    "prefill": "input_tokens.shape[1] > 1",
    "no_active_batch": "active_batch is None",
    "concurrent_batch": "len(active_batch) > 1",
    "non_greedy_sampling": "temperature/top_p/top_k/min_p not greedy",
    "logits_processors": "request-local logits processors active",
},

That documents the guards but doesn't stay in sync with the actual code. The day someone relaxes the concurrent_batch guard (the comment inside _mtp_step already anticipates that's possible once a sampler-aware verifier exists), this dict will silently lie. Please turn the keys into real counters (increment them each time the bypass fires) and move the descriptions to a comment next to the guards, not in the payload.
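
A sketch of the counter variant (the keys mirror the static dict above; the prefill guard condition is taken from it, and exactly where the increments live inside _mtp_step is an assumption):

_mtp_stats["bypass_counts"] = {
    "prefill": 0,
    "no_active_batch": 0,
    "concurrent_batch": 0,
    "non_greedy_sampling": 0,
    "logits_processors": 0,
}

def _record_bypass(reason: str) -> None:
    with _mtp_stats_lock:
        _mtp_stats["bypass_counts"][reason] += 1

# at each guard site, e.g. the prefill check:
if input_tokens.shape[1] > 1:  # prefill step, MTP is skipped
    _record_bypass("prefill")
    # ...fall through to the non-MTP path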

One more thing to note (probably out of scope for this PR, fine as a follow-up): observability only covers the MLLM path. The _install_mtp for the non-MLLM path in vllm_mlx/scheduler.py keeps its own _mtp_stats that is never surfaced, so running with --enable-mtp without --mllm returns "mtp": null even though MTP is doing work. Worth opening a follow-up ticket for parity.

Owner

@waybarrios waybarrios left a comment

Check the comments

@Thump604
Collaborator Author

Pushed 7470b12 addressing the requested changes.

Covered:

  • /v1/status now returns a stable mtp object when MTP is inactive: {"enabled": false} instead of null.
  • MLLM MTP stats now use a lock and snapshot counters before computing acceptance_rate.
  • Static bypass_reasons was replaced with dynamic bypass_counts.
  • The bypass predicates are documented next to the guards instead of duplicated in the payload.
  • Added regression coverage for all current bypass counters: prefill, no_active_batch, concurrent_batch, non_greedy_sampling, and logits_processors.
  • Added a status endpoint regression for the inactive-MTP payload shape (sketched below).
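
A hypothetical pytest-style sketch of that regression (the real test name, fixture, and client wiring in the test suite will differ):

def test_status_mtp_inactive_shape(client):
    # with MTP disabled, /v1/status must return a stable dict, never null
    body = client.get("/v1/status").json()
    assert body["mtp"] == {"enabled": False}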

Validation:

  • py_compile: mllm_batch_generator.py, mllm_scheduler.py, engine/batched.py, server.py, cli.py, tests/test_mllm_continuous_batching.py, tests/test_lifecycle_server.py
  • focused tests: TestMLLMBatchGeneratorMTPGuards plus the new status-shape test, 9 passed
  • git diff --check: clean
  • ruff check vllm_mlx/ tests/ --select E,F,W --ignore E402,E501,E731,F811,F841: clean
  • black --check vllm_mlx/ tests/: clean

Notes:

@Thump604 Thump604 requested a review from waybarrios April 30, 2026 00:59
Collaborator

@janhilgard janhilgard left a comment

Clean observability PR. The stats design is well thought out:

  • attempted counts actual MTP forward calls (not bypassed steps) — semantics are correct: attempted == accepted + rejected + errors (see the sketch after this list)
  • Bypass counts track each reason independently (a single step can trigger multiple), which is the right model for understanding frequency of each condition
  • acceptance_rate correctly excludes errors from the denominator
  • Thread-safe with _mtp_stats_lock for concurrent reads from /v1/status
  • effective_draft_tokens: 1 is honest about the current MLLM limitation
  • Startup message transparently tells the operator what to expect
  • Default {"enabled": False} keeps the /v1/status shape consistent when MTP is off

Tests cover all bypass categories (prefill, no_active_batch, concurrent_batch, non_greedy_sampling, logits_processors) and the happy path (attempted/accepted/acceptance_rate). The structural bypass test exercising three reasons in sequence is particularly thorough.

Two minor observations (non-blocking):

  1. Bypass counts can overlap — a step that hits both no_active_batch and prefill increments both counters. This is the right behavior for per-condition frequency analysis, but worth noting that sum(bypass_counts.values()) may exceed actual bypass count. A brief docstring mention could help future readers.

  2. 3 commits — repo convention is single squash commit per PR, but that's your call as maintainer.

LGTM.
