
Track admission-control invariant for serialized TextModel-direct routes #514

Draft
waybarrios wants to merge 1 commit into main from track-textmodel-direct-admission-invariant

@waybarrios commented May 7, 2026

Closes #495.

Summary

Pins the admission-control invariant for serialized TextModel-direct
generation paths so any future change reintroducing the failure mode in
issue #495 breaks the test suite. Documents SimpleEngine._generation_lock
as Metal-serialization-only with a #495 reference, and adds a small
regression test module that enforces the invariant statically.

Background

Issue #495 documents a P0 hit by a downstream operator: text-only MLLM
requests bypassed the MLLM scheduler and entered a serialized
TextModel-direct generation path guarded by a single asyncio.Lock
with a 120-second wait bound. Concurrent agent traffic piled up behind
that lock instead of receiving a fast retryable admission result.

Current main does not have the bug. Text-only MLLM requests route
through MLLMScheduler, and SimpleEngine._generation_lock exists
only to serialize Metal command-buffer access. The risk the issue
flags is regression: a future change repurposing the lock as a
wait-mode admission gate, or adding a TextModel-direct route without
fail-fast admission.

The acceptance criteria from #495:

  • A concurrent request behind an occupied serialized TextModel-direct
    route must fail fast rather than waiting for minutes.
  • The error must be machine-readable, e.g. text_generation_busy,
    with HTTP 503.
  • Tests must prove no long-waiter pileup occurs.
  • Comments must not imply the serialized route proves scheduler
    batching or queue absorption.
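
As a rough illustration of the first two criteria, a fail-fast gate might look like the sketch below. Nothing here exists in vllm_mlx today: `TextGenerationBusy`, `SerializedRoute`, and `generate` are hypothetical names, and the sleep stands in for real generation work.

```python
import asyncio


class TextGenerationBusy(Exception):
    """Hypothetical retryable error; not part of vllm_mlx today."""

    status_code = 503
    code = "text_generation_busy"


class SerializedRoute:
    """Hypothetical serialized route that rejects instead of queueing."""

    def __init__(self) -> None:
        self._busy = asyncio.Lock()

    async def generate(self, prompt: str) -> str:
        # Fail fast: never await the lock while another request holds it.
        if self._busy.locked():
            raise TextGenerationBusy("serialized text route is occupied")
        async with self._busy:
            await asyncio.sleep(0.05)  # stand-in for real generation work
            return prompt.upper()


async def demo() -> list:
    route = SerializedRoute()
    # Two concurrent requests: one runs, the other is rejected immediately.
    return await asyncio.gather(
        route.generate("a"), route.generate("b"), return_exceptions=True
    )


results = asyncio.run(demo())
```

Because there is no `await` between the `locked()` check and the lock acquisition, two tasks on the same event loop cannot both pass the check. The second caller gets an immediate, machine-readable rejection instead of joining a wait queue.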

Changes

vllm_mlx/engine/simple.py — extends the existing comment on
_generation_lock to spell out the invariant and reference #495 so
that future readers see the admission-control contract before
repurposing the lock:

# Lock to serialize MLX operations (prevents Metal command buffer conflicts).
# This lock guards Metal command-buffer access only; it is NOT a
# request-admission gate. Issue #495 asks that any future serialized
# TextModel-direct route must implement fail-fast admission (retryable
# 503 with `text_generation_busy`) instead of repurposing this lock as
# a wait-mode admission queue, since long waiters cause request pileup
# under agent traffic.
self._generation_lock = asyncio.Lock()
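
Since one of the new tests reads this comment back, a check of that kind could be sketched as follows. This is an assumption about the approach, not the test's actual code: `comment_block_above` is a hypothetical helper, and `SOURCE` is an inline copy of the snippet rather than the real simple.py.

```python
SOURCE = '''\
        # Lock to serialize MLX operations (prevents Metal command buffer conflicts).
        # This lock guards Metal command-buffer access only; it is NOT a
        # request-admission gate. Issue #495 asks that any future serialized
        # TextModel-direct route must implement fail-fast admission (retryable
        # 503 with `text_generation_busy`) instead of repurposing this lock as
        # a wait-mode admission queue, since long waiters cause request pileup
        # under agent traffic.
        self._generation_lock = asyncio.Lock()
'''


def comment_block_above(source: str, needle: str) -> str:
    """Return the contiguous run of comment lines directly above `needle`."""
    lines = source.splitlines()
    idx = next(i for i, line in enumerate(lines) if needle in line)
    block = []
    # Walk upward until the first line that is not a comment.
    for line in reversed(lines[:idx]):
        stripped = line.strip()
        if not stripped.startswith("#"):
            break
        block.append(stripped)
    return "\n".join(reversed(block))


comment = comment_block_above(SOURCE, "self._generation_lock = asyncio.Lock()")
```

A test built on this helper would then assert that both `"Metal"` and `"#495"` appear in `comment`.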

tests/test_textmodel_direct_admission_invariant.py — new file with
5 cases:

  • test_no_textmodel_direct_class_exists — scans vllm_mlx/ for any
    TextModelDirect-style identifier.
  • test_generation_lock_is_documented_as_metal_only — reads the
    comment block above _generation_lock and asserts it mentions
    Metal/command-buffer and references #495.
  • test_no_long_wait_for_in_simple_engine_text_paths — AST-walks
    simple.py and flags any asyncio.wait_for(..., timeout=T) with
    T >= 5 seconds. The original P0 used timeout=120.
  • test_simple_engine_does_not_expose_admission_queue_attribute —
    inspects SimpleEngine source for forbidden attribute names like
    _text_admission_queue, _text_direct_lock, etc.
  • test_text_generation_busy_error_if_present_is_machine_readable —
    walks the package for any text_generation_busy-named symbol. If
    none exists (today's state), the test skips; the moment someone
    adds one, the test asserts HTTP 503 status and a machine-readable
    code, locking in the contract from the issue.
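
The AST-walk case can be approximated with the standard `ast` module. The sketch below shows the general technique and is not the code in the test file: `long_wait_for_sites`, `MAX_TIMEOUT_S`, and `SAMPLE` are hypothetical, and it only inspects literal `timeout=` keyword arguments (a positional or computed timeout would slip through).

```python
import ast

SAMPLE = '''
import asyncio

async def bad(lock):
    await asyncio.wait_for(lock.acquire(), timeout=120)

async def ok(lock):
    await asyncio.wait_for(lock.acquire(), timeout=0.5)
'''

MAX_TIMEOUT_S = 5  # threshold taken from the PR description


def long_wait_for_sites(source: str) -> list:
    """List wait_for calls whose literal timeout keyword is >= MAX_TIMEOUT_S."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both `asyncio.wait_for(...)` and a bare `wait_for(...)`.
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name != "wait_for":
            continue
        for kw in node.keywords:
            value = kw.value
            if (
                kw.arg == "timeout"
                and isinstance(value, ast.Constant)
                and isinstance(value.value, (int, float))
                and not isinstance(value.value, bool)
                and value.value >= MAX_TIMEOUT_S
            ):
                sites.append(f"line {node.lineno}: wait_for(..., timeout={value.value})")
    return sites


sites = long_wait_for_sites(SAMPLE)
```

On the sample above only the 120-second call is flagged; the 0.5-second call passes.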

Verification

The invariants the tests check, exercised against the live code:

======================================================================
1) No `TextModelDirect` identifier in the package
======================================================================
matches : []

======================================================================
2) `_generation_lock` comment is Metal-only and references #495
======================================================================
        # Lock to serialize MLX operations (prevents Metal command buffer conflicts).
        # This lock guards Metal command-buffer access only; it is NOT a
        # request-admission gate. Issue #495 asks that any future serialized
        # TextModel-direct route must implement fail-fast admission (retryable
        # 503 with `text_generation_busy`) instead of repurposing this lock as
        # a wait-mode admission queue, since long waiters cause request pileup
        # under agent traffic.
        self._generation_lock = asyncio.Lock()

======================================================================
3) No `asyncio.wait_for(..., timeout >= 5s)` in simple.py
======================================================================
long-timeout wait_for sites in simple.py: []

======================================================================
4) `SimpleEngine` does not expose an admission-queue attribute
======================================================================
forbidden attributes present in SimpleEngine: []

======================================================================
5) `text_generation_busy` future-proof check
======================================================================
text_generation_busy symbols defined: []

Test run (clean state):

============================= test session starts ==============================
collected 5 items

tests/test_textmodel_direct_admission_invariant.py::test_no_textmodel_direct_class_exists                          PASSED [ 20%]
tests/test_textmodel_direct_admission_invariant.py::test_generation_lock_is_documented_as_metal_only               PASSED [ 40%]
tests/test_textmodel_direct_admission_invariant.py::test_no_long_wait_for_in_simple_engine_text_paths              PASSED [ 60%]
tests/test_textmodel_direct_admission_invariant.py::test_simple_engine_does_not_expose_admission_queue_attribute   PASSED [ 80%]
tests/test_textmodel_direct_admission_invariant.py::test_text_generation_busy_error_if_present_is_machine_readable SKIPPED [100%]

========================= 4 passed, 1 skipped in 5.10s =========================

Demo: tests catch the exact regression from #495

To prove the tests aren't vacuous, I temporarily injected the precise
shape of the bug from the issue into simple.py:

# Lock to serialize MLX operations (prevents Metal command buffer conflicts).
# Also guards admission for the TextModelDirect path (waits up to 120s).
self._generation_lock = asyncio.Lock()
self._text_direct_lock = asyncio.Lock()  # admission queue for TextModelDirect

async def _serialized_textmodel_direct(self):
    # Wait up to 120s for the TextModelDirect admission slot.
    await asyncio.wait_for(self._text_direct_lock.acquire(), timeout=120)

All four tripwires fired at once with actionable messages:

FAILED test_no_textmodel_direct_class_exists
  Found TextModelDirect-style identifier in upstream code.
  Per issue #495 this route must not be revived without fail-fast admission.
  Matches: ['vllm_mlx/engine/simple.py: TextModelDirect', ...]

FAILED test_generation_lock_is_documented_as_metal_only
  _generation_lock comment must reference issue #495 so future readers
  see the admission-control invariant before repurposing the lock.
  assert '#495' in '# Lock to serialize MLX operations (...).
   # Also guards admission for the TextModelDirect path (waits up to 120s).'

FAILED test_no_long_wait_for_in_simple_engine_text_paths
  Found asyncio.wait_for with a long timeout in vllm_mlx/engine/simple.py.
  Issue #495 asks that any serialized TextModel-direct route fail fast
  rather than wait. Offending sites: ['line 181: wait_for(..., timeout=120)']

FAILED test_simple_engine_does_not_expose_admission_queue_attribute
  SimpleEngine declares attribute(s) that look like a serialized
  TextModel-direct admission queue, which issue #495 forbids without
  fail-fast admission semantics: ['_text_direct_lock']

========================= 4 failed, 1 skipped in 5.29s =========================

After restoring the original file all four tests go back to PASSED.
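
For reference, the attribute tripwire that caught `_text_direct_lock` in the demo could be approximated along these lines. This is a sketch under assumptions: `forbidden_self_attributes` is a hypothetical helper, `FORBIDDEN` holds only the two names mentioned in this PR text (the real list may be longer), and `SAMPLE` is a stand-in for the engine source.

```python
import ast

# Partial list: only the names spelled out in the PR description.
FORBIDDEN = {"_text_admission_queue", "_text_direct_lock"}

SAMPLE = '''
class SimpleEngine:
    def __init__(self):
        self._generation_lock = object()
        self._text_direct_lock = object()
'''


def forbidden_self_attributes(source: str, class_name: str) -> list:
    """Return forbidden `self.<name>` attributes referenced inside a class."""
    found = set()
    for cls in ast.walk(ast.parse(source)):
        if not (isinstance(cls, ast.ClassDef) and cls.name == class_name):
            continue
        for node in ast.walk(cls):
            if (
                isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == "self"
                and node.attr in FORBIDDEN
            ):
                found.add(node.attr)
    return sorted(found)


present = forbidden_self_attributes(SAMPLE, "SimpleEngine")
```

Here `present` contains `_text_direct_lock` while `_generation_lock` is left alone, mirroring the failure message shown in the demo output.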

Note on scope

This PR is test- and comment-only. No production logic changes. The
invariant is enforced by (a) the comment on _generation_lock that
references #495, and (b) the static checks above. If a future PR
needs to reintroduce a serialized TextModel-direct route, the right
place to do so is alongside an update to these tests that reflects
the new fail-fast contract.

Issue #495 documents a downstream P0 where text-only MLLM requests
bypassed the MLLM scheduler and entered a serialized TextModel-direct
generation path with a 120s wait bound, causing request pileup.

Current main does not have the bug. This pins the invariant so any
future change reintroducing the failure mode breaks the suite.

- Document SimpleEngine._generation_lock as Metal-serialization only,
  with a #495 reference, so future readers see the admission-control
  contract before repurposing the lock.
- Add tests/test_textmodel_direct_admission_invariant.py with five
  static tripwires: no TextModelDirect identifiers, lock comment
  mentions Metal + #495, no asyncio.wait_for(..., timeout>=5s) in
  simple.py, no admission-queue attribute on SimpleEngine, and a
  future-proof shape check that fires the moment a
  text_generation_busy error is added.

All four active tests pass on current main; the fifth skips by
design until the corresponding error class lands.

The tripwires were demoed against an inline-injected reproduction of
the original bug; all four fail with actionable messages, then pass
again once the regression is removed.

@waybarrios (Owner, Author)

Hey @Thump604, quick context on what landed here in case it helps when you take a look. The PR is intentionally scoped as test plus comment only; no runtime logic was touched.

On the source side, the only edit is the comment block above self._generation_lock = asyncio.Lock() in vllm_mlx/engine/simple.py. It went from a one-liner to a small block that explicitly says the lock guards Metal command-buffer access only, not request admission, and references issue #495 by number. That comment is part of the contract here, not just docs, because one of the new tests reads it back and asserts both the Metal mention and the #495 reference are present.

The new file is tests/test_textmodel_direct_admission_invariant.py, which adds 5 static tripwires. They check that no TextModelDirect-style identifier lives anywhere in the package, that the lock comment stays documented as Metal-only with the #495 reference, that no asyncio.wait_for(..., timeout>=5s) appears in simple.py, that SimpleEngine carries no admission-queue style attribute, and that if a text_generation_busy symbol ever lands it must surface as HTTP 503 with a machine-readable code. That last one skips today by design since no such symbol exists yet.

The reason for going test-only is that upstream/main doesn't have the bug from #495. The downstream P0 was a serialized TextModel-direct route guarded by a single async lock with a 120s wait bound, and current main routes text-only MLLM through MLLMScheduler, so there's no runtime fix to make here. The goal of the PR is to pin the invariant in negative form, so a future change that quietly reintroduces the failure mode breaks the suite with actionable messages instead of shipping silently.

Happy to grow the scope if you'd rather see the functional half of #495 land in this PR too, meaning a real text_generation_busy 503 exception class plus the /v1/status admission telemetry the issue spells out. Otherwise switching Closes #495 to Refs #495 keeps this PR as the preventive half and leaves the issue open until that infra lands.

@waybarrios marked this pull request as draft May 9, 2026 05:33