fix(simple): use persistent MLX worker thread to fix thread-local stream crash #478
xykong wants to merge 1 commit into waybarrios:main
Conversation
fix(simple): use persistent MLX worker thread to fix thread-local stream crash

MLX >= 0.31.2 (PR #3348) made CommandEncoders thread-local. The previous approach of spawning transient threads via asyncio.to_thread() and calling bind_generation_streams() fails because:

1. Model loading creates Metal compiled kernels with a stream index (N) bound to the loading thread's TLS
2. Transient worker threads create new stream indices via mx.new_stream() which do NOT include the model's original stream N
3. RuntimeError: There is no Stream(gpu, N) in current thread

Fix: Replace asyncio.to_thread with a persistent MLXWorkerThread that:
- Runs a single daemon thread for the lifetime of the process
- Loads the model AND runs all inference on the same thread
- Guarantees thread-local stream consistency

This matches the approach used by:
- oMLX (jundot/omlx PR #891)
- MLX maintainer recommendations (ml-explore/mlx #3216)

Tested on Apple M4 Max with MLX 0.31.2, Gemma 4 26B.
janhilgard
left a comment
Good problem analysis and clean MLXWorkerThread implementation. The root cause (MLX 0.31.2 thread-local CommandEncoders) is real, and a persistent worker thread is the right approach.
However, this PR introduces a critical streaming regression and has a few additional concerns:
1. Critical: stream_generate and stream_chat text-only fallback lost true streaming
The old code yielded chunks as they were generated (true token-by-token streaming):
```python
async with self._generation_lock:
    for chunk in self._model.stream_generate(...):
        yield GenerationOutput(...)  # client sees each token immediately
```

The new code collects ALL chunks into a list first, then yields them after generation is complete:
```python
def _run_stream_generate():
    results = []
    for chunk in self._model.stream_generate(...):
        results.append(chunk)
    return results

chunks = await self._run_blocking_serialized(_run_stream_generate)
for chunk in chunks:
    yield GenerationOutput(...)  # client sees everything only after full generation
```

This means TTFT = total generation time — for a 500-token response at 100 tok/s, the client waits 5 seconds before seeing any output instead of seeing the first token in ~50ms. This is a major UX regression for any SSE/streaming client.
The fix would be to use an asyncio.Queue or similar mechanism to bridge the synchronous generator on the worker thread with the async generator on the event loop, yielding chunks as they're produced.
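A minimal sketch of that bridge, assuming the `MLXWorkerThread.submit(loop, func, *args, **kwargs)` signature used elsewhere in this PR (it returns an `asyncio.Future`); the `_DONE` sentinel and the `GenerationOutput(...)` placeholder are illustrative:

```python
import asyncio

_DONE = object()  # sentinel marking the end of the worker-side generator


async def stream_generate(self, *args, **kwargs):
    loop = asyncio.get_running_loop()
    chunk_queue: asyncio.Queue = asyncio.Queue()

    def _run_stream_generate():
        # Runs on the persistent MLX worker thread; FIFO submission keeps requests serialized.
        try:
            for chunk in self._model.stream_generate(*args, **kwargs):
                # Hand each chunk to the event loop as soon as it exists.
                loop.call_soon_threadsafe(chunk_queue.put_nowait, chunk)
        finally:
            loop.call_soon_threadsafe(chunk_queue.put_nowait, _DONE)

    worker_future = _mlx_worker.submit(loop, _run_stream_generate)

    while True:
        chunk = await chunk_queue.get()
        if chunk is _DONE:
            break
        yield GenerationOutput(...)  # client sees each token as soon as it is generated

    await worker_future  # re-raise any exception from the worker thread
```

A bounded queue could be added if backpressure matters, but even this unbounded version restores token-level TTFT.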
2. Module-level singleton starts thread at import time
```python
_mlx_worker = MLXWorkerThread()
```

This starts a daemon thread the moment `simple.py` is imported, even if `SimpleEngine` is never instantiated (e.g., when using `BatchedEngine` or `MllmBatchGenerator`). Consider lazy initialization or moving the worker into the `SimpleEngine` instance.
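One lazy-initialization option (a sketch; `get_mlx_worker()` is a hypothetical accessor, not something in this PR):

```python
import threading

_mlx_worker = None
_mlx_worker_lock = threading.Lock()


def get_mlx_worker() -> "MLXWorkerThread":
    """Create the persistent worker thread on first use rather than at import time."""
    global _mlx_worker
    if _mlx_worker is None:
        with _mlx_worker_lock:
            if _mlx_worker is None:  # double-checked so concurrent callers share one worker
                _mlx_worker = MLXWorkerThread()
    return _mlx_worker
```

Call sites in `SimpleEngine` would then use `get_mlx_worker().submit(...)`; alternatively the worker could simply be created in `SimpleEngine.start()`.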
3. Missing generation_stream rebinding on worker thread
The old code called bind_generation_streams() before each inference to rebind mlx_lm.generate.generation_stream and mlx_vlm.generate.generation_stream to the current thread. The new code skips this entirely.
These module-level generation_stream variables are created at import time (main thread). If mlx_lm internally uses mx.stream(generation_stream) to schedule Metal operations, those operations reference a stream whose CommandEncoder lives in the main thread's TLS — which would crash on the worker thread.
This may work in your test because either (a) the default stream is special, or (b) mlx_lm doesn't reference generation_stream in the code path you tested. But it should be verified more broadly, or a single bind_generation_streams() call should be added to the worker thread's init.
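If rebinding does turn out to be necessary, it only needs to happen once, on the worker thread itself. A sketch of what that could look like during startup, assuming `bind_generation_streams()` takes no arguments (as in its previous per-call usage) and using `_load_model()` as a placeholder for the PR's actual loading call:

```python
def _worker_init_and_load():
    # Runs once on the persistent MLX worker thread, before any inference is scheduled.
    bind_generation_streams()  # rebind mlx_lm / mlx_vlm generation_stream to this thread
    return _load_model()       # placeholder for the real model-loading callable

model = await _mlx_worker.submit(loop, _worker_init_and_load)
```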
4. Minor: asyncio.ensure_future on a Future is a no-op
```python
task = asyncio.ensure_future(
    _mlx_worker.submit(loop, func, *args, **kwargs)
)
```

`worker.submit()` already returns an `asyncio.Future`. `ensure_future` on a Future just returns it. The variable name `task` is misleading since it's a Future, not a Task. `asyncio.shield()` works on both, so it's functionally correct but confusing.
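If the shield is the only reason for wrapping, the call site could simply shield the Future it already has (a sketch; the body of `_run_blocking_serialized` here is illustrative, not the PR's exact code):

```python
async def _run_blocking_serialized(self, func, *args, **kwargs):
    loop = asyncio.get_running_loop()
    future = _mlx_worker.submit(loop, func, *args, **kwargs)  # already an asyncio.Future
    # shield() lets the MLX work finish on the worker thread even if this coroutine is cancelled
    return await asyncio.shield(future)
```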
Tests
The new tests are well-structured and cover the right cases (thread persistence, FIFO ordering, exception propagation, MLX ops). No issues there.
Summary
The MLXWorkerThread itself is solid. The main blocker is the streaming regression — stream_generate and stream_chat must preserve true token-by-token streaming, not batch-then-yield.
Replaces asyncio.to_thread() + bind_generation_streams() in SimpleEngine with a persistent MLXWorkerThread that loads the model and runs all inference on the same OS thread, fixing 'RuntimeError: There is no Stream(gpu, N) in current thread' on MLX >= 0.31.2 (which made Metal CommandEncoder storage thread-local).

Known issues per reviewer feedback (waybarrios#478):
- Streaming regression: stream_generate/stream_chat collect all chunks before yielding (TTFT = total generation time). Smoke tests use non-streaming so this doesn't block initial DeepSeek-V4-Flash bring-up, but must be fixed before relying on SSE clients.
- Module-level singleton starts a daemon thread at import.
- generation_stream rebinding on worker thread is dropped (may break paths that explicitly reference module-level generation_stream).

Pulled from xykong/vllm-mlx@fix/persistent-mlx-worker-thread.
Summary
Replaces `asyncio.to_thread()` + `bind_generation_streams()` in `SimpleEngine` with a persistent `MLXWorkerThread` that loads the model and runs all inference on the same OS thread, fixing `RuntimeError: There is no Stream(gpu, N) in current thread` on MLX >= 0.31.2.

Problem
MLX 0.31.2 (ml-explore/mlx#3348) made Metal `CommandEncoder` storage thread-local. The current `SimpleEngine._run_blocking_serialized()` spawns transient threads via `asyncio.to_thread()` and calls `bind_generation_streams()`, which creates a new stream index via `mx.new_stream()`. However, the model's compiled Metal kernels reference the original stream index from the loading thread — that index does not exist in the transient worker thread's TLS.

Reproduction: Run `vllm-mlx` with `SimpleEngine` on Apple Silicon with MLX >= 0.31.2 and any model that triggers Metal kernel compilation during load (most models). The server crashes on the first inference request.

Root Cause
1. Model load creates compiled Metal kernels bound to stream index N on the loading thread (`mx.new_stream()`)
2. `asyncio.to_thread()` spawns a new thread each time
3. `bind_generation_streams()` calls `mx.new_stream()` → creates stream M (M ≠ N)
4. The compiled kernels still reference stream N → `There is no Stream(gpu, N) in current thread`

Fix
Introduce `MLXWorkerThread` — a persistent single-threaded executor: a single daemon thread that lives for the lifetime of the process, loads the model, and runs all inference, guaranteeing thread-local stream consistency.
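For reviewers reading the description without the diff open, the rough shape of such a worker is sketched below; this is illustrative rather than the PR's exact code, and it assumes the `queue.SimpleQueue` dispatch and `submit(loop, func, *args, **kwargs)` signature mentioned elsewhere in this PR:

```python
import asyncio
import queue
import threading


class MLXWorkerThread:
    """Persistent daemon thread: model load and all inference run on this one thread."""

    def __init__(self):
        self._tasks: queue.SimpleQueue = queue.SimpleQueue()
        self._thread = threading.Thread(target=self._run, name="mlx-worker", daemon=True)
        self._thread.start()

    def _run(self):
        # Tasks execute strictly in FIFO order on this single thread.
        while True:
            loop, future, func, args, kwargs = self._tasks.get()
            try:
                result = func(*args, **kwargs)
            except BaseException as exc:
                self._resolve(loop, future, exc=exc)
            else:
                self._resolve(loop, future, result=result)

    @staticmethod
    def _resolve(loop, future, result=None, exc=None):
        def _set():
            if future.cancelled():
                return
            if exc is not None:
                future.set_exception(exc)
            else:
                future.set_result(result)
        loop.call_soon_threadsafe(_set)

    def submit(self, loop: asyncio.AbstractEventLoop, func, *args, **kwargs) -> asyncio.Future:
        """Schedule func on the worker thread and resolve the Future on the given event loop."""
        future = loop.create_future()
        self._tasks.put((loop, future, func, args, kwargs))
        return future
```

A single consumer thread drains the queue, so submitted tasks run one at a time on the same thread that loaded the model.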
This matches the proven approach in:

- `ThreadPoolExecutor(max_workers=1)` with initializer
Changes

- `vllm_mlx/mlx_streams.py`: Added `MLXWorkerThread` class (persistent daemon thread with `queue.SimpleQueue` task dispatch). `bind_generation_streams()` retained for backward compatibility but deprecated.
- `vllm_mlx/engine/simple.py`:
  - `start()` loads the model via `_mlx_worker.submit()` instead of an event-loop-thread load
  - `_run_blocking_serialized()` submits to `_mlx_worker` instead of `asyncio.to_thread()`
  - `stream_generate()` and `stream_chat()` MLLM text-only fallback route through the worker
  - Dropped `_bind_worker_generation_streams()` and `run_blocking_startup_work` usage
- `tests/test_mlx_worker_thread.py`: New test file verifying thread persistence, FIFO ordering, exception propagation, and MLX Metal ops.

Testing
Tested on Apple M4 Max 128GB, macOS Tahoe 26.3.1, MLX 0.31.2, with `mlx-community/gemma-4-26b-a4b-it-4bit`.

Related Issues
- `RuntimeError: There is no Stream(gpu, N) in current thread` when calling `/v1/messages` or `/v1/chat/completions` on Qwen3.5 MLX models (MLLM path runs in worker thread) #407