
Add TurboQuant KV cache compression for prefix cache (4.6x)#233

Open
arozanov wants to merge 137 commits into waybarrios:main from arozanov:feature/turboquant-kv-cache

Conversation

@arozanov

Summary

Adds --turbo-kv-bits option (1-4) to compress prefix cache entries using TurboQuant (PolarQuant: randomized Hadamard rotation + Lloyd-Max codebook quantization). At 3-bit, this gives 4.6x compression vs FP16, compared to ~2x from the existing --kv-cache-quantization.

This is useful for Apple Silicon where memory is the bottleneck — more prefix cache entries fit in RAM, improving cache hit rates on long-context workloads.

Usage

vllm-mlx serve model --turbo-kv-bits 3

Replaces --kv-cache-quantization when set. Falls back to standard quantization if TurboQuant is not available.

Changes

  • memory_cache.py: _turbo_quantize_cache() / updated _dequantize_cache(), estimate_kv_cache_memory() support, _trim_cache_offset() support, needs_dequantize property on config, validation
  • scheduler.py: turbo_kv_bits field in SchedulerConfig, propagation to MemoryCacheConfig
  • cli.py: --turbo-kv-bits argument for serve and bench commands

Dependency

Requires mlx-lm with TurboQuant KV cache support: ml-explore/mlx-lm#1067

Test plan

  • Roundtrip: quantize → dequantize preserves data (cosine sim 0.98+)
  • from_state deserialization → dequantize (auto-init quantizer)
  • estimate_kv_cache_memory returns correct bytes for TurboQuant entries
  • _trim_cache_offset creates shallow copy (does not mutate stored entry)
  • has_non_trimmable correctly identifies TurboQuant as trimmable
  • needs_dequantize property gates all fetch paths
  • Config validation rejects invalid bit widths
  • dtype preserved through quantize/dequantize cycle (float16, bfloat16, float32)
  • CLI --turbo-kv-bits propagates to SchedulerConfig → MemoryCacheConfig
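The roundtrip criterion from the first bullet can be sketched in isolation: the real test exercises the cache API, but the acceptance check is just a cosine-similarity floor. The additive-noise stand-in below is illustrative, not the actual quantizer.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened tensors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Simulate a lossy quantize -> dequantize cycle with small additive noise
# (stands in for the TurboQuant roundtrip in the real test).
rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 64)).astype(np.float32)
roundtrip = kv + 0.05 * rng.standard_normal(kv.shape).astype(np.float32)

assert cosine_sim(kv, roundtrip) > 0.98
```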

arozanov force-pushed the feature/turboquant-kv-cache branch from b048558 to 2bba367 on March 29, 2026 16:03
Collaborator

janhilgard left a comment


Code Review

Clean, well-structured code that follows existing patterns (_quantize_cache / _dequantize_cache). The needs_dequantize property is an elegant abstraction. A few concerns:

🔴 Breaking change: is_trimmable() (blocking)

The has_non_trimmable check changed from duck-typing (hasattr(lc, "offset") and hasattr(lc, "keys")) to hasattr(lc, "is_trimmable") and lc.is_trimmable(). Problem: existing KVCache and QuantizedKVCache don't have an is_trimmable() method, so after this change ALL cache layers will be marked as non-trimmable — this completely breaks supersequence and LCP matching for all users, even without --turbo-kv-bits.

Suggestion — keep the old duck-typing as fallback:

has_non_trimmable = any(
    not (
        (hasattr(lc, "is_trimmable") and lc.is_trimmable())
        or (hasattr(lc, "offset") and hasattr(lc, "keys"))
    )
    for lc in cache
)

🟡 Private API access in dequantization

_dequantize_cache() accesses many private attributes of TurboQuantKVCache (_k_q, _v_q, _k_dim, _v_dim, _dtype, _full_dequant(), _ensure_quantizer()). This is fragile — private API can change without notice. Does mlx-lm PR #1067 expose a public .dequantize() or .to_kvcache() method? If not, it would be worth proposing one upstream.

🟡 Shallow copy risk in _trim_cache_offset

tc.__dict__.update(layer_cache.__dict__) shares references to all internal objects. The invalidation of _k_deq_buf / _v_deq_buf assumes specific implementation details. If TurboQuantKVCache adds more cache buffers later, they'll be stale. Does the upstream class provide a copy() or trim() method?

🟡 estimate_kv_cache_memory.state property

Accessing .state may trigger lazy evaluation if it returns dequantized tensors. Safer to iterate directly over packed arrays (k_packed, v_packed, k_norms, v_norms).
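A minimal sketch of that direct iteration (attribute names are taken from the review text and are assumptions about the upstream class):

```python
import numpy as np
from types import SimpleNamespace

def estimate_turbo_entry_bytes(layer) -> int:
    """Sum the packed/norm arrays' bytes directly instead of touching
    .state, which may lazily dequantize. Attribute names follow the
    review and are assumptions about TurboQuantKVCache."""
    total = 0
    for name in ("k_packed", "v_packed", "k_norms", "v_norms"):
        arr = getattr(layer, name, None)
        if arr is not None:
            total += arr.nbytes
    return total

# Stand-in layer: 6 + 6 bytes of packed codes, 4 bytes of float16 norms
layer = SimpleNamespace(
    k_packed=np.zeros((2, 3), dtype=np.uint8),
    v_packed=np.zeros((2, 3), dtype=np.uint8),
    k_norms=np.zeros(2, dtype=np.float16),
    v_norms=None,
)
assert estimate_turbo_entry_bytes(layer) == 16
```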

ℹ️ Upstream dependency

PR depends on ml-explore/mlx-lm#1067 which is not merged yet. Worth noting as a prerequisite in the description so this doesn't get merged prematurely.


@arozanov
Author

arozanov commented Apr 1, 2026


Thanks for the thorough review!

Fixed:

  • is_trimmable() regression: added duck-typing fallback for existing KVCache/QuantizedKVCache
  • estimate_kv_cache_memory: now iterates packed arrays directly (k_packed, v_packed, k_norms, v_norms) instead of .state to avoid triggering lazy dequantization

For the private API and shallow copy concerns: added public dequantize() and copy() methods to TurboQuantKVCache upstream in mlx-lm #1067. Will update this PR to use them once that's merged.
Upstream dependency noted in the description.

arozanov requested a review from janhilgard on April 1, 2026 19:59
@janhilgard
Collaborator

Thanks for the quick fixes! The is_trimmable() fallback and direct packed array iteration look good.

Happy to hear about the public dequantize() and copy() methods upstream — that'll make the integration much cleaner. No further concerns from my side, just waiting on mlx-lm #1067 to land.

@Thump604
Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @arozanov: brief positive note.

Memory-bound prefix caching is exactly the right pressure point for Apple Silicon (where unified memory is the long-context bottleneck), and 4.6x compression on prefix entries is meaningful relative to the existing ~2x from --kv-cache-quantization.
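For context on the two ratios, a back-of-envelope calculation (attributing the gap to per-element metadata such as norms or codebook state is an assumption, not something stated in the PR):

```python
# Raw 3-bit vs FP16 would be 16/3 ≈ 5.33x. The quoted 4.6x implies the
# effective storage is 16/4.6 ≈ 3.48 bits per element, i.e. roughly
# 0.48 bits/element of overhead (presumably norms/codebook metadata).
raw_ratio = 16 / 3
effective_bits = 16 / 4.6
overhead_bits = effective_bits - 3

assert abs(raw_ratio - 5.333) < 0.001
assert abs(overhead_bits - 0.478) < 0.001
```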

Two questions for completeness, not blocking:

  1. What is the quality impact at 3-bit (PolarQuant) on representative tasks? The TurboQuant paper has ablation numbers but the empirical impact on long-context QA or needle-in-haystack on Qwen 3.5 / Gemma 4 would be useful for users to decide whether to enable it.
  2. Is the --turbo-kv-bits flag mutually exclusive with --kv-cache-quantization, or are they layered?

Mergeable on current main.

Owner

waybarrios left a comment


Code review

Found 4 issues:


1. PR needs rebase — _trim_cache_offset and _dequantize_cache were rewritten on main

After this PR branched, the _QuantizedCacheWrapper refactor landed on main (commit 6f0efc2), which completely rewrote both _trim_cache_offset and _dequantize_cache. The PR still imports and checks against QuantizedKVCache from mlx-lm, but main now uses an internal _QuantizedCacheWrapper class. This will cause merge conflicts and silent bugs if resolved incorrectly.

What the PR expects:

from mlx_lm.models.cache import QuantizedKVCache
# ...
if QuantizedKVCache is not None and isinstance(layer_cache, QuantizedKVCache):

What main now has:

if isinstance(layer_cache, _QuantizedCacheWrapper):
    # completely different structure with orig_type/orig_attrs

CI also confirms black --check fails on memory_cache.py. A full rebase against current main is needed.

    """Create a cache entry with memory estimation."""
    memory = estimate_kv_cache_memory(cache)
    return cls(
        tokens=tuple(tokens),
        cache=cache,
        memory_bytes=memory,
    )


def _trim_cache_offset(cache: list[Any], trim_by: int) -> list[Any]:
    """Create shallow copies of KVCache/QuantizedKVCache/TurboQuantKVCache
    layers with offset reduced.

    This is used when returning a cached KV state to the scheduler so that
    the last N positions are "freed" and the model will recompute them on the
    next forward pass (preventing duplicate KV entries).

    Supports KVCache, QuantizedKVCache, and TurboQuantKVCache.
    """
    from mlx_lm.models.cache import KVCache

    try:
        from mlx_lm.models.cache import QuantizedKVCache
    except ImportError:
        QuantizedKVCache = None  # noqa: N806
    try:
        from mlx_lm.models.turboquant_cache import TurboQuantKVCache
    except ImportError:
        TurboQuantKVCache = None  # noqa: N806

    trimmed: list[Any] = []
    for layer_cache in cache:
        if QuantizedKVCache is not None and isinstance(layer_cache, QuantizedKVCache):
            tc = QuantizedKVCache.__new__(QuantizedKVCache)
            tc.keys = layer_cache.keys
            tc.values = layer_cache.values
            tc.offset = max(layer_cache.offset - trim_by, 0)
            tc.group_size = layer_cache.group_size
            tc.bits = layer_cache.bits
            trimmed.append(tc)
        elif TurboQuantKVCache is not None and isinstance(
            layer_cache, TurboQuantKVCache
        ):
            # Shallow copy with adjusted offset (do NOT mutate original)
            tc = TurboQuantKVCache.__new__(TurboQuantKVCache)
            tc.__dict__.update(layer_cache.__dict__)
            tc.offset = max(layer_cache.offset - trim_by, 0)
            tc._k_deq_buf = None  # invalidate decode buffer
            tc._v_deq_buf = None

2. Shallow copy in _trim_cache_offset shares mutable quantizer state with stored cache

The TurboQuantKVCache branch uses __dict__.update which shallow-copies all references. The mutable _k_q/_v_q quantizer objects end up shared between the trimmed copy and the original stored entry:

# This creates shared references to _k_q, _v_q (mutable quantizer objects)
tc = TurboQuantKVCache.__new__(TurboQuantKVCache)
tc.__dict__.update(layer_cache.__dict__)  # shallow copy
tc.offset = max(layer_cache.offset - trim_by, 0)
tc._k_deq_buf = None   # only buffers are reset
tc._v_deq_buf = None
# but _k_q and _v_q are NOT copied — they're shared with the original

Later, _dequantize_cache calls layer._ensure_quantizer(...) which mutates quantizer state in-place. Since _k_q/_v_q are shared, this corrupts the stored cache entry — violating the "do NOT mutate original" comment.

Fix: either deep-copy the quantizer objects, or use the upstream copy() method once mlx-lm#1067 lands:

# Option A: deep copy quantizers
import copy
tc._k_q = copy.deepcopy(layer_cache._k_q) if layer_cache._k_q is not None else None
tc._v_q = copy.deepcopy(layer_cache._v_q) if layer_cache._v_q is not None else None

# Option B (preferred): use upstream public API
tc = layer_cache.copy()
tc.offset = max(layer_cache.offset - trim_by, 0)


3. _dequantize_cache accesses 10+ private attributes — should use public API

The dequantization path reaches deep into TurboQuantKVCache internals (_k_q, _v_q, _k_dim, _v_dim, _dtype, _full_dequant(), _ensure_quantizer(), etc.). This reimplements internal logic that belongs inside the cache class itself, and since TurboQuantKVCache comes from an unmerged upstream PR (ml-explore/mlx-lm#1067), these private APIs are highly likely to change.

Current approach:

# 10+ private attribute accesses
if layer._k_q is None:
    layer._ensure_quantizer(layer._k_dim, layer._v_dim)
B, H = layer.k_packed.shape[:2]
dtype = layer._dtype if layer._dtype is not None else mx.float16
k_all = layer._full_dequant(
    layer.k_packed, layer.k_norms, layer._k_q,
    layer._k_dim, B, H, layer.offset, dtype,
)

The upstream PR already exposes dequantize() and copy() public methods. This should be:

elif TurboQuantKVCache is not None and isinstance(layer, TurboQuantKVCache) and not layer.empty():
    result.append(layer.dequantize())

        )
        kv.offset = layer.offset
        result.append(kv)
    elif TurboQuantKVCache is not None and isinstance(
        layer, TurboQuantKVCache
    ) and not layer.empty():
        # Ensure quantizer is initialized (needed after from_state)
        if layer._k_q is None:
            layer._ensure_quantizer(layer._k_dim, layer._v_dim)
        B, H = layer.k_packed.shape[:2]
        dtype = layer._dtype if layer._dtype is not None else mx.float16
        k_all = layer._full_dequant(
            layer.k_packed, layer.k_norms, layer._k_q,
            layer._k_dim, B, H, layer.offset, dtype,
        )
        v_all = layer._full_dequant(
            layer.v_packed, layer.v_norms, layer._v_q,
            layer._v_dim, B, H, layer.offset, dtype,
        )
        kv = KVCache()

4. RotatingKVCache metadata lost on dequantize — regression for sliding-window models

_turbo_quantize_cache only handles plain KVCache. On dequantize, it always reconstructs a plain KVCache:

# _turbo_quantize_cache — only matches plain KVCache
if isinstance(layer, KVCache) and layer.keys is not None:
    compressed.append(layer.to_turbo_quantized(bits=bits))

# _dequantize_cache — always creates plain KVCache, losing RotatingKVCache metadata
kv = KVCache()  # step, max_size, _idx are gone
kv.update_and_fetch(k_all, v_all)

This re-introduces the bug fixed by the _QuantizedCacheWrapper refactor, which preserves orig_type/orig_attrs to correctly reconstruct RotatingKVCache for sliding-window models (Gemma 4, etc.). The TurboQuant path needs the same preservation pattern:

# Should preserve the original cache type, similar to _QuantizedCacheWrapper
orig_type = type(layer)  # could be RotatingKVCache
orig_attrs = {k: getattr(layer, k) for k in ("step", "max_size", "_idx") if hasattr(layer, k)}
# ... then reconstruct with orig_type and orig_attrs on dequantize

    return quantized


def _turbo_quantize_cache(cache: list[Any], bits: int = 3) -> list[Any]:
    """Compress KVCache layers with TurboQuant (4.6x at 3-bit).

    Uses PolarQuant: randomized Hadamard rotation + Lloyd-Max codebook
    quantization with fused Metal kernels. See arXiv 2504.19874.
    """
    from mlx_lm.models.cache import KVCache

    compressed = []
    for layer in cache:
        if isinstance(layer, KVCache) and layer.keys is not None:
            compressed.append(layer.to_turbo_quantized(bits=bits))


TL;DR: The PR needs a rebase against current main (the _QuantizedCacheWrapper refactor changed the code this PR modifies). After rebasing, the main concerns are: (1) use public API from upstream instead of private attributes, (2) handle RotatingKVCache preservation like the existing quantization path does, and (3) fix the shallow copy to avoid shared mutable state.

Thump604 force-pushed the feature/turboquant-kv-cache branch from 9e909a8 to 871a78c on April 11, 2026 02:46
@Thump604
Collaborator

I rebased this PR onto current main and pushed the updated branch.

The follow-up changes address the review points directly:

  • switched the TurboQuant path onto the current wrapper-based memory_cache.py shape instead of the older QuantizedKVCache branch
  • removed the private TurboQuant dequantization path and now use public copy() / dequantize() when TurboQuant objects are present
  • stopped the shallow __dict__.update(...) copy for TurboQuant cache trimming
  • constrained TurboQuant storage to the plain KVCache path and preserved wrapper/original-cache metadata alongside it
  • kept the is_trimmable fallback and memory-estimation fixes from the later review-addressing commits

Validation I ran after the rebase:

  • black --check vllm_mlx/memory_cache.py vllm_mlx/cli.py vllm_mlx/scheduler.py tests/test_kv_cache_quantization.py
  • pytest -q tests/test_memory_cache.py tests/test_kv_cache_quantization.py (65 passed)

I also updated the stale quantization assertions in tests/test_kv_cache_quantization.py so they match the current wrapper-based implementation on main.

@arozanov
Author

Thanks @waybarrios for the detailed review and @Thump604 for the rebase and fixes.

This is ready for final review. All four issues from the review are addressed in the latest push; we're waiting on ml-explore/mlx-lm#1067 upstream before this can land.
@Thump604 on the quality question: at 3-bit on Qwen 3 8B we see less than a 0.5 perplexity increase on WikiText-2 and no degradation on MMLU. I can add benchmark numbers to the README if useful.
The --turbo-kv-bits flag is mutually exclusive with --kv-cache-quantization; specifying both fails with a clear error.
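The mutual exclusion can be enforced directly at the CLI layer. A hedged sketch with argparse (the real flags' shapes are simplified assumptions; --kv-cache-quantization may take parameters in vllm-mlx):

```python
import argparse

parser = argparse.ArgumentParser(prog="vllm-mlx serve")
group = parser.add_mutually_exclusive_group()
group.add_argument("--turbo-kv-bits", type=int, choices=range(1, 5))
group.add_argument("--kv-cache-quantization", action="store_true")

args = parser.parse_args(["--turbo-kv-bits", "3"])
assert args.turbo_kv_bits == 3

# Passing both flags makes argparse exit with a "not allowed with" error.
try:
    parser.parse_args(["--turbo-kv-bits", "3", "--kv-cache-quantization"])
except SystemExit:
    pass
else:
    raise AssertionError("expected mutual-exclusion error")
```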

Collaborator

Thump604 left a comment


The rebase fixed the earlier structural issues, but I still see one merge blocker around the upstream dependency boundary.

Right now the CLI accepts --turbo-kv-bits, and the cache path will happily carry that config even when the underlying mlx-lm TurboQuant support is not present. In _turbo_quantize_cache() the actual compression is gated by hasattr(layer, "to_turbo_quantized"), so on a runtime without mlx-lm#1067 this can silently degrade into "flag accepted, no compression happened".

That is a bad failure mode for a user-facing memory/compression flag. I think the PR needs one of these before merge:

  • fail fast at startup / config validation when --turbo-kv-bits is set but TurboQuant support is unavailable, or
  • gate the CLI flag itself behind detected TurboQuant capability.

Without that, we expose a feature flag whose success path depends on an unmerged upstream capability and can no-op silently.
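A fail-fast capability check along these lines would satisfy the first option (hypothetical helper names; the module path mirrors this PR's import and may change upstream):

```python
import importlib.util
from typing import Optional

def turboquant_available(module: str = "mlx_lm.models.turboquant_cache") -> bool:
    """True when the TurboQuant cache module is importable."""
    try:
        return importlib.util.find_spec(module) is not None
    except ModuleNotFoundError:  # parent package missing entirely
        return False

def check_turboquant_capability(bits: Optional[int]) -> None:
    """Exit at startup when --turbo-kv-bits is set without TurboQuant
    support (a sketch of the reviewer's suggestion, not the merged code)."""
    if bits is not None and not turboquant_available():
        raise SystemExit(
            "--turbo-kv-bits requires mlx-lm with TurboQuant support "
            "(ml-explore/mlx-lm#1067); install a compatible mlx-lm "
            "or drop the flag."
        )

check_turboquant_capability(None)  # no TurboQuant requested: no-op
```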

Thump604 added 18 commits April 14, 2026 00:04
RerankRequest, RerankResult, RerankUsage, RerankResponse following
Jina/Cohere API convention.
RerankAdapter ABC defines tokenize_pair, extract_score, normalize.
SigmoidAdapter implements the default single-logit sigmoid pattern
used by Jina Reranker v2, BGE Reranker v2, MS-MARCO MiniLM.
Adds config and metrics types for the SSD KV cache tiering feature.
Config covers dir paths, capacity limits, file permissions, spill queue
sizing, and retention. Stats expose spill_count, spill_bytes, ssd_hits,
ssd_misses, reload_latency, reload_bytes, and promotion_failures.
Replaces mutable index.json with SQLite WAL-mode database. Supports
exact lookup by token hash, prefix matching via full token blob
comparison, LRU queries, touch for access-time updates, and atomic
insert-or-replace. Schema versioned for future migrations.
RerankEngine loads cross-encoder models, scores (query, doc) pairs
with token-budget batching to control memory. Uses adapter contract
for family-specific scoring. Sidecar to the main chat engine.
KVCacheSerializer handles KVCache/RotatingKVCache (keys/values/offset).
ArraysCacheSerializer handles ArraysCache/MambaCache (state list).
Support matrix documents which cache types are supported. Duck-typed
dispatch via get_serializer_for_layer().
From-weights forward pass for BERT/RoBERTa/XLM-RoBERTa cross-encoder
sequence classification. Avoids pulling full transformers modeling
stack at inference — only the tokenizer is imported from transformers.
Creates cache_dir/data/ layout with configurable permissions, opens
SQLite index, exposes metrics via get_stats(), and provides deterministic
entry hashing via SHA-256.
POST /v1/rerank route following Jina/Cohere convention. Supports string
and object documents, top_n filtering, return_documents toggle, and
model locking. Reranker model appears in /v1/models with explicit
owned_by='vllm-mlx-reranker' for backwards-compatible schema.
Background daemon thread drains a bounded queue. Each entry is written
to a temp directory, then atomically renamed. Queue-full policy drops
entries with a warning (non-blocking). Manifest + per-layer safetensors
written with configurable file permissions.
Pre-loads a reranker model at startup and locks the endpoint to that
model, matching the existing --embedding-model pattern.
_evict_lru() now spills to SSD tier (if attached) instead of discarding.
set_ssd_tier() allows optional attachment. Backward-compatible: without
an SSD tier, behavior is unchanged.
Full round-trip tests with string and object documents, top_n,
return_documents, metadata preservation, and /v1/models listing.
lookup_ssd() does fast SQLite check from synchronous fetch().
async_promote() reserves RAM budget BEFORE disk read (via reserve_fn),
reads entry in thread pool, releases budget on failure. Corrupt entries
are quarantined and removed from index.
_enforce_capacity() evicts oldest entries after each spill when entry
count or total bytes exceed limits. reconcile() cleans orphaned index
entries and data directories on startup.
Wires SSD tier into SchedulerConfig, creates SSDCacheTier in Scheduler
init when ssd_cache_dir is set, attaches it to MemoryAwarePrefixCache,
and runs reconciliation on startup.
- Replace unconstrained lazy loading with 404 when no --rerank-model
  configured (security: prevents arbitrary HuggingFace downloads)
- Wrap score_pairs/count_tokens in asyncio.to_thread() to avoid
  blocking the event loop during MLX computation
- Add asyncio.Semaphore for max_concurrency enforcement
- Validate empty queries (400 per spec)
- Remove unused numpy import from rerank.py
- Strengthen test assertions (404 status, empty query test)
check_ssd() provides fast SQLite lookup from synchronous fetch() path.
promote_from_ssd() runs async disk read with RAM budget reservation.
SSD I/O never enters the synchronous fetch() call — the scheduler
handles the handoff.
waybarrios and others added 26 commits April 23, 2026 12:20
Rewrite README with current features and refreshed benchmarks. Add
README.es.md, README.fr.md, README.zh.md with a language switcher.
Mirror the full docs tree into docs/es/, docs/fr/, docs/zh/ (22 files per
language). English stays as the default. Add docs/guides/moe-top-k.md
so the README links resolve. Fix the benchmarks link to point to
docs/benchmarks/ and fix relative paths in translated audio and index
pages so they resolve from the deeper folder.
Adds opt-in lifecycle-managed residency for the default server model, including lazy load, idle unload, request-scoped acquire/release, status surfaces, and lifecycle coverage. Maintainer follow-up before merge: approved fork workflow, fixed stale lint/Black drift on the contributor branch, and re-ran CI green.
Fix Qwen3.5 MLLM broadcast failures when cached position_ids from a previous request no longer match the current chunk length. CI is green; this closes waybarrios#386.
Retry chat template application on the tokenizer when an MLLM processor exposes apply_chat_template but has no template of its own. CI is green; this closes waybarrios#131.
Run reasoning extraction alongside tool parsing so residual reasoning markers do not leak into response content when tool calls are present. CI is green.
Preserve text-route decoding controls after SimpleEngine SpecPrefill, fail closed for unsafe MTP processor stacks, share thinking-retirement resume logic, and bind MLX generation streams on the scheduler worker thread for continuous batching and MLLM scheduler loops. Validation: local affected slice 66 passed, 9 deselected; Black check clean across 16 touched files; GitHub Actions run 24860026718 green across lint, type-check, Python 3.10-3.13 test matrix, Apple Silicon 3.11/3.13, and aggregate tests. Fixes waybarrios#398.
* fix: keep repeated think blocks out of final content

* fix: suppress duplicate end tags after reasoning
…#374)

* fix: streaming tool calls drop for Qwen3.6 bracket format

Two bugs caused Qwen3.6 [Calling tool: name({...})] streaming tool calls
to leak into text content instead of emitting structured tool_calls:

1. server.py _stream_responses_request: the fast-path gate checked
   `"<" not in delta_text`, which skips the tool parser for bracket-format
   deltas (they start with "["). Refactored to use the existing
   `_streaming_tool_markup_possible()` helper, matching the 4 other
   streaming paths that already use it.

2. qwen_tool_parser.extract_tool_calls_streaming: the closing-marker
   check looked for `</tool_call>` or `)]` in `delta_text` only. Those
   markers routinely span token boundaries (e.g. `)` and `]` arrive in
   separate deltas), so the check never fires and the parser returns
   None for every chunk, suppressing the whole call. Check `current_text`
   (accumulated) instead so the close is detected reliably.

Reproduction: multi-turn tool-calling session with Qwen3.6-35B-A3B-8bit
and --tool-call-parser qwen --reasoning-parser qwen3. Without these
fixes, streaming emits `[Calling tool: create_file({...})]` as content.
With fixes, structured tool_calls are emitted and a 40-turn drift test
passes cleanly (was failing at turn 5 before).

* test: cover split Qwen bracket tool-call streams

---------

Co-authored-by: Thump604 <thump@cosmiccooler.org>
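The token-boundary issue from point 2 above is easy to reproduce in isolation: a closing marker like `)]` can arrive split across two deltas, so a per-delta check never fires while an accumulated-text check does.

```python
# Closing markers routinely span token boundaries, so checking only the
# latest delta misses them; check the accumulated text instead.
deltas = ['[Calling tool: create_file({"path": "a.txt"}', ")", "]"]
current_text = ""
closed_on_delta = closed_on_accumulated = None
for i, d in enumerate(deltas):
    current_text += d
    if ")]" in d and closed_on_delta is None:
        closed_on_delta = i
    if ")]" in current_text and closed_on_accumulated is None:
        closed_on_accumulated = i

assert closed_on_delta is None        # per-delta check never fires
assert closed_on_accumulated == 2     # accumulated check detects the close
```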
Run EngineCore scheduler steps and default batched model startup on the MLX stream-owning event-loop thread. Adds Apple Silicon regression coverage for issue waybarrios#407.
Support both raw HF-offset and already-converted actual-gamma MTP RMSNorm weights without double-shifting converted bundles. Also keeps the fused Qwen3.6 expert remap and quantized triplet guard.
Add --default-chat-template-kwargs for server-wide chat template defaults, including Qwen enable_thinking control, with request kwargs overriding server defaults per key. Applies consistently across chat completions, Anthropic, and Responses API paths.
Add untimed sequential and batched warmup passes before measuring throughput so one-time Metal compilation overhead does not make the batching performance test flaky on loaded Apple Silicon machines.
Remove the extra ruff-format pre-commit hook so local pre-commit behavior matches ci.yml: ruff check for lint/imports, black for formatting.
Evaluate both KVCache key/value tensors and ArraysCache state tensors between MLLM chunked prefill steps, including prefix-cache partial-hit prefill, so hybrid models do not retain an unbounded lazy graph on long prompts.

Also centralize the cache tensor collection helper and add focused tests for KV-style and Arrays-style cache state.
Move text-only preprocessing (Jinja2 template rendering +
tokenization) to a thread-pool executor so the event loop stays
responsive for health checks, new connections, and active streaming
requests during long prompt preprocessing.

Changes:

1. Offload preprocessing to executor: text-only requests run
   _preprocess_request in run_in_executor before step(). CPU-bound
   (no MLX GPU work) and HuggingFace tokenizers are thread-safe.

2. Make _preprocess_request idempotent: when input_ids is already set
   for a text-only request, return immediately. This prevents the
   executor-offloaded work from being duplicated by _process_prompts
   inside step().

3. Adaptive yield after slow steps: when step() takes >1s (dense
   models doing chunked prefill via mx.eval), yield for 50ms instead
   of 0. This gives the event loop enough time to process
   asyncio.wait() timeouts, heartbeats, and disconnect polls between
   heavy GPU steps.

4. Log slow steps (>2s) at WARNING level so operators can identify
   event-loop stalls.

Test plan:
- Regression tests proving _preprocess_request is idempotent for
  text-only and not skipped for vision requests.
- Live test: 20K-token conversation, 20/20 health checks OK during
  preprocessing (max 323ms).

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
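The offload-plus-adaptive-yield pattern described above can be sketched as follows (function names are illustrative, not the project's actual API):

```python
import asyncio

async def handle_text_request(preprocess, step, request):
    """Run CPU-bound preprocessing off the event loop, then yield briefly
    after slow steps so heartbeats and disconnect polls get serviced."""
    loop = asyncio.get_running_loop()
    # Tokenization / template rendering: CPU-bound and thread-safe,
    # so it can run in the default thread-pool executor.
    await loop.run_in_executor(None, preprocess, request)
    t0 = loop.time()
    await step(request)
    if loop.time() - t0 > 1.0:
        await asyncio.sleep(0.05)  # adaptive yield after a slow step

def fake_preprocess(req):
    req["input_ids"] = [1, 2, 3]

async def fake_step(req):
    req["done"] = True

req = {}
asyncio.run(handle_text_request(fake_preprocess, fake_step, req))
assert req == {"input_ids": [1, 2, 3], "done": True}
```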
… Qwen native tool format (waybarrios#375)

- Forced tool_choice (specific function + required) with system prompt injection
- Reject unknown forced tool name with 400 error (ValueError)
- Fix 5 OpenAI schema violations: null tool_calls, duplicate reasoning, streaming delta nulls
- TCP keepalive for abrupt client disconnect detection (~25s)
- Qwen native tool format auto-detection (SUPPORTS_NATIVE_TOOL_FORMAT)
- Empty <tool_call> wrapper cleanup in QwenToolParser

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Incremental JSON context tracking: scan only new characters for
  bracket/brace depth and string state (O(1) per step instead of O(n))
- Bracket depth pre-check in _suffix_is_complete_json: skip expensive
  json.loads call when brackets are unbalanced (~99% of steps)
- numpy-based _build_allow_mask: C-level mask construction instead of
  Python loop over vocab_size elements
- Fix O(n) list comparison in __call__: use prompt_len directly
- Suffix decode uses full tokenizer.decode() with length-based cache
  (per-token concat is incorrect for BPE/SentencePiece tokenizers)
- Prefix-stability assertion resets incremental context state if
  tokenizer violates prefix invariant
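The incremental context tracking described above amounts to scanning only the newly appended characters for bracket depth and string state; a sketch (the state layout is an assumption):

```python
def update_json_context(depth: int, in_string: bool, escaped: bool, new_text: str):
    """Advance bracket/brace depth and string state over only the new
    characters (amortized O(1) per decoding step instead of O(n))."""
    for ch in new_text:
        if escaped:
            escaped = False
        elif ch == "\\" and in_string:
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "[{":
                depth += 1
            elif ch in "]}":
                depth -= 1
    return depth, in_string, escaped

depth, in_str, esc = update_json_context(0, False, False, '{"a": [1, ')
assert depth == 2 and not in_str
depth, in_str, esc = update_json_context(depth, in_str, esc, "2]}")
assert depth == 0  # balanced: only now is a json.loads attempt worthwhile
```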
…s_thread_stream_error

Continuing the cleanup from PR 411, the following tests are failing:

FAILED tests/test_continuous_batching.py::TestContinuousBatchingIntegration::test_single_request - AssertionError: assert 0 > 0
FAILED tests/test_continuous_batching.py::TestContinuousBatchingIntegration::test_concurrent_requests - assert False
FAILED tests/test_continuous_batching.py::TestContinuousBatchingIntegration::test_batching_improves_throughput - assert 0.0 > 100
FAILED tests/test_engine_core_stream_safety.py::test_engine_core_no_cross_thread_stream_error - AssertionError: scheduler logged cross-thread stream errors: ['Error in batch generation step: There is no Stream(gpu, 5) in current thread.\nTraceback (most recent call last):\n File "/Users/tperry/code/llm/vllm-mlx/vllm_mlx/scheduler.py", ...
FAILED tests/test_engine_core_thread_streams.py::test_engine_core_runs_all_scheduler_steps_on_one_worker_thread - AttributeError: 'module' object at vllm_mlx.engine_core has no attribute 'bind_generation_streams'
FAILED tests/test_model_registry.py::TestMultiEngine::test_sequential_engines_with_close - AssertionError: assert 0 > 0
FAILED tests/test_model_registry.py::TestMultiEngine::test_sequential_engines_without_close - AssertionError: assert 0 > 0
FAILED tests/test_model_registry.py::TestCacheRecovery::test_recovery_from_simulated_cache_corruption - AssertionError: assert 0 > 0
FAILED tests/test_model_registry.py::TestBenchmarkScenario::test_benchmark_like_usage - AssertionError: assert 0 > 0
FAILED tests/test_model_registry.py::TestBenchmarkScenario::test_multiple_models_sequentially - AssertionError: assert 0 > 0
FAILED tests/test_rerank.py::TestClassifierForward::test_classifier_forward_returns_logits_shape - RuntimeError: There is no Stream(gpu, 5) in current thread.
FAILED tests/test_rerank.py::TestClassifierForward::test_classifier_forward_different_num_labels - RuntimeError: There is no Stream(gpu, 5) in current thread.

I'm running with:

rm -rf .venv uv.lock
uv venv python --python 3.13
uv sync
uv pip install -e ".[dev,vision]"
uv run pytest
…-cross-thread-stream-errors

Fix/scheduler logged cross thread stream errors
…cessor-o-n-squared

Fix O(n^2) performance in JSONSchemaLogitsProcessor
Merges waybarrios/vllm-mlx main into our turboquant branch.
Key upstream additions:
- Fix streaming tool calls for Qwen bracket format
- Fix MLX stream thread affinity
- Fix cross-thread stream errors
- Add forced tool_choice + Qwen native tool format
- SSD cache tiering (--ssd-cache-dir)
- Prompt warm-up (--warm-prompts)
- Fix O(n^2) JSONSchemaLogitsProcessor
Resolved conflict in cli.py: kept both TurboQuant and SSD cache args.
Collaborator

janhilgard left a comment


Re-review: All blocking items addressed

Both previously flagged concerns are now resolved:

  1. Fail-fast at startup (Thump604's blocker): _check_turboquant_capability() is called in both CLI validation (cli.py) and MemoryCacheConfig.__post_init__(). If mlx-lm lacks TurboQuant support, the server exits immediately with an actionable error message pointing to ml-explore/mlx-lm#1067. No more silent degradation.

  2. Mutual exclusion: --turbo-kv-bits and --kv-cache-quantization are validated as mutually exclusive at CLI level and config level. Clear error messaging.

  3. Trim degradation warning: _TurboQuantCacheWrapper.is_trimmable() logs a one-time warning when copy() is unavailable, so operators know supersequence/LCP trimming is disabled.

  4. Code quality: The implementation cleanly mirrors the existing _QuantizedCacheWrapper pattern. needs_dequantize property is a good abstraction that replaces scattered kv_quantize checks. Memory estimation covers both wrapper and bare paths.

Minor note

  • CI lint (ruff) is currently failing — needs a formatting pass before merge.
  • The kimi_tool_parser.py change is unrelated cleanup (removes dead try/except, fixes rsplit) — fine to include but could be a separate commit for cleaner git history.

LGTM once lint passes.

@Thump604
Collaborator

Hi @arozanov -- this TurboQuant PR has review feedback outstanding and has been open since late March. Are you still planning to address the review comments and rebase? We're interested in the work but want to keep the PR list current. Will check back in two weeks.

- Use rsplit(":", 1)[0] instead of split(":")[-2] for func_name
  extraction. Previous code returned wrong name for namespaced
  functions (e.g. "namespace:func" would return "namespace").
- Remove dead try/except JSON validation (both branches identical).
arozanov force-pushed the feature/turboquant-kv-cache branch from 620e926 to 51ed523 on April 30, 2026 00:02