
Fix sampling defaults and short prefix-cache reuse #424

Open
Thump604 wants to merge 4 commits into main from fix/cli-defaults-prefix-cache-floor

Conversation

@Thump604
Collaborator

Supersedes #405 because the fork branch cannot be refreshed through the current OAuth token after upstream CI workflow changes landed. This branch is rebased on current main and carries the same scope:

  • propagate CLI/server defaults for top_k, min_p, presence_penalty, and repetition_penalty
  • preserve server default chat-template kwargs from current main
  • avoid reusing short prefix-cache entries that are too small to be useful and can contaminate benchmark runs
  • update focused tests for the current upstream helper/refactor shape

Local validation:

uv run --extra dev pytest -q tests/test_cli.py tests/test_server.py tests/test_server_cache_controls.py tests/test_memory_cache.py tests/test_memory_cache_mlx.py tests/test_prefix_cache.py
# 186 passed, 3 deselected

uv run --extra dev black --check vllm_mlx/cli.py vllm_mlx/server.py tests/test_cli.py tests/test_server.py
uvx ruff check vllm_mlx/cli.py vllm_mlx/server.py tests/test_cli.py tests/test_server.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
# clean

@Thump604
Collaborator Author

Would appreciate your review on this when you have a chance. Happy to address any feedback.

@janhilgard
Collaborator

@Thump604 Good scoping — the _resolve_* pattern is consistent with the existing temperature/top_p resolvers, and the prefix cache floor is a clean addition. Some notes:

Should fix

1. Responses API misses the new sampling defaults

_prepare_responses_request (~line 1413 in server.py) only resolves temperature and top_p. It does not call _resolve_top_k, _resolve_min_p, _resolve_presence_penalty, or _resolve_repetition_penalty. If a user sets CLI defaults for these, they will have no effect on /v1/responses requests. Either wire them in or document the omission as intentional.

2. Verify all Anthropic code paths go through _prepare_anthropic_invocation

The diff updates _prepare_anthropic_invocation, but earlier versions of the codebase had separate inline fallbacks (the old "value or 0" / "value or 1.0" pattern) in create_anthropic_message and _stream_anthropic_messages. If the base branch hasn't consolidated these, the new resolvers won't cover all Anthropic paths.

Prefix cache floor

3. No way to tune min_prefix_tokens from CLI

The 128-token default is reasonable for chat workloads (system prompts are typically 500-2000 tokens), but for classification or short-prompt use cases it could be too aggressive. A --min-prefix-tokens flag would let operators tune this without code changes — or at minimum log the value at startup so operators know what they're getting.
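A minimal sketch of such a flag, assuming argparse-style wiring (the real CLI may use a different parser; the flag name follows the suggestion above and is hypothetical):

```python
import argparse

# Illustrative only: the actual CLI has its own parser setup.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--min-prefix-tokens",
    type=int,
    default=128,
    help="Smallest prefix-cache entry (in tokens) eligible for reuse.",
)
args = parser.parse_args(["--min-prefix-tokens", "64"])

# Log the active floor at startup so operators can see what they're getting.
print(f"prefix cache floor: {args.min_prefix_tokens} tokens")
```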

4. Missing test: LCP guard below floor

test_short_prefix_reuse_is_rejected tests the store rejection and early fetch bail, but not the case where a long token sequence has an LCP match that falls below the 128-token floor (diff line ~428). That's a distinct code path worth covering.
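The guard in question can be sketched as a floor check on the longest-common-prefix length. This is a standalone illustration: usable_lcp is a made-up name and the real fetch path differs.

```python
def usable_lcp(cached: list[int], query: list[int], min_prefix_tokens: int = 128) -> int:
    """Return the LCP length if it meets the floor, else 0 (treated as a miss)."""
    n = 0
    for a, b in zip(cached, query):
        if a != b:
            break
        n += 1
    return n if n >= min_prefix_tokens else 0


# Long sequences whose overlap is only 100 tokens still miss under a 128 floor,
# even though an LCP match exists -- this is the distinct path worth testing.
short_overlap = usable_lcp(list(range(300)), list(range(100)) + [-1] * 200)
long_overlap = usable_lcp(list(range(300)), list(range(200)))
print(short_overlap, long_overlap)  # -> 0 200
```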

5. Missing test: load_from_disk short-entry skipping

The guard that skips short entries during cache restoration (diff line ~458) has no test coverage. A test that writes a short entry to disk, then loads with a higher floor and verifies it's skipped would close this gap.

Minor

  • test_server.py uses manual try/finally to restore globals — monkeypatch.setattr would be more pytest-idiomatic and auto-cleanup
  • The "miss_short_prefix" and "miss_short_lcp" match types are nice for diagnostics but aren't surfaced in get_stats() as separate counters — they just increment misses. Consider exposing them if you want observability into how often the floor triggers
  • Worth documenting the rationale for 128 as the default (e.g. "below ~128 tokens the KV restore overhead exceeds the prefill savings")
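The counter idea in the second bullet could look roughly like this. A sketch with assumed field names, not the project's actual stats class:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    # Per-reason breakdown so operators can see how often the floor fires.
    miss_reasons: Counter = field(default_factory=Counter)

    def record_miss(self, reason: str) -> None:
        self.misses += 1
        self.miss_reasons[reason] += 1

    def to_dict(self) -> dict:
        return {
            "hits": self.hits,
            "misses": self.misses,
            "miss_reasons": dict(self.miss_reasons),
        }


stats = CacheStats()
stats.record_miss("miss_short_prefix")
stats.record_miss("miss_short_prefix")
stats.record_miss("miss_short_lcp")
print(stats.to_dict())
```

This keeps the aggregate miss count unchanged while letting operators distinguish "128 misses, 90 of them short-prefix rejections" from an opaque miss total.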

Overall this is clean and the semantics around None vs. falsy (top_k=0 means "user said 0", top_k=None means "use server default") are a real improvement over the old or 0 pattern. The prefix cache floor will help avoid wasting cycles on tiny entries.

@Thump604
Collaborator Author

Jan, assigning this to you for review. It propagates extended sampling defaults (top_k, min_p, presence/repetition penalty) end to end and adds a short-prefix-cache reuse guard. Mergeable and CI green.

@janhilgard
Collaborator

Thanks for the ping. The sampling-default propagation and prefix-cache floor are both clean additions. A few items to address before merge:


Must fix

1. Responses API missing new sampling defaults

_prepare_responses_request (lines 1723-1727) builds chat_kwargs with only temperature and top_p resolved. It's missing top_k, min_p, presence_penalty, and repetition_penalty:

# Current (incomplete):
chat_kwargs = {
    "max_tokens": chat_request.max_tokens or _default_max_tokens,
    "temperature": _resolve_temperature(chat_request.temperature),
    "top_p": _resolve_top_p(chat_request.top_p),
}

# Should be:
chat_kwargs = {
    "max_tokens": chat_request.max_tokens or _default_max_tokens,
    "temperature": _resolve_temperature(chat_request.temperature),
    "top_p": _resolve_top_p(chat_request.top_p),
    "top_k": _resolve_top_k(chat_request.top_k),
    "min_p": _resolve_min_p(chat_request.min_p),
    "presence_penalty": _resolve_presence_penalty(chat_request.presence_penalty),
    "repetition_penalty": _resolve_repetition_penalty(chat_request.repetition_penalty),
}

Without this, --default-top-k 20 etc. won't take effect on the /v1/responses endpoint.


Should fix

2. CacheStats doesn't surface short-prefix rejections

miss_short_prefix and miss_short_lcp are tracked as _last_match_type strings but they're counted as generic misses in CacheStats. Adding dedicated counters (or at least including _last_match_type distribution in to_dict()) would help operators understand why cache hit rate dropped after upgrading — they'd see "128 misses, of which 90 were short-prefix rejections" vs. just "128 misses".

3. No CLI flag for min_prefix_tokens

The 128-token default is reasonable, but operators can't tune it without code changes. Adding --prefix-cache-min-tokens (defaulting to 128) would follow the pattern of the other cache knobs like --kv-cache-min-quantize-tokens.


Nits

4. test_server.py try/finally vs monkeypatch

The TestSamplingDefaults tests manually save/restore globals with try/finally. Since the test file already uses pytest, monkeypatch.setattr would be cleaner and safer if a test fails mid-execution:

def test_extended_sampling_defaults(self, monkeypatch):
    monkeypatch.setattr(server, "_default_top_k", 20)
    monkeypatch.setattr(server, "_default_min_p", 0.05)
    ...

5. Squash suggestion

4 commits could be squashed into 1-2 for a cleaner history (one for sampling defaults, one for prefix cache floor), but this is maintainer preference.


What's good

  • The None vs falsy semantics fix is a real improvement — top_k=0 now means "user explicitly requested 0" rather than being silently replaced.
  • Anthropic endpoint now correctly uses _resolve_temperature/_resolve_top_p — this was a bug before.
  • min_prefix_tokens enforcement is thorough: store, fetch (early bail + LCP check), and load_from_disk all covered.

CI is green 9/9. No merge conflicts.

@Thump604
Collaborator Author

Pushed 832735d addressing the review feedback.

Must fix:

  • Responses API now resolves all four extended sampling defaults (top_k, min_p, presence_penalty, repetition_penalty) in _prepare_responses_request. Both streaming and non-streaming paths use the same chat_kwargs dict so both are covered.

Should fix:

  • Added --prefix-cache-min-tokens CLI flag (serve + bench commands, default 128). Threaded through SchedulerConfig, MLLMSchedulerConfig, and into both MemoryCacheConfig instantiation sites (scheduler.py and mllm_scheduler.py). Logged at startup so operators know the active value.
  • CacheStats now exposes misses_short_prefix and misses_short_lcp as dedicated counters in to_dict(), incremented alongside the existing _last_match_type assignments.

Tests:

  • Added test_lcp_match_below_floor_is_rejected: stores a long entry, queries with a short LCP overlap, verifies miss with misses_short_lcp counter.
  • Added test_load_from_disk_skips_short_entries: creates on-disk cache layout with a 10-token entry, loads with min_prefix_tokens=64, verifies the entry is skipped before load_prompt_cache is called.
  • Converted TestSamplingDefaults from manual try/finally to monkeypatch.setattr.

Validation: 46/46 test_memory_cache.py, 2/2 TestSamplingDefaults, all files pass black --check and py_compile.
