
Fix sampling defaults and short prefix-cache reuse #424

Open
Thump604 wants to merge 4 commits into main from fix/cli-defaults-prefix-cache-floor

Conversation

@Thump604
Collaborator

Supersedes #405 because the fork branch cannot be refreshed through the current OAuth token after upstream CI workflow changes landed. This branch is rebased on current main and carries the same scope:

  • propagate CLI/server defaults for top_k, min_p, presence_penalty, and repetition_penalty
  • preserve server default chat-template kwargs from current main
  • avoid reusing short prefix-cache entries that are too small to be useful and can contaminate benchmark runs
  • update focused tests for the current upstream helper/refactor shape

Local validation:

uv run --extra dev pytest -q tests/test_cli.py tests/test_server.py tests/test_server_cache_controls.py tests/test_memory_cache.py tests/test_memory_cache_mlx.py tests/test_prefix_cache.py
# 186 passed, 3 deselected

uv run --extra dev black --check vllm_mlx/cli.py vllm_mlx/server.py tests/test_cli.py tests/test_server.py
uvx ruff check vllm_mlx/cli.py vllm_mlx/server.py tests/test_cli.py tests/test_server.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
# clean

@Thump604
Collaborator Author

Would appreciate your review on this when you have a chance. Happy to address any feedback.

@janhilgard
Collaborator

@Thump604 Good scoping — the _resolve_* pattern is consistent with the existing temperature/top_p resolvers, and the prefix cache floor is a clean addition. Some notes:

Should fix

1. Responses API misses the new sampling defaults

_prepare_responses_request (~line 1413 in server.py) only resolves temperature and top_p. It does not call _resolve_top_k, _resolve_min_p, _resolve_presence_penalty, or _resolve_repetition_penalty. If a user sets CLI defaults for these, they will have no effect on /v1/responses requests. Either wire them in or document the omission as intentional.

2. Verify all Anthropic code paths go through _prepare_anthropic_invocation

The diff updates _prepare_anthropic_invocation, but earlier versions of the codebase had separate inline fallbacks (the old "value or 0" / "value or 1.0" pattern) in create_anthropic_message and _stream_anthropic_messages. If the base branch hasn't consolidated these, the new resolvers won't cover all Anthropic paths.

Prefix cache floor

3. No way to tune min_prefix_tokens from CLI

The 128-token default is reasonable for chat workloads (system prompts are typically 500-2000 tokens), but for classification or short-prompt use cases it could be too aggressive. A --min-prefix-tokens flag would let operators tune this without code changes — or at minimum log the value at startup so operators know what they're getting.
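A minimal sketch of such a flag, assuming argparse-style wiring (the real CLI may use a different parser; the flag name follows the suggestion above and is hypothetical):

```python
import argparse

# Illustrative only: the actual CLI has its own parser setup.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--min-prefix-tokens",
    type=int,
    default=128,
    help="Smallest prefix-cache entry (in tokens) eligible for reuse.",
)
args = parser.parse_args(["--min-prefix-tokens", "64"])

# Log the active floor at startup so operators can see what they're getting.
print(f"prefix cache floor: {args.min_prefix_tokens} tokens")
```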

4. Missing test: LCP guard below floor

test_short_prefix_reuse_is_rejected tests the store rejection and early fetch bail, but not the case where a long token sequence has an LCP match that falls below the 128-token floor (diff line ~428). That's a distinct code path worth covering.
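The guard in question can be sketched as a floor check on the longest-common-prefix length. This is a standalone illustration: usable_lcp is a made-up name and the real fetch path differs.

```python
def usable_lcp(cached: list[int], query: list[int], min_prefix_tokens: int = 128) -> int:
    """Return the LCP length if it meets the floor, else 0 (treated as a miss)."""
    n = 0
    for a, b in zip(cached, query):
        if a != b:
            break
        n += 1
    return n if n >= min_prefix_tokens else 0


# Long sequences whose overlap is only 100 tokens still miss under a 128 floor,
# even though an LCP match exists -- this is the distinct path worth testing.
short_overlap = usable_lcp(list(range(300)), list(range(100)) + [-1] * 200)
long_overlap = usable_lcp(list(range(300)), list(range(200)))
print(short_overlap, long_overlap)  # -> 0 200
```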

5. Missing test: load_from_disk short-entry skipping

The guard that skips short entries during cache restoration (diff line ~458) has no test coverage. A test that writes a short entry to disk, then loads with a higher floor and verifies it's skipped would close this gap.

Minor

  • test_server.py uses manual try/finally to restore globals — monkeypatch.setattr would be more pytest-idiomatic and auto-cleanup
  • The "miss_short_prefix" and "miss_short_lcp" match types are nice for diagnostics but aren't surfaced in get_stats() as separate counters — they just increment misses. Consider exposing them if you want observability into how often the floor triggers
  • Worth documenting the rationale for 128 as the default (e.g. "below ~128 tokens the KV restore overhead exceeds the prefill savings")
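The counter idea in the second bullet could look roughly like this. A sketch with assumed field names, not the project's actual stats class:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    # Per-reason breakdown so operators can see how often the floor fires.
    miss_reasons: Counter = field(default_factory=Counter)

    def record_miss(self, reason: str) -> None:
        self.misses += 1
        self.miss_reasons[reason] += 1

    def to_dict(self) -> dict:
        return {
            "hits": self.hits,
            "misses": self.misses,
            "miss_reasons": dict(self.miss_reasons),
        }


stats = CacheStats()
stats.record_miss("miss_short_prefix")
stats.record_miss("miss_short_prefix")
stats.record_miss("miss_short_lcp")
print(stats.to_dict())
```

This keeps the aggregate miss count unchanged while letting operators distinguish "128 misses, 90 of them short-prefix rejections" from an opaque miss total.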

Overall this is clean and the semantics around None vs. falsy (top_k=0 means "user said 0", top_k=None means "use server default") are a real improvement over the old or 0 pattern. The prefix cache floor will help avoid wasting cycles on tiny entries.

@Thump604
Collaborator Author

Jan, assigning this to you for review. It propagates extended sampling defaults (top_k, min_p, presence/repetition penalty) end to end and adds a short-prefix-cache reuse guard. Mergeable and CI green.

@janhilgard
Collaborator

Thanks for the ping. The sampling-default propagation and prefix-cache floor are both clean additions. A few items to address before merge:


Must fix

1. Responses API missing new sampling defaults

_prepare_responses_request (lines 1723-1727) builds chat_kwargs with only temperature and top_p resolved. It's missing top_k, min_p, presence_penalty, and repetition_penalty:

# Current (incomplete):
chat_kwargs = {
    "max_tokens": chat_request.max_tokens or _default_max_tokens,
    "temperature": _resolve_temperature(chat_request.temperature),
    "top_p": _resolve_top_p(chat_request.top_p),
}

# Should be:
chat_kwargs = {
    "max_tokens": chat_request.max_tokens or _default_max_tokens,
    "temperature": _resolve_temperature(chat_request.temperature),
    "top_p": _resolve_top_p(chat_request.top_p),
    "top_k": _resolve_top_k(chat_request.top_k),
    "min_p": _resolve_min_p(chat_request.min_p),
    "presence_penalty": _resolve_presence_penalty(chat_request.presence_penalty),
    "repetition_penalty": _resolve_repetition_penalty(chat_request.repetition_penalty),
}

Without this, --default-top-k 20 etc. won't take effect on the /v1/responses endpoint.


Should fix

2. CacheStats doesn't surface short-prefix rejections

miss_short_prefix and miss_short_lcp are tracked as _last_match_type strings but they're counted as generic misses in CacheStats. Adding dedicated counters (or at least including _last_match_type distribution in to_dict()) would help operators understand why cache hit rate dropped after upgrading — they'd see "128 misses, of which 90 were short-prefix rejections" vs. just "128 misses".

3. No CLI flag for min_prefix_tokens

The 128-token default is reasonable, but operators can't tune it without code changes. Adding --prefix-cache-min-tokens (defaulting to 128) would follow the pattern of the other cache knobs like --kv-cache-min-quantize-tokens.


Nits

4. test_server.py try/finally vs monkeypatch

The TestSamplingDefaults tests manually save/restore globals with try/finally. Since the test file already uses pytest, monkeypatch.setattr would be cleaner and safer if a test fails mid-execution:

def test_extended_sampling_defaults(self, monkeypatch):
    monkeypatch.setattr(server, "_default_top_k", 20)
    monkeypatch.setattr(server, "_default_min_p", 0.05)
    ...

5. Squash suggestion

4 commits could be squashed into 1-2 for a cleaner history (one for sampling defaults, one for prefix cache floor), but this is maintainer preference.


What's good

  • The None vs falsy semantics fix is a real improvement — top_k=0 now means "user explicitly requested 0" rather than being silently replaced.
  • Anthropic endpoint now correctly uses _resolve_temperature/_resolve_top_p — this was a bug before.
  • min_prefix_tokens enforcement is thorough: store, fetch (early bail + LCP check), and load_from_disk all covered.

CI is green 9/9. No merge conflicts.

@Thump604
Collaborator Author

Pushed 832735d addressing the review feedback.

Must fix:

  • Responses API now resolves all four extended sampling defaults (top_k, min_p, presence_penalty, repetition_penalty) in _prepare_responses_request. Both streaming and non-streaming paths use the same chat_kwargs dict so both are covered.

Should fix:

  • Added --prefix-cache-min-tokens CLI flag (serve + bench commands, default 128). Threaded through SchedulerConfig, MLLMSchedulerConfig, and into both MemoryCacheConfig instantiation sites (scheduler.py and mllm_scheduler.py). Logged at startup so operators know the active value.
  • CacheStats now exposes misses_short_prefix and misses_short_lcp as dedicated counters in to_dict(), incremented alongside the existing _last_match_type assignments.

Tests:

  • Added test_lcp_match_below_floor_is_rejected: stores a long entry, queries with a short LCP overlap, verifies miss with misses_short_lcp counter.
  • Added test_load_from_disk_skips_short_entries: creates on-disk cache layout with a 10-token entry, loads with min_prefix_tokens=64, verifies the entry is skipped before load_prompt_cache is called.
  • Converted TestSamplingDefaults from manual try/finally to monkeypatch.setattr.

Validation: 46/46 test_memory_cache.py, 2/2 TestSamplingDefaults, all files pass black --check and py_compile.
