Fix sampling defaults and short prefix-cache reuse #424
Conversation
Would appreciate your review on this when you have a chance. Happy to address any feedback.
@Thump604 Good scoping.

Should fix
1. Responses API misses the new sampling defaults.
2. Verify all Anthropic code paths go through `_prepare_anthropic_invocation`.

Prefix cache floor
3. No way to tune `min_prefix_tokens`. The 128-token default is reasonable for chat workloads (system prompts are typically 500-2000 tokens), but for classification or short-prompt use cases it could be too aggressive. A CLI flag would help.
4. Missing test: LCP guard below floor.
5. Missing test: the guard that skips short entries during cache restoration (diff line ~458) has no test coverage. A test that writes a short entry to disk, then loads with a higher floor and verifies it's skipped would close this gap.

Minor
- `test_server.py` uses manual try/finally where `monkeypatch.setattr` would be more robust.
- `"miss_short_prefix"` / `"miss_short_lcp"` match types are not exposed in `get_stats()` counters.

Overall this is clean and the semantics around `None` vs falsy values are a real improvement.
janhilgard
left a comment
Should fix
- Responses API misses new sampling defaults — `_prepare_responses_request` only resolves `temperature` and `top_p`, not `top_k`, `min_p`, `presence_penalty`, or `repetition_penalty`.
- Verify all Anthropic code paths go through `_prepare_anthropic_invocation` — earlier versions had inline `or 0` patterns elsewhere.

Prefix cache floor
- No CLI flag for `min_prefix_tokens` — operators can't tune the 128-token default without code changes.
- Missing test: LCP match below floor (distinct from the early fetch bail).
- Missing test: `load_from_disk` short-entry skipping.

Minor
- `test_server.py` uses manual try/finally — `monkeypatch.setattr` would be more robust.
- `"miss_short_prefix"` / `"miss_short_lcp"` match types are not exposed in `get_stats()` counters.

Overall clean — the `None` vs falsy semantics (`top_k=0` means "user said 0") are a real improvement.
See full review comment for details.
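The missing `load_from_disk` short-entry test could look roughly like this. This is a self-contained sketch: the `PrefixCache` class, its method names, and the entry layout here are minimal stand-ins, not the real API from the diff.

```python
import json
import os
import tempfile

class PrefixCache:
    """Minimal stand-in for the real cache; only the floor check is modeled."""
    def __init__(self, min_prefix_tokens=128):
        self.min_prefix_tokens = min_prefix_tokens
        self.entries = {}  # prompt key -> prefix length in tokens

    def store(self, key, num_tokens):
        if num_tokens >= self.min_prefix_tokens:
            self.entries[key] = num_tokens

    def save_to_disk(self, path):
        with open(path, "w") as f:
            json.dump(self.entries, f)

    def load_from_disk(self, path):
        with open(path) as f:
            on_disk = json.load(f)
        # A cache written under a lower floor may contain entries the
        # current configuration would reject -- skip those on restore.
        for key, num_tokens in on_disk.items():
            if num_tokens >= self.min_prefix_tokens:
                self.entries[key] = num_tokens

def test_load_from_disk_skips_short_entries():
    path = os.path.join(tempfile.mkdtemp(), "cache.json")
    writer = PrefixCache(min_prefix_tokens=8)
    writer.store("short", 16)   # clears the writer's floor...
    writer.store("long", 512)
    writer.save_to_disk(path)

    reader = PrefixCache(min_prefix_tokens=128)  # ...but not the reader's
    reader.load_from_disk(path)
    assert "short" not in reader.entries
    assert "long" in reader.entries

test_load_from_disk_skips_short_entries()
```

The key property under test is that the skip happens at restore time, not just at store time.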
Jan, assigning this to you for review. It propagates extended sampling defaults (`top_k`, `min_p`, presence/repetition penalty) end to end and adds a short-prefix-cache reuse guard. Mergeable and CI green.
janhilgard
left a comment
Thanks for the ping. The sampling-default propagation and prefix-cache floor are both clean additions. A few items to address before merge:
Must fix
1. Responses API missing new sampling defaults
`_prepare_responses_request` (lines 1723-1727) builds `chat_kwargs` with only `temperature` and `top_p` resolved. It's missing `top_k`, `min_p`, `presence_penalty`, and `repetition_penalty`:

```python
# Current (incomplete):
chat_kwargs = {
    "max_tokens": chat_request.max_tokens or _default_max_tokens,
    "temperature": _resolve_temperature(chat_request.temperature),
    "top_p": _resolve_top_p(chat_request.top_p),
}

# Should be:
chat_kwargs = {
    "max_tokens": chat_request.max_tokens or _default_max_tokens,
    "temperature": _resolve_temperature(chat_request.temperature),
    "top_p": _resolve_top_p(chat_request.top_p),
    "top_k": _resolve_top_k(chat_request.top_k),
    "min_p": _resolve_min_p(chat_request.min_p),
    "presence_penalty": _resolve_presence_penalty(chat_request.presence_penalty),
    "repetition_penalty": _resolve_repetition_penalty(chat_request.repetition_penalty),
}
```

Without this, `--default-top-k 20` etc. won't take effect on the `/v1/responses` endpoint.
Should fix
2. `CacheStats` doesn't surface short-prefix rejections
`miss_short_prefix` and `miss_short_lcp` are tracked as `_last_match_type` strings but counted as generic misses in `CacheStats`. Adding dedicated counters (or at least including the `_last_match_type` distribution in `to_dict()`) would help operators understand why the cache hit rate dropped after upgrading — they'd see "128 misses, of which 90 were short-prefix rejections" instead of just "128 misses".
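One way to surface the breakdown, sketched with a counter keyed by the existing match-type strings. The field and method names here are suggestions, not the current `CacheStats` API:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class CacheStats:
    """Sketch of cache stats with a per-match-type miss breakdown."""
    hits: int = 0
    misses: int = 0
    miss_types: Counter = field(default_factory=Counter)

    def record_miss(self, match_type: str) -> None:
        self.misses += 1
        self.miss_types[match_type] += 1

    def to_dict(self) -> dict:
        return {
            "hits": self.hits,
            "misses": self.misses,
            # Exposing the distribution lets operators see e.g.
            # "128 misses, of which 90 were short-prefix rejections".
            "miss_types": dict(self.miss_types),
        }

stats = CacheStats()
for _ in range(90):
    stats.record_miss("miss_short_prefix")
for _ in range(38):
    stats.record_miss("miss_no_match")
print(stats.to_dict()["miss_types"])
```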
3. No CLI flag for `min_prefix_tokens`
The 128-token default is reasonable, but operators can't tune it without code changes. Adding `--prefix-cache-min-tokens` (defaulting to 128) would follow the pattern of the other cache knobs like `--kv-cache-min-quantize-tokens`.
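The proposed flag could be wired up along these lines (a sketch assuming a standalone argparse parser; the surrounding server CLI setup is not shown in the diff):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--prefix-cache-min-tokens",
    type=int,
    default=128,
    help="Minimum prompt-prefix length, in tokens, eligible for "
         "prefix-cache storage and reuse.",
)

# Operators could then lower the floor for short-prompt workloads:
args = parser.parse_args(["--prefix-cache-min-tokens", "32"])
print(args.prefix_cache_min_tokens)  # 32
```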
Nits
4. `test_server.py` try/finally vs `monkeypatch`
The `TestSamplingDefaults` tests manually save/restore globals with try/finally. Since the test file already uses pytest, `monkeypatch.setattr` would be cleaner and safer if a test fails mid-execution:

```python
def test_extended_sampling_defaults(self, monkeypatch):
    monkeypatch.setattr(server, "_default_top_k", 20)
    monkeypatch.setattr(server, "_default_min_p", 0.05)
    ...
```

5. Squash suggestion
4 commits could be squashed into 1-2 for a cleaner history (one for sampling defaults, one for prefix cache floor), but this is maintainer preference.
What's good
- The
Nonevs falsy semantics fix is a real improvement —top_k=0now means "user explicitly requested 0" rather than being silently replaced. - Anthropic endpoint now correctly uses
_resolve_temperature/_resolve_top_p— this was a bug before. min_prefix_tokensenforcement is thorough: store, fetch (early bail + LCP check), and load_from_disk all covered.
CI is green 9/9. No merge conflicts.
Pushed 832735d addressing the review feedback. Must fix:
Should fix:
Tests:
Validation: 46/46
Supersedes #405 because the fork branch cannot be refreshed through the current OAuth token after upstream CI workflow changes landed. This branch is rebased on current `main` and carries the same scope: `top_k`, `min_p`, `presence_penalty`, and `repetition_penalty`. Local validation: