Fix prompt cache viability #1877

Open
zeljkokalezic wants to merge 4 commits into ikawrakow:main from zeljkokalezic:fix-swa-prompt-cache-viability

@zeljkokalezic

zeljkokalezic commented May 25, 2026


This takes a different approach than #1854. I addressed the review comments there and expanded the idea to avoid a brute-force reset path.

I also explored keeping loaded RAM-cache prompts instead of removing them, but abandoned that for this PR because doing it correctly needs lineage/eviction tracking and is not trivial.

Summary

This fixes RAM prompt-cache restore behavior when a server alternates between large prompts that share only a limited reusable prefix.

Previously, the server could select a RAM prompt-cache candidate by fuzzy similarity, load its saved state, and only later discover that the active KV range could not safely rewind to the reusable prefix. That bad cache hit could mutate the slot and then force full prompt reprocessing.

Main changes

  • Added --cache-ram-reuse-n-min, which controls the minimum reusable common-prefix length required when restoring a RAM prompt-cache candidate.
  • Saved prompt-cache entries now remember their KV position range: pos_min / pos_max.
  • RAM cache candidate selection now prefers the longest reusable common prefix, using similarity only as a tie-breaker.
  • A candidate is rejected before loading state if:
    • its shared prefix is too short,
    • its reusable fraction is below --cache-ram-similarity,
    • its SWA/KV state cannot rewind safely to that prefix.
  • Context checkpoint eviction now tries to keep the earliest checkpoint when possible, because it can be used as a rewind anchor.
  • Added a focused test for the prompt rewind viability helper.
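
The three rejection checks listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the struct and function names are hypothetical, the flag names prefix_ok / fraction_ok / rewind_ok are borrowed from the log output, comparing f_keep (lcp divided by the cached prompt length) against the fraction threshold is an assumption, and the rewind rule is deliberately simplified.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the server's types; the real ones may differ.
struct checkpoint_sketch {
    int32_t pos_min;
    int32_t pos_max;
};

struct candidate_sketch {
    int32_t n_tokens;  // tokens stored in the cached prompt
    int32_t pos_min;   // lowest KV position present in the saved state
    int32_t pos_max;   // highest KV position present in the saved state
    std::vector<checkpoint_sketch> checkpoints;
};

struct viability_sketch {
    bool prefix_ok   = false;
    bool fraction_ok = false;
    bool rewind_ok   = false;
    bool viable() const { return prefix_ok && fraction_ok && rewind_ok; }
};

// lcp: length of the common prefix between the cached prompt and the new prompt.
// n_min: minimum reusable prefix (the role played by --cache-ram-reuse-n-min).
// min_fraction: minimum reusable fraction (the role played by --cache-ram-similarity).
inline viability_sketch check_candidate(const candidate_sketch & c, int32_t lcp,
                                        int32_t n_min, float min_fraction) {
    assert(lcp >= 0 && lcp <= c.n_tokens);
    viability_sketch v;
    v.prefix_ok = lcp >= n_min;

    // Assumption: the fraction is lcp relative to the cached prompt length.
    const float f_keep = c.n_tokens > 0 ? (float) lcp / (float) c.n_tokens : 0.0f;
    v.fraction_ok = f_keep >= min_fraction;

    // Simplified assumption: the state can rewind if its saved KV range starts
    // at or before the prefix end, or if some checkpoint ends at or before it.
    v.rewind_ok = c.pos_min <= lcp;
    for (const auto & cp : c.checkpoints) {
        if (cp.pos_max <= lcp) {
            v.rewind_ok = true;
            break;
        }
    }
    return v;
}
```

The point of running all three checks before loading any state is that a rejected candidate never touches the slot, so a bad match cannot force a full reprocess.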

Log example:

INFO [              slots_idle] all slots are idle | tid="126489080299520" timestamp=1779742801
======== Prompt cache: cache size: 713, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 20480, f_keep: 0.00, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.004, sim = 0.000, lcp = 3, min_reusable_prefix = 20480, min_reusable_fraction = 0.500, n_keep = 0, n_discarded_prompt = 0
 - skipping prompt cache candidate: lcp = 15242, f_keep = 0.182, sim = 0.166, pos_min = 83713, checkpoints = 22, prefix_ok = 0, fraction_ok = 0, rewind_ok = 1
 - found better prompt with f_keep = 0.997, sim = 0.998, lcp = 99630, pos_min = 104387, checkpoints = 34, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 6119.235 MiB (limits: 16384.000 MiB, 0 tokens, 224140 est)
   - prompt 0x7307c95112a0:   83714 tokens,       0 discarded, checkpoints: 22,  6119.235 MiB
prompt cache load took 1727.54 ms
INFO [   launch_slot_with_task] slot is processing task | tid="126489080299520" timestamp=1779742803 id_slot=0 id_task=16446
======== Cache: cache_size = 104388, n_past0 =  88581, n_past1 =  88581, n_past_prompt1 = 88581,  n_past2 =  88581, n_past_prompt2 =  88581

zeljkokalezic changed the title from "Fix swa prompt cache viability" to "Fix prompt cache viability" on May 25, 2026
zeljkokalezic force-pushed the fix-swa-prompt-cache-viability branch from f986e50 to 6d56573 on May 25, 2026 21:17
ikawrakow requested a review from firecoperana on May 26, 2026 04:32
Comment thread examples/server/server-task.h Outdated
        return true;
    }
    for (const auto & checkpoint : checkpoints) {
        if (checkpoint.pos_max <= (llama_pos) lcp || checkpoint.pos_max_prompt <= (llama_pos) lcp) {

Collaborator

You only need to check checkpoint.pos_max <= (llama_pos) lcp.

Comment thread examples/server/server-task.cpp Outdated
            lcp_cur.first, f_keep_cur, sim_cur, it->pos_min, it->checkpoints.size(), (int) prefix_ok, (int) fraction_ok, (int) rewind_ok);
        continue;
    }
    if (lcp_best_tokens < lcp_cur.first || (lcp_best_tokens == lcp_cur.first && sim_best < sim_cur)) {

Collaborator

I don't think this is right. Suppose you have two cached prompts: prompt A has 1000 tokens, 11 of which are common with the new prompt, and prompt B has 100 tokens, 10 of which are common. With this ordering you will load prompt A and remove 989 tokens, when you should load prompt B and remove only 90 tokens. Similarity is the better criterion because it is more balanced.
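
To make the reviewer's arithmetic concrete, here is a small sketch comparing the two orderings. It assumes a Dice-style similarity, 2 * lcp / (n_new + n_cached), and a hypothetical new prompt of 100 tokens; the server's actual formula and all names below are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of a cached prompt: its length and its common prefix
// with the incoming prompt.
struct cached_prompt_sketch {
    size_t n_tokens;
    size_t lcp;
};

// Assumed similarity: penalizes candidates that are long relative to what they share.
inline float similarity_sketch(size_t lcp, size_t n_new, size_t n_cached) {
    const size_t denom = n_new + n_cached;
    return denom == 0 ? 0.0f : 2.0f * (float) lcp / (float) denom;
}

// Similarity first, longest common prefix as tie-breaker. Returns -1 if empty.
inline int pick_by_similarity(const std::vector<cached_prompt_sketch> & cands, size_t n_new) {
    int    best     = -1;
    float  sim_best = 0.0f;
    size_t lcp_best = 0;
    for (int i = 0; i < (int) cands.size(); ++i) {
        assert(cands[i].lcp <= cands[i].n_tokens);
        const float sim = similarity_sketch(cands[i].lcp, n_new, cands[i].n_tokens);
        if (best < 0 || sim > sim_best || (sim == sim_best && cands[i].lcp > lcp_best)) {
            best     = i;
            sim_best = sim;
            lcp_best = cands[i].lcp;
        }
    }
    return best;
}

// Longest common prefix first, similarity as tie-breaker (the ordering under review).
inline int pick_by_lcp(const std::vector<cached_prompt_sketch> & cands, size_t n_new) {
    int    best     = -1;
    float  sim_best = 0.0f;
    size_t lcp_best = 0;
    for (int i = 0; i < (int) cands.size(); ++i) {
        assert(cands[i].lcp <= cands[i].n_tokens);
        const float sim = similarity_sketch(cands[i].lcp, n_new, cands[i].n_tokens);
        if (best < 0 || cands[i].lcp > lcp_best || (cands[i].lcp == lcp_best && sim > sim_best)) {
            best     = i;
            sim_best = sim;
            lcp_best = cands[i].lcp;
        }
    }
    return best;
}
```

With candidates A = {1000, 11} and B = {100, 10} and a 100-token new prompt, the assumed similarity is 22/1100 = 0.02 for A and 20/200 = 0.10 for B, so the similarity-first ordering picks B (90 tokens discarded) while the LCP-first ordering picks A (989 tokens discarded).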

zeljkokalezic force-pushed the fix-swa-prompt-cache-viability branch from 6d56573 to 7a72157 on May 27, 2026 17:58
@zeljkokalezic

Author

Thanks a lot for your review! Addressed both points:

  • has_rewind_checkpoint() now only checks checkpoint.pos_max <= lcp.
  • RAM cache candidate selection now uses similarity first again, with LCP only as a tie-breaker.

@firecoperana

firecoperana commented May 27, 2026

Collaborator

I think that instead of using has_rewind_checkpoint to check whether there is a checkpoint to rewind to, you can return the largest pos <= lcp and use that number to calculate sim_cur and sim_best. For non-recurrent models it is always lcp, but for recurrent models this lets you find the candidate that requires the least amount of prompt reprocessing. Otherwise it looks good.
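
One way to read this suggestion is a helper that returns the effective reusable prefix rather than a boolean. The sketch below is an interpretation only: the function name, the needs_checkpoint flag, and the -1 sentinel are hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint record; the server's real struct may carry more fields.
struct checkpoint_sketch {
    int32_t pos_min;
    int32_t pos_max;
};

// Returns the number of tokens that can actually be reused after a rewind:
//  - lcp itself when the state can be truncated freely (assumed to hold for
//    non-recurrent models),
//  - otherwise the largest checkpoint pos_max that does not exceed lcp,
//  - or -1 when no checkpoint qualifies and the candidate should be skipped.
inline int32_t effective_reuse_sketch(const std::vector<checkpoint_sketch> & checkpoints,
                                      int32_t lcp, bool needs_checkpoint) {
    assert(lcp >= 0);
    if (!needs_checkpoint) {
        return lcp;
    }
    int32_t best = -1;
    for (const auto & cp : checkpoints) {
        if (cp.pos_max <= lcp && cp.pos_max > best) {
            best = cp.pos_max;
        }
    }
    return best;
}
```

Feeding this value, instead of the raw lcp, into the similarity calculation would rank candidates by how much prompt really has to be reprocessed, which is the comparison the comment asks for.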


2 participants