Fix prompt cache viability by zeljkokalezic · Pull Request #1877 · ikawrakow/ik_llama.cpp

zeljkokalezic · 2026-05-25T21:04:05Z

This takes a different approach than #1854. I addressed the review comments there and expanded the idea to avoid a brute-force reset path.

I also explored keeping loaded RAM-cache prompts instead of removing them, but abandoned that for this PR because doing it correctly needs lineage/eviction tracking and is not trivial.

Summary

This fixes RAM prompt-cache restore behavior when a server alternates between large prompts that share only a limited reusable prefix.

Previously, the server could select a RAM prompt-cache candidate by fuzzy similarity, load its saved state, and only later discover that the active KV range could not safely rewind to the reusable prefix. That bad cache hit could mutate the slot and then force full prompt reprocessing.

Main changes

- Added --cache-ram-reuse-n-min, which controls the minimum reusable common-prefix length required when restoring a RAM prompt-cache candidate.
- Saved prompt-cache entries now remember their KV position range: pos_min / pos_max.
- RAM cache candidate selection now prefers the longest reusable common prefix, using similarity only as a tie-breaker.
- A candidate is rejected before loading state if:
  - its shared prefix is too short,
  - its reusable fraction is below --cache-ram-similarity, or
  - its SWA/KV state cannot rewind safely to that prefix.
- Context checkpoint eviction now tries to keep the earliest checkpoint when possible, because it can be used as a rewind anchor.
- Added a focused test for the prompt rewind viability helper.
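
The three rejection conditions above can be sketched as a single pre-load check. This is a minimal illustration, not the PR's code: `candidate_viability`, `check_candidate`, and the parameter names are hypothetical, the rewind result is assumed to come from a separate helper, and the reusable fraction is assumed to be the shared prefix divided by the cached prompt length (the PR may measure it differently).

```cpp
#include <cstddef>

// Hypothetical names throughout; this is a sketch, not the PR's implementation.
struct candidate_viability {
    bool prefix_ok;    // shared prefix is long enough
    bool fraction_ok;  // reusable fraction meets the threshold
    bool rewind_ok;    // KV/SWA state can rewind to the shared prefix

    // A candidate should only be loaded when every check passes.
    bool viable() const { return prefix_ok && fraction_ok && rewind_ok; }
};

// lcp:          tokens shared between the new prompt and the cached prompt
// n_cached:     tokens in the cached prompt
// min_prefix:   stands in for --cache-ram-reuse-n-min
// min_fraction: stands in for --cache-ram-similarity
// can_rewind:   result of a separate rewind check (assumed to be given)
inline candidate_viability check_candidate(std::size_t lcp, std::size_t n_cached,
                                           std::size_t min_prefix, double min_fraction,
                                           bool can_rewind) {
    // Assumption: the reusable fraction is lcp / cached prompt length.
    const double fraction = n_cached > 0 ? (double) lcp / (double) n_cached : 0.0;
    return { lcp >= min_prefix, fraction >= min_fraction, can_rewind };
}
```

With the values of the skipped candidate in the log example (lcp = 15242, an 83714-token cached prompt, thresholds 20480 and 0.5), this sketch gives prefix_ok = 0 and fraction_ok = 0, which is consistent with the logged line.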

Log example:

INFO [              slots_idle] all slots are idle | tid="126489080299520" timestamp=1779742801
======== Prompt cache: cache size: 713, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 20480, f_keep: 0.00, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.004, sim = 0.000, lcp = 3, min_reusable_prefix = 20480, min_reusable_fraction = 0.500, n_keep = 0, n_discarded_prompt = 0
 - skipping prompt cache candidate: lcp = 15242, f_keep = 0.182, sim = 0.166, pos_min = 83713, checkpoints = 22, prefix_ok = 0, fraction_ok = 0, rewind_ok = 1
 - found better prompt with f_keep = 0.997, sim = 0.998, lcp = 99630, pos_min = 104387, checkpoints = 34, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 6119.235 MiB (limits: 16384.000 MiB, 0 tokens, 224140 est)
   - prompt 0x7307c95112a0:   83714 tokens,       0 discarded, checkpoints: 22,  6119.235 MiB
prompt cache load took 1727.54 ms
INFO [   launch_slot_with_task] slot is processing task | tid="126489080299520" timestamp=1779742803 id_slot=0 id_task=16446
======== Cache: cache_size = 104388, n_past0 =  88581, n_past1 =  88581, n_past_prompt1 = 88581,  n_past2 =  88581, n_past_prompt2 =  88581

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

firecoperana · 2026-05-26T21:37:35Z

+            return true;
+        }
+        for (const auto & checkpoint : checkpoints) {
+            if (checkpoint.pos_max <= (llama_pos) lcp || checkpoint.pos_max_prompt <= (llama_pos) lcp) {


You only need to check checkpoint.pos_max <= (llama_pos) lcp.
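
A minimal sketch of the simplified check, assuming a checkpoint record that carries only a position range. `checkpoint_sketch` and `has_rewind_checkpoint_sketch` are hypothetical stand-ins; the real helper's signature and its early-return cases are not shown in the diff fragment above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_pos = std::int32_t;  // llama.h defines llama_pos as a 32-bit integer

// Hypothetical stand-in for the server's checkpoint record.
struct checkpoint_sketch {
    llama_pos pos_min;
    llama_pos pos_max;
};

// True if at least one checkpoint ends at or before the shared prefix,
// so the state could be rewound to it. Only pos_max is consulted.
inline bool has_rewind_checkpoint_sketch(const std::vector<checkpoint_sketch> & checkpoints,
                                         std::size_t lcp) {
    return std::any_of(checkpoints.begin(), checkpoints.end(),
        [lcp](const checkpoint_sketch & c) { return c.pos_max <= (llama_pos) lcp; });
}
```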

firecoperana · 2026-05-26T22:01:03Z

+                lcp_cur.first, f_keep_cur, sim_cur, it->pos_min, it->checkpoints.size(), (int) prefix_ok, (int) fraction_ok, (int) rewind_ok);
+            continue;
+        }
+        if (lcp_best_tokens < lcp_cur.first || (lcp_best_tokens == lcp_cur.first && sim_best < sim_cur)) {


I don't think this is right. Suppose you have two cached prompts: prompt A has 1000 tokens with 11 common tokens, and prompt B has 100 tokens with 10 common tokens. You will load prompt A and remove 989 tokens, when you should load prompt B and remove 90 tokens. Similarity is better because it's more balanced.
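
The A/B example can be checked with a small sketch that compares the two orderings. The names here (`cached_prompt_sketch`, `kept_fraction`, `pick_candidate`) are hypothetical, and the kept fraction (shared prefix divided by cached prompt length) is only an assumed stand-in for the server's similarity measure.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical candidate record: cached prompt length and shared prefix length.
struct cached_prompt_sketch {
    std::size_t n_tokens;
    std::size_t lcp;
};

// Fraction of the cached prompt that would be kept after the rewind
// (assumed stand-in for the similarity score).
inline double kept_fraction(const cached_prompt_sketch & p) {
    return p.n_tokens > 0 ? (double) p.lcp / (double) p.n_tokens : 0.0;
}

// Returns the index of the preferred candidate, or -1 if the list is empty.
// lcp_first = true:  longest shared prefix wins, fraction breaks ties.
// lcp_first = false: highest fraction wins, shared prefix breaks ties.
inline int pick_candidate(const std::vector<cached_prompt_sketch> & prompts, bool lcp_first) {
    int best = -1;
    for (int i = 0; i < (int) prompts.size(); ++i) {
        if (best < 0) { best = i; continue; }
        const cached_prompt_sketch & a = prompts[best];
        const cached_prompt_sketch & b = prompts[i];
        const double fa = kept_fraction(a);
        const double fb = kept_fraction(b);
        const bool better = lcp_first
            ? (b.lcp > a.lcp || (b.lcp == a.lcp && fb > fa))
            : (fb > fa || (fb == fa && b.lcp > a.lcp));
        if (better) { best = i; }
    }
    return best;
}
```

For A = {1000 tokens, 11 common} and B = {100 tokens, 10 common}, the LCP-first ordering picks A and discards 989 tokens, while the fraction-first ordering picks B and discards 90.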

zeljkokalezic · 2026-05-27T17:59:34Z

Thanks a lot for your review! Addressed both points:

- has_rewind_checkpoint() now only checks checkpoint.pos_max <= lcp.
- RAM cache candidate selection now uses similarity first again, with LCP only as a tie-breaker.

firecoperana · 2026-05-27T21:08:16Z

I think instead of using has_rewind_checkpoint to check whether there is a checkpoint to rewind to, you can return the largest pos <= lcp and use that number to calculate sim_cur and sim_best. For non-recurrent models it's always lcp, but for recurrent models you can find the one that requires the least amount of prompt to reprocess. Otherwise it looks good.
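
One way to read this suggestion, as a hedged sketch: replace the boolean check with a function that returns the usable rewind position, and feed that value into the similarity calculation. `rewind_checkpoint`, `best_rewind_pos`, and the `needs_checkpoint` flag are hypothetical names introduced for illustration; how the server decides that a model needs a checkpoint to rewind is not specified here.

```cpp
#include <cstdint>
#include <vector>

using llama_pos = std::int32_t;  // llama.h defines llama_pos as a 32-bit integer

// Hypothetical stand-in for a saved context checkpoint.
struct rewind_checkpoint {
    llama_pos pos_max;
};

// Returns the position the cached state can actually be rewound to:
//  - if no checkpoint is needed (assumed for non-recurrent models): lcp itself
//  - otherwise: the largest checkpoint pos_max that is <= lcp, or -1 if none
inline llama_pos best_rewind_pos(const std::vector<rewind_checkpoint> & checkpoints,
                                 llama_pos lcp, bool needs_checkpoint) {
    if (!needs_checkpoint) {
        return lcp;
    }
    llama_pos best = -1;
    for (const rewind_checkpoint & c : checkpoints) {
        if (c.pos_max <= lcp && c.pos_max > best) {
            best = c.pos_max;
        }
    }
    return best;
}
```

A result of -1 plays the role of the earlier "no rewind checkpoint" rejection, and any other result is the prefix length a similarity score could be based on, so a candidate whose nearest checkpoint is far below lcp would rank lower than its raw lcp suggests.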

zeljkokalezic mentioned this pull request May 25, 2026

Honor minimum common prefix for prompt cache reuse #1854

Closed

zeljkokalezic changed the title from "Fix swa prompt cache viability" to "Fix prompt cache viability" May 25, 2026

zeljkokalezic force-pushed the fix-swa-prompt-cache-viability branch from f986e50 to 6d56573 Compare May 25, 2026 21:17

ikawrakow requested a review from firecoperana May 26, 2026 04:32

localweights mentioned this pull request May 26, 2026

server: preserve primary KV cache when MTP companion trim fails #1889

Closed

firecoperana requested changes May 26, 2026

View reviewed changes

zeljkokalezic added 4 commits May 27, 2026 19:54

Fix SWA prompt cache candidate viability

93f70c8

Document SWA prompt cache restore checks

be71f70

Separate RAM cache restore prefix threshold

1f3ab84

Address prompt cache review feedback

7a72157

zeljkokalezic force-pushed the fix-swa-prompt-cache-viability branch from 6d56573 to 7a72157 Compare May 27, 2026 17:58

zeljkokalezic wants to merge 4 commits into ikawrakow:main from zeljkokalezic:fix-swa-prompt-cache-viability
