Fix prompt cache viability #1877
f986e50 to 6d56573
        return true;
    }
    for (const auto & checkpoint : checkpoints) {
        if (checkpoint.pos_max <= (llama_pos) lcp || checkpoint.pos_max_prompt <= (llama_pos) lcp) {
You only need to check checkpoint.pos_max <= (llama_pos) lcp here.
            lcp_cur.first, f_keep_cur, sim_cur, it->pos_min, it->checkpoints.size(), (int) prefix_ok, (int) fraction_ok, (int) rewind_ok);
        continue;
    }
    if (lcp_best_tokens < lcp_cur.first || (lcp_best_tokens == lcp_cur.first && sim_best < sim_cur)) {
I don't think this is right. Suppose there are two cached prompts: prompt A has 1000 tokens, 11 of them common with the new prompt, and prompt B has 100 tokens, 10 of them common. With this comparison you will load prompt A and remove 989 tokens, when you should load prompt B and remove 90 tokens. Similarity is the better criterion because it is more balanced.
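To make the arithmetic in this comment concrete, here is a minimal sketch that compares the two selection policies on the A/B example. The `cached_prompt` struct and all function names are hypothetical, and `keep_fraction` (common tokens divided by cached tokens) is only an assumed stand-in for a balanced metric; it is not necessarily the server's similarity formula.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a cached prompt; not the server's real type.
struct cached_prompt {
    std::string name;
    int n_tokens;  // tokens stored in the cached prompt
    int n_common;  // common-prefix length with the incoming prompt
};

// Tokens that would have to be removed after loading this candidate.
static int n_discarded(const cached_prompt & p) {
    return p.n_tokens - p.n_common;
}

// Assumed balanced metric: the fraction of the cached prompt that is reusable.
static double keep_fraction(const cached_prompt & p) {
    return double(p.n_common) / double(p.n_tokens);
}

// Policy 1: prefer the longest common prefix, ignoring the cached prompt size.
static const cached_prompt & pick_by_lcp(const std::vector<cached_prompt> & ps) {
    assert(!ps.empty());
    const cached_prompt * best = &ps.front();
    for (const auto & p : ps) {
        if (p.n_common > best->n_common) {
            best = &p;
        }
    }
    return *best;
}

// Policy 2: prefer the candidate with the highest reusable fraction.
static const cached_prompt & pick_by_fraction(const std::vector<cached_prompt> & ps) {
    assert(!ps.empty());
    const cached_prompt * best = &ps.front();
    for (const auto & p : ps) {
        if (keep_fraction(p) > keep_fraction(*best)) {
            best = &p;
        }
    }
    return *best;
}
```

With the candidates `{"A", 1000, 11}` and `{"B", 100, 10}`, `pick_by_lcp` returns A (989 tokens discarded) while `pick_by_fraction` returns B (90 tokens discarded), which is the outcome the comment argues for.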
6d56573 to 7a72157
Thanks a lot for your review! Addressed both points:
I think instead of using |
This takes a different approach than #1854. I addressed the review comments there and expanded the idea to avoid a brute-force reset path.
I also explored keeping loaded RAM-cache prompts instead of removing them, but abandoned that for this PR because doing it correctly needs lineage/eviction tracking and is not trivial.
Summary
This fixes RAM prompt-cache restore behavior when a server alternates between large prompts that share only a limited reusable prefix.
Previously, the server could select a RAM prompt-cache candidate by fuzzy similarity, load its saved state, and only later discover that the active KV range could not safely rewind to the reusable prefix. That bad cache hit could mutate the slot and then force full prompt reprocessing.
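The idea of rejecting a non-viable candidate before its state is loaded can be sketched as follows. This is only an illustration under stated assumptions: the flag names `prefix_ok`, `fraction_ok`, and `rewind_ok` are borrowed from the log line in the diff above, but what each one tests here is a guess, and `checkpoint_info`, `candidate_info`, `restore_limits`, `has_rewind_checkpoint`, and `is_viable` are hypothetical names. The sketch also leaves out any early-return case the real check may have before the checkpoint loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_pos = int32_t; // local stand-in for the typedef in llama.h

// Hypothetical, reduced view of a saved checkpoint.
struct checkpoint_info {
    llama_pos pos_max; // highest position covered by the checkpoint
};

// Hypothetical summary of one RAM prompt-cache candidate.
struct candidate_info {
    size_t n_common;                          // common prefix with the new prompt
    float  sim;                               // similarity score of the candidate
    std::vector<checkpoint_info> checkpoints; // saved rewind points
};

// Hypothetical thresholds, e.g. filled from command-line options.
struct restore_limits {
    size_t reuse_n_min; // minimum reusable common-prefix length
    float  sim_min;     // minimum similarity
};

// Assumption: a rewind is possible if some checkpoint ends at or before the
// reusable prefix.
static bool has_rewind_checkpoint(const std::vector<checkpoint_info> & cps, size_t lcp) {
    for (const auto & cp : cps) {
        if (cp.pos_max <= (llama_pos) lcp) {
            return true;
        }
    }
    return false;
}

// Decide viability up front, so a bad candidate never mutates the slot.
static bool is_viable(const candidate_info & c, const restore_limits & lim) {
    const bool prefix_ok   = c.n_common >= lim.reuse_n_min;
    const bool fraction_ok = c.sim >= lim.sim_min;
    const bool rewind_ok   = has_rewind_checkpoint(c.checkpoints, c.n_common);
    return prefix_ok && fraction_ok && rewind_ok;
}
```

A caller would run `is_viable` on every candidate and load only those that pass, so a candidate whose checkpoints all lie beyond the common prefix is skipped instead of being loaded and then discarded.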
Main changes
--cache-ram-reuse-n-min, which controls the minimum reusable common-prefix length required when restoring a RAM prompt-cache candidate.
pos_min/pos_max.
--cache-ram-similarity,
Log example: