server: preserve primary KV cache when MTP companion trim fails #1889
Closed
localweights wants to merge 1 commit into
The pre-batch reset path in server-context partially trims both the target ctx and (if MTP is enabled) its speculative companion ctx at p0 = system + n_past. Either failure currently triggers a full reset that nukes cache_tokens, slot.n_past, n_prompt_tokens_cache, the checkpoint list, and the sampler state. Companion failures are common after generation because unvalidated draft tokens leave the companion KV's position layout out of sync with the primary's. Sacrificing the primary cache for that recoverable mismatch forces a full re-prefill on the next request, even though the primary KV trim succeeded. This change splits the fallback: when only the companion fails, wipe just the companion (it repopulates during the next prefill) and keep the primary cache + checkpoints intact. The full-reset path remains in place for when the primary itself fails to trim (non-Transformer fall-through case the comment alludes to). Validated on Qwen3.6-27B + --multi-token-prediction --draft-max 3: 92% prefix-cache reuse on multi-pass synthesis vs 0% before this change.
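The split fallback described above can be sketched roughly as follows. This is a minimal model of the decision, not the actual diff: `SlotCache`, `TrimFallback`, `choose_fallback`, `apply_fallback`, and the `companion_cleared` flag are hypothetical names introduced for illustration, and the sketch assumes that a full reset also clears the companion.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-slot state named in the description.
// Field names mirror the prose; the real structures in the server differ.
struct SlotCache {
    std::vector<int32_t> cache_tokens;
    int32_t n_past = 0;
    int32_t n_prompt_tokens_cache = 0;
    std::vector<int32_t> checkpoints;   // checkpoint positions (illustrative)
    bool sampler_reset = false;         // set when the sampler state is reset
    bool companion_cleared = false;     // set when the companion KV is wiped
};

enum class TrimFallback { Keep, CompanionOnly, FullReset };

// Decide which fallback applies from the two trim results.
// When MTP is disabled there is no companion, so callers would pass
// companion_trimmed = true (assumption of this sketch).
inline TrimFallback choose_fallback(bool target_trimmed, bool companion_trimmed) {
    if (!target_trimmed) {
        return TrimFallback::FullReset;      // primary failed: conservative path
    }
    if (!companion_trimmed) {
        return TrimFallback::CompanionOnly;  // recoverable: primary stays valid
    }
    return TrimFallback::Keep;               // both trims succeeded
}

inline void apply_fallback(SlotCache & slot, TrimFallback fb) {
    switch (fb) {
        case TrimFallback::Keep:
            break;
        case TrimFallback::CompanionOnly:
            // Wipe only the companion; it repopulates during the next prefill.
            slot.companion_cleared = true;
            break;
        case TrimFallback::FullReset:
            // Original behavior: drop everything tied to the primary cache.
            slot.cache_tokens.clear();
            slot.n_past = 0;
            slot.n_prompt_tokens_cache = 0;
            slot.checkpoints.clear();
            slot.sampler_reset = true;
            slot.companion_cleared = true;   // assumed: full reset clears both
            break;
    }
}
```

Keeping the decision in one pure function would make it straightforward to reuse at a second trim-fallback site, should one be added later.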
Owner
Can you provide a reproduction where trimming one context succeeds but trimming the other fails?
Owner
Add an issue with reproduction. After that you can resubmit the PR.
Summary
The pre-batch reset path in server_context partially trims both the target ctx and (when MTP is enabled) its speculative companion ctx at p0 = system + n_past. The existing logic treats either trim failure as a reason to nuke cache_tokens, slot.n_past, n_prompt_tokens_cache, the checkpoint list, and reset the sampler.
Companion failures happen routinely after generation: unvalidated draft tokens leave the companion KV's position layout out of sync with primary. Sacrificing the primary cache for a recoverable mismatch confined to the draft ctx forces a full re-prefill on the next request, defeating the entire point of prefix caching when MTP is on.
Change
Split the fallback into two paths:
target_trimmed && !companion_trimmed → wipe only the companion (it repopulates during the next prefill); leave the primary cache + checkpoints + sampler state intact.
!target_trimmed → unchanged conservative full reset (the original non-Transformer fall-through case that the existing comment alludes to).
Validation
Tested on Qwen3.6-27B + --multi-token-prediction --draft-max 3 + --reasoning on. Combined with #1888 (qwen3next checkpoint reuse), multi-pass synthesis goes from 0% prefix-cache reuse to 92% reuse on shared-prefix follow-up calls. Without this patch the companion-trim failure path still wiped the primary cache and undid the checkpoint fix.
Note
This patch addresses the pre-batch reset site that exists in current main. PR #1877 (Fix prompt cache viability) introduces a similar trim-fallback at the second post-prefix-match site; the same split should be applied there when that PR lands.