server: gate llama_decode_stop() so a queued request's cancel can't abort the active decode#1941
Draft
slundell wants to merge 1 commit into
Draft
Conversation
…cel cascade) With --parallel 1, a client disconnect/timeout on a *queued* request aborts the *active* decode of a different client (llama_decode: failed to decode, ret = -3 / "Decode process is cancelled by user"), releasing the slot with the request unfinished. To the active client the stream silently stalls and never returns, while the server reports healthy — easy to misdiagnose as a network/proxy wedge. Root cause: llama_decode_stop() signals a process-global stop flag that the active decode loop polls. examples/server/server.cpp calls it *ungated* from the request reader's connection-closed paths, so any reader closing (including a queued, not-yet-running task's) trips the global flag against whatever decode is currently active. Adjacent to ikawrakow#1576/ikawrakow#1673 ("clear sticky stop flag" + hybrid/recurrent ret=-3), which did not gate these call sites against non-active readers, so the queued-cancel-kills-active cascade still fires on current main. Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the three llama_decode_stop() sites on it, so the global stop is signalled only when one of THIS reader's tasks is on a slot (the active decode). A queued task's disconnect then only drops that queued task. Verified in production under heavy concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero active-decode kills). Stdlib-only reproducer in the PR description. Caveat: any_task_on_slot() reads the slots vector from the reader thread — the same race class as the existing process-global flag; can be tightened to a per-context/per-task cancellation if preferred.
ikawrakow
approved these changes
Jun 9, 2026
Owner
|
The PR is set to draft, is something still missing? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
With
llama-server --parallel 1, a client disconnect/timeout on a request that is queued (not the one currently decoding) aborts the active decode belonging to a different client:The active slot is then released with the request unfinished. From the active client's side the stream silently stalls and never returns (no error; backend
requests_processingdrops to 0; the pod stays healthy), which is easy to misdiagnose as a network/proxy wedge. In short: one client's routine cancel kills another client's in-flight generation. Any setup where more than one request can be in flight against a single-slot server — e.g. an agent that issues auxiliary/background completions concurrently with a long main turn against the same endpoint — hits this whenever a queued call is cancelled.Root cause
llama_decode_stop()signals a process-global "stop decode" flag that the active decode loop polls and returns-3on.examples/server/server.cppcalls it ungated from the request reader's connection-closed paths — there is no check that the closing reader's task is the one actually on a slot. So when a queued task's reader disconnects, it takes the connection-closed branch, callsllama_decode_stop(), and trips the global flag against the active decode (a different task) → theret = -3cascade.This is adjacent to #1576 / #1673 ("clear sticky stop flag" + hybrid/recurrent prompt-cache
ret = -3). #1673 fixed the sticky-flag / hybrid-checkpoint facets but did not gate thesellama_decode_stop()call sites against non-active readers, so the queued-cancel-kills-active cascade still fires on currentmain.This PR (minimal gate)
Adds
server_response_reader::any_task_on_slot()— true only when one of this reader's tasks is currently on a slot (the active decode) — and gates the threellama_decode_stop()call sites on it. A queued task's disconnect then only drops that queued task and never touches the global flag; the active decode of another task is left alone.+15 / −3, no behavior change for the active-decode path.Verified in production under heavy concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero active-decode kills observed).
This is the "minimal" of the two directions; the deeper alternative is to replace the process-global stop flag with a per-context / per-task cancellation so a cancel can only ever target its own decode (removes the footgun class entirely, more invasive). Happy to reshape toward that if you'd prefer.
Caveat:
any_task_on_slot()reads theslotsvector from the reader thread — the same race class as the existing process-global flag (slots are not resized at runtime). Can be tightened if you'd like.Reproduction
Single-slot server (any GGUF; small is fine):
The reproducer (stdlib-only) fires request A (large prompt → active decode), then opens a second connection, sends request B (queues behind A), and hard-closes B's socket — a normal client disconnect of the queued task.
Before (buggy): A dies shortly after B is closed —
The cancelled task (
51) and the released task (31) are different.After (this PR): A completes normally; B's disconnect drops only the queued task.
repro_cancel_cascade.py (Python 3 stdlib only)