server: gate llama_decode_stop() so a queued request's cancel can't abort the active decode #1941

Draft
slundell wants to merge 1 commit into ikawrakow:main from slundell:fix/cancel-cascade-gate-decode-stop

@slundell commented Jun 9, 2026

What

With llama-server --parallel 1, a client disconnect/timeout on a request that is queued (not the one currently decoding) aborts the active decode belonging to a different client:

llama_decode: failed to decode, ret = -3
Decode process is cancelled by user.

The active slot is then released with the request unfinished. From the active client's side the stream silently stalls and never returns (no error; backend requests_processing drops to 0; the pod stays healthy), which is easy to misdiagnose as a network/proxy wedge. In short: one client's routine cancel kills another client's in-flight generation. Any setup where more than one request can be in flight against a single-slot server — e.g. an agent that issues auxiliary/background completions concurrently with a long main turn against the same endpoint — hits this whenever a queued call is cancelled.

Root cause

llama_decode_stop() sets a process-global "stop decode" flag. The active decode loop polls that flag and returns -3 when it is set. examples/server/server.cpp calls llama_decode_stop() ungated from the request reader's connection-closed paths: there is no check that the closing reader's task is the one actually on a slot. So when a queued task's reader disconnects, it takes the connection-closed branch, calls llama_decode_stop(), and trips the global flag against the active decode (a different task), producing the ret = -3 cascade.
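
To make the mechanism concrete, here is a toy Python model of the control flow described above. It is not the server's C++ code: `stop_flag`, `decode_stop`, `run_decode`, and `on_connection_closed` are hypothetical stand-ins, and the task ids are borrowed from the log excerpt in the Reproduction section.

```python
import threading

# Toy stand-in for the process-global "stop decode" flag (hypothetical name).
stop_flag = threading.Event()


def decode_stop():
    """Model of an ungated stop call: sets the single shared flag."""
    stop_flag.set()


def run_decode(task_id, n_steps, events=None):
    """Model of the active decode loop: polls the shared flag every step.

    `events` maps a step number to a callback, which lets the demo inject
    a client disconnect in the middle of the decode.
    """
    events = events or {}
    for step in range(n_steps):
        if step in events:
            events[step]()
        if stop_flag.is_set():
            stop_flag.clear()
            return (task_id, -3, step)   # aborted, mirrors ret = -3
    return (task_id, 0, n_steps)         # finished normally


def on_connection_closed(reader_task_id):
    """Ungated connection-closed path: never checks whose task is decoding."""
    decode_stop()


# Task 31 is the active decode; queued task 51 disconnects at step 40.
result = run_decode(31, 100, {40: lambda: on_connection_closed(51)})
print(result)  # (31, -3, 40)
```

The disconnect belongs to task 51, yet task 31 is the one that returns -3, because the flag carries no notion of which task it was meant for.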

This is adjacent to #1576 / #1673 ("clear sticky stop flag" + hybrid/recurrent prompt-cache ret = -3). #1673 fixed the sticky-flag / hybrid-checkpoint facets but did not gate these llama_decode_stop() call sites against non-active readers, so the queued-cancel-kills-active cascade still fires on current main.

This PR (minimal gate)

Adds server_response_reader::any_task_on_slot() — true only when one of this reader's tasks is currently on a slot (the active decode) — and gates the three llama_decode_stop() call sites on it. A queued task's disconnect then only drops that queued task and never touches the global flag; the active decode of another task is left alone. +15 / −3, no behavior change for the active-decode path.
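
A minimal sketch of the gate's logic, again as a toy Python model rather than the actual patch. The `Slot` and `Reader` classes, the `-1` idle marker, and the returned strings are assumptions made for illustration; only the name `any_task_on_slot` comes from the PR.

```python
import threading
from dataclasses import dataclass

# Toy stand-in for the process-global stop flag (hypothetical name).
stop_flag = threading.Event()


@dataclass
class Slot:
    task_id: int = -1   # -1 means idle (an assumption of this sketch)


@dataclass
class Reader:
    task_ids: set       # ids of the tasks this reader is waiting on
    slots: list         # shared list of Slot objects

    def any_task_on_slot(self):
        """True only if one of this reader's tasks currently occupies a slot."""
        return any(slot.task_id in self.task_ids for slot in self.slots)

    def on_connection_closed(self):
        """Gated connection-closed path: signal stop only for our own decode."""
        if self.any_task_on_slot():
            stop_flag.set()
            return "stop signalled"
        return "queued task dropped"


slots = [Slot(task_id=31)]          # single slot, task 31 is decoding
queued = Reader({51}, slots)        # reader of the queued task
active = Reader({31}, slots)        # reader of the active task

r_queued = queued.on_connection_closed()
flag_after_queued = stop_flag.is_set()
r_active = active.on_connection_closed()
flag_after_active = stop_flag.is_set()
print(r_queued, flag_after_queued)  # queued task dropped False
print(r_active, flag_after_active)  # stop signalled True
```

In this model the queued reader's disconnect leaves the flag untouched, while the active reader's disconnect still stops its own decode, which matches the "no behavior change for the active-decode path" claim.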

Verified in production under heavy concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero active-decode kills observed).

This is the more minimal of two possible directions. The deeper alternative is to replace the process-global stop flag with per-context or per-task cancellation, so that a cancel can only ever target its own decode. That removes the whole footgun class but is more invasive. Happy to reshape toward that if you'd prefer.
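
For comparison, the per-task direction could look roughly like the following toy Python model. Everything here is hypothetical (`CancelRegistry`, `register`, `cancel`, `run_decode`); it only illustrates the idea that each decode polls its own token instead of a shared flag.

```python
import threading


class CancelRegistry:
    """Hypothetical per-task cancellation: one Event per task id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tokens = {}

    def register(self, task_id):
        """Create (or fetch) the cancellation token for a task."""
        with self._lock:
            return self._tokens.setdefault(task_id, threading.Event())

    def cancel(self, task_id):
        """Set only the token that belongs to `task_id`, if it exists."""
        with self._lock:
            token = self._tokens.get(task_id)
        if token is not None:
            token.set()


def run_decode(task_id, token, n_steps, events=None):
    """Decode loop that polls its own token rather than a global flag."""
    events = events or {}
    for step in range(n_steps):
        if step in events:
            events[step]()
        if token.is_set():
            return (task_id, -3, step)
    return (task_id, 0, n_steps)


registry = CancelRegistry()
active_token = registry.register(31)   # active task
registry.register(51)                  # queued task

# Cancelling the queued task cannot reach the active decode ...
survived = run_decode(31, active_token, 100, {40: lambda: registry.cancel(51)})
# ... while cancelling the active task itself still stops it.
cancelled = run_decode(31, active_token, 100, {40: lambda: registry.cancel(31)})
print(survived)   # (31, 0, 100)
print(cancelled)  # (31, -3, 40)
```

With this shape no call site needs a gate, since a cancel has no way to name another task's decode.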

Caveat: any_task_on_slot() reads the slots vector from the reader thread — the same race class as the existing process-global flag (slots are not resized at runtime). Can be tightened if you'd like.
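
One way the read could be tightened is to put slot ownership behind a lock, sketched here as a toy Python model. `SlotTable` and its methods are hypothetical, and the actual server may prefer a different synchronization scheme.

```python
import threading


class SlotTable:
    """Hypothetical slot-ownership table guarded by a single lock."""

    def __init__(self, n_slots):
        self._lock = threading.Lock()
        self._task_ids = [None] * n_slots   # None means the slot is idle

    def assign(self, slot_id, task_id):
        with self._lock:
            self._task_ids[slot_id] = task_id

    def release(self, slot_id):
        with self._lock:
            self._task_ids[slot_id] = None

    def any_task_on_slot(self, reader_task_ids):
        """Consistent snapshot: is any of the reader's tasks on a slot?"""
        with self._lock:
            return any(t in reader_task_ids
                       for t in self._task_ids if t is not None)


table = SlotTable(1)
table.assign(0, 31)
checks = [table.any_task_on_slot({51}), table.any_task_on_slot({31})]
table.release(0)
checks.append(table.any_task_on_slot({31}))
print(checks)  # [False, True, False]
```

A lock like this makes the lookup itself consistent, but the slot can still change between the check and the subsequent stop signal; closing that window is what the per-task cancellation direction would address.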

Reproduction

Single-slot server (any GGUF; small is fine):

llama-server -m <model.gguf> --parallel 1 --host 127.0.0.1 --port 8080 --ctx-size 8192
python3 repro_cancel_cascade.py

The reproducer (stdlib-only) fires request A (large prompt → active decode), then opens a second connection, sends request B (queues behind A), and hard-closes B's socket — a normal client disconnect of the queued task.

Before (buggy): A dies shortly after B is closed —

slot is processing task | id_slot=0 id_task=31     # active = A
srv  stop: cancel task, id_task = 51               # queued = B, client disconnected
llama_decode: failed to decode, ret = -3
Decode process is cancelled by user.
release_slots | id_slot=0 id_task=31 n_past=5990    # A killed by B's cancel

The cancelled task (51) and the released task (31) are different.
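
The mismatch can also be checked mechanically. The sketch below scans server log lines for a cancel, a `ret = -3` abort, and a slot release, and reports cases where the released task differs from the cancelled one. It assumes the log lines look like the excerpt above; the regular expressions and the `find_cascades` helper are illustrative and may need adjusting for other log formats.

```python
import re

# Patterns modelled on the log excerpt; real logs may be formatted differently.
CANCEL_RE = re.compile(r"cancel task, id_task = (\d+)")
RELEASE_RE = re.compile(r"release_slots \| id_slot=\d+ id_task=(\d+)")


def find_cascades(log_lines):
    """Return (cancelled_id, released_id) pairs where the two ids differ."""
    cascades = []
    last_cancel = None   # id of the most recently cancelled task
    aborted = False      # saw "ret = -3" since that cancel
    for line in log_lines:
        m = CANCEL_RE.search(line)
        if m:
            last_cancel, aborted = int(m.group(1)), False
            continue
        if "ret = -3" in line:
            aborted = True
            continue
        m = RELEASE_RE.search(line)
        if m and aborted and last_cancel is not None:
            released = int(m.group(1))
            if released != last_cancel:
                cascades.append((last_cancel, released))
            last_cancel, aborted = None, False
    return cascades


sample = [
    "slot is processing task | id_slot=0 id_task=31",
    "srv  stop: cancel task, id_task = 51",
    "llama_decode: failed to decode, ret = -3",
    "Decode process is cancelled by user.",
    "release_slots | id_slot=0 id_task=31 n_past=5990",
]
cascades = find_cascades(sample)
print(cascades)  # [(51, 31)]
```

An empty result on a log captured during the reproducer run would be consistent with the fixed behaviour, though it is only a heuristic and does not replace reading the log.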

After (this PR): A completes normally; B's disconnect drops only the queued task.

repro_cancel_cascade.py (Python 3 stdlib only)
#!/usr/bin/env python3
"""
Minimal reproducer for the cancel-cascade in ik_llama.cpp's llama-server with
`--parallel 1`. When a client disconnects from a QUEUED request, the server
calls the process-global llama_decode_stop(), which aborts the ACTIVE decode of
a *different* client (ret = -3) and releases the slot. Stdlib only.

  llama-server -m <model.gguf> --parallel 1 --host 127.0.0.1 --port 8080 --ctx-size 8192
  python3 repro_cancel_cascade.py

Env: SERVER_URL, MODEL, PROMPT_TOKENS (default 4000), CANCEL_AFTER_S (default 8).
Buggy build: the active request DIES shortly after the queued request is closed.
Fixed build: the active request completes; the queued disconnect is harmless.
"""
import json, os, socket, threading, time, urllib.parse, urllib.request

URL = os.environ.get("SERVER_URL", "http://127.0.0.1:8080/v1/chat/completions")
MODEL = os.environ.get("MODEL", "default")
PROMPT_TOKENS = int(os.environ.get("PROMPT_TOKENS", "4000"))
CANCEL_AFTER_S = float(os.environ.get("CANCEL_AFTER_S", "8"))

_filler = "The quick brown fox jumps over the lazy dog. " * (PROMPT_TOKENS // 9 + 1)
A_BODY = json.dumps({
    "model": MODEL, "stream": False, "max_tokens": 512, "temperature": 0,
    "messages": [{"role": "user",
                  "content": _filler + "\nNow count slowly from 1 to 500, one number per line."}],
}).encode()
B_BODY = json.dumps({
    "model": MODEL, "max_tokens": 8,
    "messages": [{"role": "user", "content": "ping"}],
}).encode()

fg = {"outcome": None, "ms": None}


def foreground():  # request A — becomes the ACTIVE decode
    t0 = time.monotonic()
    try:
        req = urllib.request.Request(
            URL, data=A_BODY, method="POST",
            headers={"Content-Type": "application/json", "Connection": "close"})
        with urllib.request.urlopen(req, timeout=600) as r:
            r.read()
        fg["outcome"] = "OK"
    except Exception as e:
        fg["outcome"] = f"DIED: {type(e).__name__}: {str(e)[:60]}"
    fg["ms"] = int((time.monotonic() - t0) * 1000)


def background_queue_then_cancel():  # request B — queues behind A, then client-disconnects
    time.sleep(CANCEL_AFTER_S)
    u = urllib.parse.urlparse(URL)
    s = socket.create_connection((u.hostname, u.port or 80), timeout=5)
    raw = (f"POST {u.path} HTTP/1.1\r\nHost: {u.hostname}\r\n"
           f"Content-Type: application/json\r\nContent-Length: {len(B_BODY)}\r\n"
           f"Connection: close\r\n\r\n").encode() + B_BODY
    s.sendall(raw)   # B now QUEUED behind the active A (--parallel 1)
    print(f"[{time.monotonic():.0f}] queued request B sent; waiting 4s, then hard-close (client cancel)")
    time.sleep(4)
    s.shutdown(socket.SHUT_RDWR)
    s.close()        # client disconnect == cancel of the QUEUED task
    print(f"[{time.monotonic():.0f}] B hard-closed")


print(f"target={URL}  prompt~{PROMPT_TOKENS}tok  cancel B after {CANCEL_AFTER_S}s")
ta = threading.Thread(target=foreground)
tb = threading.Thread(target=background_queue_then_cancel)
ta.start(); tb.start(); ta.join(); tb.join()

print(f"\nforeground (active request): {fg['outcome']} in {fg['ms']}ms")
if fg["outcome"] and fg["outcome"].startswith("DIED"):
    print("VERDICT: CASCADE REPRODUCED — a queued task's cancel killed the active decode "
          "(confirm `ret = -3` + release of the active id_task in the server log).")
else:
    print("VERDICT: foreground survived. Either the build is fixed, or A finished before B "
          "was cancelled — raise PROMPT_TOKENS and/or lower CANCEL_AFTER_S and retry.")

…cel cascade)

With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.

Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. Adjacent to ikawrakow#1576/ikawrakow#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3), which did not gate these call sites against non-active
readers, so the queued-cancel-kills-active cascade still fires on current main.

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.

Caveat: any_task_on_slot() reads the slots vector from the reader thread — the
same race class as the existing process-global flag; can be tightened to a
per-context/per-task cancellation if preferred.

@ikawrakow (Owner)

@slundell

The PR is set to draft, is something still missing?
