fix(ui): unblock stuck question prompt when interruptions race #457

Open
omercnet wants to merge 2 commits into NeuralNomadsAI:dev from omercnet:fix/question-prompt-stuck-on-loading

@omercnet (Contributor) commented May 16, 2026

Summary

When an agent issues a question while a permission prompt is ahead in the interruption queue (or during a brief SSE disconnect right after a reply), the inline question block can render with options visible but every input disabled and the Submit button hidden. The state is indistinguishable from "the system is still loading" and the user can neither pick an option nor dismiss the prompt.

This PR makes the inline <QuestionToolBlock> interactive iff the v2 message store agrees the question is the current interruption, and otherwise renders an explicit "Queued" banner explaining that another interruption is ahead. Submitting clears the prompt immediately on success even if the server's confirming SSE event is delayed; on failure the prompt is restored and the existing error path is used.

This likely overlaps with #448, which reports the same user-facing symptom from the Electron side ("Waiting for earlier responses" deadlock with parallel subagents). Happy to coordinate — see the note at the bottom.

Implementation Overview

  • Unified active-state source. New pure helper packages/ui/src/components/tool-call/question-active.ts derives "is this question active" from the v2 message store only (head of the v2 question queue and no permission interruption ahead). tool-call.tsx::isQuestionActive now uses it. The legacy activeInterruption signal is preserved for the permission approval modal and the notification banner — only the inline prompt's gating changes.
  • Honest queued UI. <QuestionToolBlock> renders a dedicated queued-state branch (label + hint, no inputs, no spinner, no Submit) when props.active() === false and the request is still pending. The dead legacy queuedText fallback was removed.
  • Optimistic clear with rollback. instances.ts::sendQuestionReply / sendQuestionReject snapshot the v2 entry, remove it before the network call, and restore the snapshot on failure. Closes the post-submit transient window where the legacy queue was cleared but the v2 entry was still rendered until the SSE confirmation arrived.
  • Diagnostic logs. Four log points using the existing getLogger module: interruption.active.changed, question.reply.start / question.reject.start (plus optimistic-clear / rollback), question.asked with a duplicate boolean, and question.answered with a localStoreHadEntry boolean. Ids and booleans only; no answer text or attachments.
  • i18n. New keys toolCall.question.queuedLabel and toolCall.question.queuedHint added to all seven locales (en, es, fr, he, ja, ru, zh-Hans).
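
The "head of the v2 question queue and no permission interruption ahead" rule from the first bullet can be sketched as a pure function. This is a minimal sketch, not the contents of question-active.ts: the PendingInterruption shape, the single ordered queue, and the function signature are assumptions made for illustration.

```typescript
// Hypothetical shape of a pending interruption; the real v2 store types
// are not shown in this PR description.
interface PendingInterruption {
  id: string;
  kind: "question" | "permission";
}

/**
 * Returns true only when `questionId` is the first question in the queue
 * and no permission interruption appears before it.
 */
function isQuestionActive(
  queue: readonly PendingInterruption[],
  questionId: string,
): boolean {
  for (const entry of queue) {
    // A permission reached before any question blocks every question.
    if (entry.kind === "permission") return false;
    // The first question found is the head; only it can be active.
    return entry.id === questionId;
  }
  return false; // empty queue: nothing is active
}
```

Because the function takes plain data and returns a boolean, the permission-ahead and trailing-question cases can be checked without mounting any UI.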

Edge Cases / Platform Considerations

  • Web only. Electron and Tauri shells are not touched.
  • Permission interruption ahead of a question (the most common stuck-state path): the new helper returns false, so the queued banner renders instead of a misleading disabled radio list.
  • SSE reconnect drops the confirming question.replied event: the optimistic clear has already removed the v2 entry, so the prompt does not redraw in a stuck state on reconnect.
  • Multi-question queue within a session: only the head question is active; trailing questions render the queued banner.
  • Permission approval modal and notification banner: intentionally unchanged; they still read from activeInterruption, so cross-cutting behavior is preserved.
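
The queued-banner cases above reduce to a small decision over two flags. The sketch below is illustrative only: the PromptView names and the promptView function are hypothetical and do not come from QuestionToolBlock itself.

```typescript
// Hypothetical view states for the inline question block.
type PromptView = "interactive" | "queued" | "settled";

/**
 * `pending`: the question request has not been answered or rejected yet.
 * `active`:  the store says this question is the current interruption.
 */
function promptView(pending: boolean, active: boolean): PromptView {
  if (!pending) return "settled"; // already answered or rejected
  return active ? "interactive" : "queued"; // queued: banner, no inputs, no Submit
}
```

The point of the explicit "queued" state is that a pending but inactive question never falls back to a disabled radio list.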

Validation

  • 23/23 UI tests pass under node:test via tsx, including two new test files:
    • packages/ui/src/components/tool-call/question-active.test.ts (4 cases, including the permission-ahead-of-question case).
    • packages/ui/src/stores/question-optimistic-clear.test.ts (3 cases covering the optimistic clear and the rollback on network failure).
  • tsc --noEmit clean for packages/ui.
  • vite build clean for packages/ui.
  • Manual verification on the web build:
    1. Active question prompt unchanged.
    2. Queued banner renders when another interruption is ahead.
    3. Submit clears the prompt immediately even when the SSE confirmation is delayed.

Repro (likely the same path as #448)

I do not have a 100% deterministic repro — the timing is racy — but the structural defect that produces the symptom is reachable along these paths:

  1. Permission-ahead-of-question (highest confidence): tool that requires permission fires while a question is also pending, or in quick succession. activeInterruption points at the permission; the question's tool part still mounts so its options render; the inline block's old gate (which read activeInterruption) disables every input. The user sees options and a missing Submit — visually identical to "loading."
  2. SSE reconnect right after a reply: client sends question.reply, legacy queue clears, network blips before the confirming question.replied event arrives. The v2 entry is still present and rendered; the legacy gate no longer marks it active.
  3. Multiple questions queued in the same session: trailing entries render with the same disabled-radio appearance, no indication they are queued.

If #448's environment can produce two near-simultaneous questions (parallel subagents) plus any permission interruption, paths 1 and 3 stack and explain "Waiting for earlier responses" with options that cannot be answered.

Coordination with #448

Apologies for the noise in the first iteration of this PR — it included internal task-tracking and evidence artifacts that absolutely should not have been in an upstream review. Force-pushed a clean version: only product code, tests, and i18n now.

Happy to defer to your in-flight investigation on #448 if you would prefer to land a single fix that covers both reports, or to rebase on top of whatever direction you take. The fix here is intentionally narrow (inline-block gating only; permission modal and banner untouched) precisely so it composes with other work in the same area.

@shantur (Collaborator) commented May 16, 2026

Hi @omercnet,

Thanks for the PR.
This PR can't be merged in its current state because it contains a lot of unrelated changes.

Would you be able to let me know how to reproduce the issue you are facing?
I am looking at another issue related to stuck question prompts, #448, and a repro would help.

@omercnet force-pushed the fix/question-prompt-stuck-on-loading branch from 7c14ce0 to 2030fae on May 16, 2026 13:05

@omercnet (Contributor, Author)

Hi @shantur — thanks for the quick review and apologies for the noise. You were absolutely right; the first push contained internal task-tracking and evidence artifacts that had no business being in an upstream review. Just force-pushed a clean version: only product code, tests, and i18n now (14 files, +413/-8 instead of the previous 24 files / +1379/-8).

On repro and the overlap with #448 — I think they are very likely the same underlying bug. I do not have a deterministic repro either; the symptom is racy by nature. But the structural defect I am targeting in this PR is reachable along three paths:

  1. Permission-ahead-of-question (my main suspect, and consistent with @WolfgangFahl's "options visible, cannot answer" screenshots): a tool needing permission fires near a question. The legacy activeInterruption signal points at the permission while the question's tool part still mounts. The inline question block was reading activeInterruption for its enable/submit gate — so options render but every input is disabled and Submit is hidden. Visually identical to "loading."
  2. SSE reconnect right after a reply: client sends question.reply, legacy queue clears, network blips before the confirming question.replied SSE event arrives. The v2 entry is still rendered and the legacy gate no longer marks it active. Matches the "stuck after answering one" reports.
  3. Multiple queued questions in one session (parallel subagents, which directly matches the trigger reported in #448): trailing entries render with the same disabled-radio appearance and no indication that they are queued. To the user this looks like a deadlock.

The fix is intentionally narrow:

  • Inline <QuestionToolBlock> now derives "active" from the v2 message store only (head of v2 question queue and no permission ahead). The permission approval modal and the notification banner continue to read activeInterruption so other surfaces are untouched.
  • A real "Queued" banner replaces the silently-disabled radio list when a question is pending but not at the head.
  • sendQuestionReply / sendQuestionReject clear the v2 entry optimistically with rollback on failure, so SSE delays cannot leave the prompt rendered after a successful answer.
  • Four diagnostic log points (ids/booleans only — no answer content) to make this state observable in production telemetry going forward.
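
The optimistic clear with rollback described above follows a snapshot, remove, send, restore-on-failure pattern. The sketch below shows that pattern in isolation, assuming a hypothetical QuestionStore class and a `send` callback that stands in for the HTTP reply; neither is the real instances.ts or v2 store API.

```typescript
// Hypothetical minimal store; the real v2 message store is not shown here.
interface QuestionEntry {
  id: string;
  prompt: string;
}

class QuestionStore {
  private entries = new Map<string, QuestionEntry>();

  add(entry: QuestionEntry): void {
    this.entries.set(entry.id, entry);
  }
  get(id: string): QuestionEntry | undefined {
    return this.entries.get(id);
  }
  remove(id: string): void {
    this.entries.delete(id);
  }
}

/**
 * Clears the entry before the network call and restores it if the call
 * rejects. The error is re-thrown so an existing error path can handle it.
 */
async function replyOptimistically(
  store: QuestionStore,
  id: string,
  send: () => Promise<void>,
): Promise<void> {
  const snapshot = store.get(id); // may be undefined if already cleared
  store.remove(id); // optimistic clear: the prompt disappears immediately
  try {
    await send();
  } catch (error) {
    if (snapshot) store.add(snapshot); // rollback so the user can retry
    throw error;
  }
}
```

One caveat of this simplified store: re-adding to a Map appends the entry at the end, so an implementation backed by an ordered queue would also need to restore the entry at its original position.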

Re: #448 specifically — if @WolfgangFahl can share .opencode/opencode.json (or ~/.config/opencode/opencode.json) per your earlier request, that would help confirm path 1. The "two identical questions waiting for each other" screenshot in the latest comment lines up with path 3.

Happy to:

Whatever is easiest on your side.

@shantur (Collaborator) commented May 16, 2026

Hey @omercnet,

Thanks. I tried to reproduce this with multiple variations and permissions, but it seems my settings or environment aren't reproducing it.

Would you be able to try to reproduce this with some prompts in your environment in a new session? Hopefully it will be easier there.

User-visible behavior change:

Previously, when an agent issued a question while a permission prompt was
ahead in the interruption queue (or during a brief SSE disconnect right
after a reply), the inline question block could render with options
visible but every input disabled and the Submit button hidden. The state
looked indistinguishable from "the system is still loading" and the user
could neither pick an option nor dismiss the prompt.

After this change, the inline <QuestionToolBlock> is interactive iff the
v2 message store agrees that the question is the current interruption,
and otherwise renders an explicit "Queued" banner with a short hint
explaining that another interruption is ahead. Submitting or dismissing
the prompt clears it from the UI immediately on success, even if the
server's confirming SSE event is delayed; on failure the prompt is
restored and the error surfaces through the existing path.

Implementation approach:

- tool-call.tsx::isQuestionActive was rewritten to derive its result
  from the v2 message store only. The rule lives in a new pure helper at
  packages/ui/src/components/tool-call/question-active.ts: a question is
  active when it is the head of the v2 question queue AND no permission
  interruption is ahead in the v2 store. The legacy activeInterruption
  signal is preserved for cross-cutting consumers (permission approval
  modal, banner) but no longer gates the inline prompt; that split was
  the structural defect causing the symptom.
- <QuestionToolBlock> now renders a dedicated queued-state branch
  (label + hint, no inputs, no spinner, no Submit) instead of a fully
  disabled radio list when props.active() === false and the request is
  still pending. The dead legacy queuedText fallback was removed.
- instances.ts::sendQuestionReply and sendQuestionReject snapshot the
  v2 entry, call removeQuestionV2 before the network request, await the
  HTTP reply, and restore the snapshot on rejection. This closes the
  post-submit transient window where the legacy queue had been cleared
  but the v2 entry was still rendered until the SSE confirmation
  arrived.
- Four diagnostic log points are added using the existing getLogger
  module (no new logger introduced): interruption.active.changed,
  question.reply.start / question.reject.start plus their optimistic
  clear / rollback events, question.asked with a duplicate boolean,
  and question.answered with a localStoreHadEntry boolean. Payloads
  contain ids and booleans only; answer text and attachments are never
  logged.
- New i18n keys toolCall.question.queuedLabel and
  toolCall.question.queuedHint were added to every locale under
  packages/ui/src/lib/i18n/messages/ (en, es, fr, he, ja, ru, zh-Hans).

Edge cases and platform considerations:

- Permission interruption ahead of a question: the new helper rule
  returns false, so the queued banner renders instead of a misleading
  disabled radio list.
- SSE reconnect drops the confirming question.replied event: the
  optimistic clear has already removed the v2 entry, so the prompt
  does not redraw in a stuck state.
- Multi-question queue within a session: only the head question is
  active; trailing questions render the queued banner.
- Permission approval modal and notification banner are intentionally
  untouched and continue to read from activeInterruption.

Validation:

- 28/29 UI tests pass under node:test via tsx (one pre-existing
  failure in session-status.test.ts on dev, unrelated to this change).
  Three test files cover task 059:
    - question-active.test.ts (4 cases, unit-level coverage of the
      new helper including the permission-ahead branch).
    - question-optimistic-clear.test.ts (3 cases covering the v2 store
      remove/restore invariants the rollback path depends on).
    - question-concurrency.test.ts (5 scenario-level cases reproducing
      the three failure modes called out in the investigation and
      observed in issue NeuralNomadsAI#448: back-to-back questions, permission ahead
      of a question, post-submit lifecycle with delayed SSE
      confirmation, rollback after a failed reply, and permission
      ahead of a multi-question queue).
- tsc --noEmit clean for packages/ui.
- vite build clean for packages/ui.
- Manual verification on the web build: active prompt unchanged, new
  queued banner renders when another interruption is ahead, submit
  clears prompt immediately even when SSE confirmation is delayed.

Related: likely overlaps with the user-facing symptom reported in NeuralNomadsAI#448.
@omercnet force-pushed the fix/question-prompt-stuck-on-loading branch from 2030fae to d33a041 on May 16, 2026 14:17

@omercnet (Contributor, Author)

@shantur — quick update:

Rebased the branch on the latest dev and added a 3rd test file that reproduces the three failure modes as scenario-level tests against the v2 message store: packages/ui/src/components/tool-call/question-concurrency.test.ts (5 cases).

What it covers:

  1. Two questions back-to-back in the same session: the head is interactive and the trailing one renders the queued banner. Mirrors the "parallel subagents" trigger in #448.
  2. Permission interruption arriving alongside a question — the v2 store correctly keeps the inline gate false while the permission is ahead. Mirrors @WolfgangFahl's "options visible, cannot answer" screenshots. The test also asserts the gate flips to true immediately after the permission resolves, with no SSE replay.
  3. Post-submit lifecycle with a delayed SSE confirmation — optimistic clear happens on reply, and the confirming question.replied event arriving later is idempotent.
  4. Rollback after a failed reply — the v2 entry is restored and the user can retry.
  5. Permission ahead of a multi-question queue — after the permission clears only the head becomes interactive; the trailing question stays queued.
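
The idempotence claim in case 3 can be illustrated with a handler that tolerates a confirmation for an entry that is already gone. This is a sketch under assumptions: the onQuestionReplied function and the Map-based `pending` collection are hypothetical stand-ins, and only the localStoreHadEntry name is taken from the diagnostic logs described in this PR.

```typescript
// Hypothetical handler for a confirming "question.replied" event.
// `pending` stands in for the v2 question entries, keyed by question id.
function onQuestionReplied(
  pending: Map<string, { prompt: string }>,
  questionId: string,
): { questionId: string; localStoreHadEntry: boolean } {
  // Map.prototype.delete returns true only if the key was present,
  // so a late or repeated event leaves the store unchanged.
  const localStoreHadEntry = pending.delete(questionId);
  return { questionId, localStoreHadEntry };
}
```

After an optimistic clear, a delayed confirmation would report localStoreHadEntry as false, which is the signal the question.answered log point is described as recording.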

These reproduce the bug at the store level deterministically — i.e. they document exactly what state shape produced the symptom and lock in the post-fix behavior. I tried to drive a UI-level repro in a live session in my own environment too but couldn't trigger it on demand either; the timing window is small and depends on the specific SSE ordering for that session. The structural defect is unambiguous from the code paths though (and from your colleague's screenshots in #448), and the store-level reproductions above pin it down precisely.

Test results on this branch: 28/29 UI tests pass under node:test via tsx. The one failure is the pre-existing session-status.test.ts on dev (Vite-only import.meta.env referenced under tsx --test) — unrelated to this PR; reproduced on clean dev.

PR is now at d33a041, mergeable, no conflicts with current dev. Happy to keep iterating, defer to your #448 work, or split into smaller commits if any of that would help.

@shantur (Collaborator) commented May 16, 2026

Hey @omercnet,

I need to be able to reproduce this bug locally. Can you help me find a prompt or conditions that reproduce it?
Without a reproduction, I won't be able to confirm whether the fixes are valid.

Can you try a few prompts to get into this situation, for example by asking the model to generate multiple questions in one tool call, or multiple tool calls, or whatever you think is causing the issue?

@omercnet (Contributor, Author)

@shantur — the bug reproduced in my own session about 90 minutes ago, and I dug the forensics out of the opencode SQLite store. Sharing here because I think this gives both you and @WolfgangFahl a concrete trigger you can use (cross-tagging Wolfgang so he sees the recipe in case it matches what he hit on PyExpireBackups in #448).

What happened in my session

Live evidence pulled from ~/.local/share/opencode/opencode.db:

| Field | Value |
| --- | --- |
| Session | ses_1cee4eca8ffeDiKR9IknX5kVFL |
| Stuck question part | prt_e3149826f001tsNhKHK1cQo5f2 (tool=question) |
| Part created | 2026-05-16T14:56:01.647Z |
| Last update | 2026-05-16T14:56:08.096Z (then nothing for ~30 min until I noticed it) |
| Server status | running (never transitioned to completed) |
| state.output | undefined |
| state.time.end | missing |
| Follow-up assistant messages | zero |

The question was born running and stayed running forever. The reply never reached the server — state.output was never written.
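
The evidence above suggests a simple predicate for spotting such parts in exported session data. The sketch below is a guess at a usable shape, not the opencode schema: the ToolPart interface, the `updatedAt` field, and the idle threshold are all assumptions; only the status, output, and time.end conditions come from the observations listed here.

```typescript
// Hypothetical, simplified view of a tool part; field names are assumptions.
interface ToolPart {
  id: string;
  tool: string;
  updatedAt: number; // epoch milliseconds of the last update
  state: {
    status: "running" | "completed" | "error";
    output?: string;
    time: { start?: number; end?: number };
  };
}

/**
 * Flags a question part that is still "running", has no output and no
 * end time, and has been idle for at least `idleMs`.
 */
function looksStuck(part: ToolPart, now: number, idleMs: number): boolean {
  return (
    part.tool === "question" &&
    part.state.status === "running" &&
    part.state.output === undefined &&
    part.state.time.end === undefined &&
    now - part.updatedAt >= idleMs
  );
}
```

A threshold of a few minutes would separate a question the user is still reading from one that was never answerable, though the right value depends on how the session is used.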

The trigger

41 seconds before the stuck question, an MCP tool errored. Reconstructed timeline:

14:55:20.252Z  context-mode_ctx_batch_execute  →  status: "error"
                                                   error: "Not connected"
14:55:26.439Z  (error state propagated through SSE)
14:55:29…57Z   7 normal bash tool calls ran successfully
14:55:59.681Z  new assistant message turn starts (step-start, text)
14:56:01.647Z  question tool fires  →  status: "running"   <-- STUCK
14:56:08.096Z  state.time.start written (last activity)
…             (nothing more, ever)

A second stuck question in another session today (ses_21d7701caffeGLJ2cZs0tbBrY4, "Descope") follows the same pattern: context-mode_ctx_fetch_and_index returned "Not connected" ~7 minutes before its question got stuck. And while I was writing this very comment, a third question on the PMA session that found all this got stuck the same way, so I have a fresh third data point with identical shape.

UI symptom I observed (matches the PR's target)

The prompt rendered. The options were visible but the radios looked disabled and there was no Submit button. I literally couldn't click my answer — exactly the F-1 symptom this PR targets:

  • the v2 message store had the question entry (so the prompt mounted and options painted),
  • but activeInterruption was pointing at something stale from the errored tool's cleanup path,
  • so the legacy gate disabled every input and hid Submit.

After this PR's fix, the inline <QuestionToolBlock> no longer reads activeInterruption — it reads only the v2 store, which had the correct entry. The prompt would have been interactive and I could have replied.

Reproducer for you and Wolfgang

This is a deterministic-enough recipe that you can drive without DeepSeek:

1. Configure any MCP server in opencode.jsonc (or use context-mode if you have it).
2. Start a session, get the agent to call the MCP tool.
3. While the MCP call is in flight, kill the MCP server process or
   sever its socket so the call returns "Not connected".
4. Continue the session normally — the agent will keep working with bash/other tools.
5. Have the agent invoke the `question` tool.

→ The question renders stuck: options visible, radios disabled, no Submit.
→ Server-side: state.status="running", state.output=undefined, no time.end.

Without the MCP-tool failure trigger you won't see it — which is why your environment doesn't reproduce. @WolfgangFahl's recipe (DeepSeek + the two prompts) hits the same UI defect via a different trigger path (model emits parallel question tool calls).

Scope of this PR

This PR fixes the UI-side gate so the prompt is interactive whenever the v2 store has the question (and no permission is ahead) — independent of whatever stale legacy state the prior errored tool left behind. It also closes the post-submit transient window with an optimistic clear.

What it does not fix: the precise reason activeInterruption ends up stale after an errored MCP tool. I think that's a separate concern (and possibly worth a follow-up issue), because once the inline gate is decoupled from activeInterruption the user can always reply, the SDK gets the answer, and the session unwedges. The legacy signal can stay stale safely — only the permission modal and notification banner read it, and they don't gate question replies.

Happy to file a follow-up issue for the activeInterruption-after-errored-tool cleanup gap, separate from this PR. Or to coordinate with whatever you have in mind for #448 — your call on the shape.

@shantur (Collaborator) commented May 16, 2026

@omercnet

What model / provider do you use?

@shantur (Collaborator) commented May 16, 2026

Also, are you using Electron or Tauri builds?

@omercnet (Contributor, Author)

@shantur — both answers:

Model / provider: claude-opus-4-7 via anthropic (direct Anthropic API — not OpenRouter, not via any proxy).

Build: Web, not Electron and not Tauri. Specifically: @neuralnomads/codenomad v0.16.0 installed via npm/mise, running headless under tmux:

codenomad --dangerously-skip-auth --host <tailscale-ip>

and accessed from a browser pointed at that host. So it's the packages/server + packages/ui path — same one I scoped this PR to.

Both stuck sessions today (ses_1cee4eca8ffe… and ses_21d7701caffe…) used the same setup: web build, Anthropic Opus 4.7, product_manager agent. The common thread isn't the model — it's that the product_manager agent is MCP-tool-heavy (lots of context-mode_* tool calls), so it hits the trigger condition more often than a vanilla developer/chat agent would.

That also explains why Wolfgang reproduces on a different setup (Electron + DeepSeek/OpenRouter) — the UI defect is the same, but his trigger is parallel question tool calls from the model rather than an upstream MCP-tool failure. Both paths end at the same broken UI state.

If you can stub or kill an MCP server connection mid-call in your local env (any MCP tool, doesn't have to be context-mode), and then have the agent call the question tool a few seconds later, I'd expect you to see the same stuck prompt on the dev baseline, and a working prompt on this PR's d33a041 head.

The Comment PR Artifacts workflow was timing out after ~12 minutes
(30 iterations × 10s sleep) while PR Build Validation runs regularly
take 17+ minutes, causing every PR to show a failing CI check.

Increase polling to 60 iterations × 20s sleep (~20 min max) so the
comment workflow reliably waits for the full build to complete.

Co-authored-by: openhands <openhands@all-hands.dev>

@shantur (Collaborator) commented May 16, 2026

@omercnet try #465
