fix(ui): unblock stuck question prompt when interruptions race #457
omercnet wants to merge 2 commits into
Force-pushed: 7c14ce0 to 2030fae
Hi @shantur, thanks for the quick review and apologies for the noise. You were absolutely right; the first push contained internal task-tracking and evidence artifacts that had no business being in an upstream review. Just force-pushed a clean version: only product code, tests, and i18n now (14 files, +413/-8 instead of the previous 24 files / +1379/-8).
On repro and the overlap with #448: I think they are very likely the same underlying bug. I do not have a deterministic repro either; the symptom is racy by nature. But the structural defect I am targeting in this PR is reachable along three paths:
The fix is intentionally narrow:
Re: #448 specifically, if @WolfgangFahl can share
Happy to:
Whatever is easiest on your side.
Hey @omercnet, thanks. I did try to reproduce this with multiple variations and permissions, but it seems my settings or environment aren't reproducing it. Would you be able to try to reproduce this with some prompts in your environment in a new session? Hopefully it will be easier there.
User-visible behavior change:
Previously, when an agent issued a question while a permission prompt was
ahead in the interruption queue (or during a brief SSE disconnect right
after a reply), the inline question block could render with options
visible but every input disabled and the Submit button hidden. The state
looked indistinguishable from "the system is still loading" and the user
could neither pick an option nor dismiss the prompt.
After this change, the inline <QuestionToolBlock> is interactive iff the
v2 message store agrees that the question is the current interruption,
and otherwise renders an explicit "Queued" banner with a short hint
explaining that another interruption is ahead. Submitting or dismissing
the prompt clears it from the UI immediately on success, even if the
server's confirming SSE event is delayed; on failure the prompt is
restored and the error surfaces through the existing path.
Implementation approach:
- tool-call.tsx::isQuestionActive was rewritten to derive its result
from the v2 message store only. The rule lives in a new pure helper at
packages/ui/src/components/tool-call/question-active.ts: a question is
active when it is the head of the v2 question queue AND no permission
interruption is ahead in the v2 store. The legacy activeInterruption
signal is preserved for cross-cutting consumers (permission approval
modal, banner) but no longer gates the inline prompt; that split was
the structural defect causing the symptom.
- <QuestionToolBlock> now renders a dedicated queued-state branch
(label + hint, no inputs, no spinner, no Submit) instead of a fully
disabled radio list when props.active() === false and the request is
still pending. The dead legacy queuedText fallback was removed.
- instances.ts::sendQuestionReply and sendQuestionReject snapshot the
v2 entry, call removeQuestionV2 before the network request, await the
HTTP reply, and restore the snapshot on rejection. This closes the
post-submit transient window where the legacy queue had been cleared
but the v2 entry was still rendered until the SSE confirmation
arrived.
- Four diagnostic log points are added using the existing getLogger
module (no new logger introduced): interruption.active.changed,
question.reply.start / question.reject.start plus their optimistic
clear / rollback events, question.asked with a duplicate boolean,
and question.answered with a localStoreHadEntry boolean. Payloads
contain ids and booleans only; answer text and attachments are never
logged.
- New i18n keys toolCall.question.queuedLabel and
toolCall.question.queuedHint were added to every locale under
packages/ui/src/lib/i18n/messages/ (en, es, fr, he, ja, ru, zh-Hans).
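The snapshot / remove / restore sequence described for sendQuestionReply and sendQuestionReject can be sketched as follows. This is a minimal illustration, not the code in instances.ts: the `questionsV2` map, the `QuestionEntry` shape, and the `replyWithOptimisticClear` name are hypothetical stand-ins, and `send` is an assumed callback wrapping the HTTP request.

```typescript
// Hypothetical stand-in for the v2 question store (a plain Map keyed by request id).
type QuestionEntry = { id: string; sessionId: string };
const questionsV2 = new Map<string, QuestionEntry>();

// Optimistically clear the prompt, then roll back if the HTTP reply fails.
// `send` is an assumed callback that performs the network request.
async function replyWithOptimisticClear(
  id: string,
  send: () => Promise<void>,
): Promise<void> {
  const snapshot = questionsV2.get(id); // snapshot before touching the store
  questionsV2.delete(id);               // clear the UI entry before the request
  try {
    await send();
  } catch (err) {
    // Restore the prompt so the user can retry, then rethrow so the
    // caller's existing error handling still sees the failure.
    if (snapshot) questionsV2.set(id, snapshot);
    throw err;
  }
}
```

Because the delete runs before the first `await`, the entry is gone as soon as the call returns its promise, which is what keeps the prompt from lingering while the confirming event is delayed. One caveat of this sketch: a `Map` re-inserts a restored entry at the end of its iteration order, so a store that relies on queue order would need to restore the entry at its original position.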
Edge cases and platform considerations:
- Permission interruption ahead of a question: the new helper rule
returns false, so the queued banner renders instead of a misleading
disabled radio list.
- SSE reconnect drops the confirming question.replied event: the
optimistic clear has already removed the v2 entry, so the prompt
does not redraw in a stuck state.
- Multi-question queue within a session: only the head question is
active; trailing questions render the queued banner.
- Permission approval modal and notification banner are intentionally
untouched and continue to read from activeInterruption.
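The gating rule behind the first and third edge cases can be sketched as a pure function. This is a sketch under stated assumptions, not the contents of question-active.ts: the `InterruptionSnapshot` shape and the `isQuestionActiveV2` and `promptState` names are hypothetical, and the sketch assumes that any pending permission in the v2 store counts as being ahead of the question.

```typescript
// Hypothetical snapshot of the v2 store for one session (shape assumed for illustration).
interface InterruptionSnapshot {
  questionQueue: string[];      // pending question ids, oldest first
  pendingPermissions: string[]; // pending permission ids
}

// A question is active only when it heads the question queue
// and no permission interruption is pending.
function isQuestionActiveV2(questionId: string, snap: InterruptionSnapshot): boolean {
  if (snap.pendingPermissions.length > 0) return false;
  return snap.questionQueue[0] === questionId;
}

type PromptState = "active" | "queued";

// The inline block would render inputs only in the "active" state
// and the queued banner otherwise.
function promptState(questionId: string, snap: InterruptionSnapshot): PromptState {
  return isQuestionActiveV2(questionId, snap) ? "active" : "queued";
}
```

With a queue of `["q1", "q2"]` and no pending permission, `q1` is active and `q2` is queued; adding a pending permission moves both to the queued state, so neither renders as a disabled radio list.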
Validation:
- 28/29 UI tests pass under node:test via tsx (one pre-existing
failure in session-status.test.ts on dev, unrelated to this change).
Three test files cover task 059:
- question-active.test.ts (4 cases, unit-level coverage of the
new helper including the permission-ahead branch).
- question-optimistic-clear.test.ts (3 cases covering the v2 store
remove/restore invariants the rollback path depends on).
- question-concurrency.test.ts (5 scenario-level cases reproducing
the three failure modes called out in the investigation and
observed in issue NeuralNomadsAI#448: back-to-back questions, permission ahead
of a question, post-submit lifecycle with delayed SSE
confirmation, rollback after a failed reply, and permission
ahead of a multi-question queue).
- tsc --noEmit clean for packages/ui.
- vite build clean for packages/ui.
- Manual verification on the web build: active prompt unchanged, new
queued banner renders when another interruption is ahead, submit
clears prompt immediately even when SSE confirmation is delayed.
Related: likely overlaps with the user-facing symptom reported in NeuralNomadsAI#448.
Force-pushed: 2030fae to d33a041
@shantur, quick update: Rebased the branch on the latest
What it covers:
These reproduce the bug at the store level deterministically, i.e. they document exactly what state shape produced the symptom and lock in the post-fix behavior. I tried to drive a UI-level repro in a live session in my own environment too but couldn't trigger it on demand either; the timing window is small and depends on the specific SSE ordering for that session. The structural defect is unambiguous from the code paths though (and from your colleague's screenshots in #448), and the store-level reproductions above pin it down precisely.
Test results on this branch: 28/29 UI tests pass under
PR is now at
Hey @omercnet, I need to be able to reproduce this bug locally. Can you help me find a prompt or conditions that reproduce it? Can you try a few prompts to get into this situation, by asking the model to generate multiple questions in one tool call, or multiple tool calls, or whatever you think is causing the issue?
@shantur, the bug reproduced in my own session about 90 minutes ago, and I dug the forensics out of the opencode SQLite store. Sharing here because I think this gives both you and @WolfgangFahl a concrete trigger you can use (cross-tagging Wolfgang so he sees the recipe in case it matches what he hit on
What happened in my session
Live evidence pulled from
The question was born
The trigger
41 seconds before the stuck question, an MCP tool errored. Reconstructed timeline:
A second stuck question in another session today (
UI symptom I observed (matches the PR's target)
The prompt rendered. The options were visible but the radios looked disabled and there was no Submit button. I literally couldn't click my answer, which is exactly the F-1 symptom this PR targets:
After this PR's fix, the inline
Reproducer for you and Wolfgang
This is a deterministic-enough recipe that you can drive without DeepSeek:
Without the MCP-tool failure trigger you won't see it, which is why your environment doesn't reproduce. @WolfgangFahl's recipe (DeepSeek + the two prompts) hits the same UI defect via a different trigger path (model emits parallel question tool calls).
Scope of this PR
This PR fixes the UI-side gate so the prompt is interactive whenever the v2 store has the question (and no permission is ahead), independent of whatever stale legacy state the prior errored tool left behind. It also closes the post-submit transient window with an optimistic clear.
What it does not fix: the precise reason
Happy to file a follow-up issue for the
What model / provider do you use?
Also, are you using Electron or Tauri builds?
@shantur, both answers:
Model / provider:
Build: Web, not Electron and not Tauri. Specifically:
and accessed from a browser pointed at that host. So it's the
Both stuck sessions today (
That also explains why Wolfgang reproduces on a different setup (Electron + DeepSeek/OpenRouter): the UI defect is the same, but his trigger is parallel question tool calls from the model rather than an upstream MCP-tool failure. Both paths end at the same broken UI state.
If you can stub or kill an MCP server connection mid-call in your local env (any MCP tool, doesn't have to be
The Comment PR Artifacts workflow was timing out after ~12 minutes (30 iterations × 10s sleep) while PR Build Validation runs regularly take 17+ minutes, causing every PR to show a failing CI check.
Increase polling to 60 iterations × 20s sleep (~20 min max) so the comment workflow reliably waits for the full build to complete.
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
When an agent issues a question while a permission prompt is ahead in the interruption queue (or during a brief SSE disconnect right after a reply), the inline question block can render with options visible but every input disabled and the Submit button hidden. The state is indistinguishable from "the system is still loading" and the user can neither pick an option nor dismiss the prompt.
This PR makes the inline <QuestionToolBlock> interactive iff the v2 message store agrees the question is the current interruption, and otherwise renders an explicit "Queued" banner explaining that another interruption is ahead. Submitting clears the prompt immediately on success even if the server's confirming SSE event is delayed; on failure the prompt is restored and the existing error path is used.
This likely overlaps with #448, which reports the same user-facing symptom from the Electron side ("Waiting for earlier responses" deadlock with parallel subagents). Happy to coordinate; see the note at the bottom.
Implementation Overview
- packages/ui/src/components/tool-call/question-active.ts derives "is this question active" from the v2 message store only (head of the v2 question queue and no permission interruption ahead). tool-call.tsx::isQuestionActive now uses it. The legacy activeInterruption signal is preserved for the permission approval modal and the notification banner; only the inline prompt's gating changes.
- <QuestionToolBlock> renders a dedicated queued-state branch (label + hint, no inputs, no spinner, no Submit) when props.active() === false and the request is still pending. The dead legacy queuedText fallback was removed.
- instances.ts::sendQuestionReply / sendQuestionReject snapshot the v2 entry, remove it before the network call, and restore the snapshot on failure. Closes the post-submit transient window where the legacy queue was cleared but the v2 entry was still rendered until the SSE confirmation arrived.
- getLogger module: interruption.active.changed, question.reply.start / question.reject.start (plus optimistic-clear / rollback), question.asked with a duplicate boolean, and question.answered with a localStoreHadEntry boolean. Ids and booleans only; no answer text or attachments.
- toolCall.question.queuedLabel and toolCall.question.queuedHint added to all seven locales (en, es, fr, he, ja, ru, zh-Hans).
Edge Cases / Platform Considerations
- false, so the queued banner renders instead of a misleading disabled radio list.
- question.replied event: the optimistic clear has already removed the v2 entry, so the prompt does not redraw in a stuck state on reconnect.
- activeInterruption, so cross-cutting behavior is preserved.
Validation
- 23/23 UI tests pass under node:test via tsx, including two new test files:
  - packages/ui/src/components/tool-call/question-active.test.ts (4 cases, including the permission-ahead-of-question case).
  - packages/ui/src/stores/question-optimistic-clear.test.ts (3 cases covering the optimistic clear and the rollback on network failure).
- tsc --noEmit clean for packages/ui.
- vite build clean for packages/ui.
Repro (likely the same path as #448)
I do not have a 100% deterministic repro (the timing is racy), but the structural defect that produces the symptom is reachable along these paths:
- activeInterruption points at the permission; the question's tool part still mounts so its options render; the inline block's old gate (which read activeInterruption) disables every input. The user sees options and a missing Submit, visually identical to "loading."
- question.reply, legacy queue clears, network blips before the confirming question.replied event arrives. The v2 entry is still present and rendered; the legacy gate no longer marks it active.
If #448's environment can produce two near-simultaneous questions (parallel subagents) plus any permission interruption, paths 1 and 3 stack and explain "Waiting for earlier responses" with options that cannot be answered.
Coordination with #448
Apologies for the noise in the first iteration of this PR — it included internal task-tracking and evidence artifacts that absolutely should not have been in an upstream review. Force-pushed a clean version: only product code, tests, and i18n now.
Happy to defer to your in-flight investigation on #448 if you would prefer to land a single fix that covers both reports, or to rebase on top of whatever direction you take. The fix here is intentionally narrow (inline-block gating only; permission modal and banner untouched) precisely so it composes with other work in the same area.