codex: switch to an overall wall-clock timeout, remove two no-output false kills (stall→timeout) #14

Merged
loning merged 1 commit into dev from fix/codex-overall-timeout
Jun 9, 2026

@loning loning commented Jun 9, 2026


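Motivation

Dogfooding evidence: when a codex child process produces no stdout/stderr for a long time (deep reasoning, running tests, and so on), it is misjudged as hung under the rule "silent output = stuck" and killed with SIGKILL (error_kind="stall"). The false kill happens in two places:

  1. sdk_codex.rs wait_codex_with_stall_window: the per-codex no-output window kill;
  2. supervise/spawner.rs wait_with_stall_window: the framework-child no-output window kill, driven by the Department M.spec.stall_window. In practice consensus decide was "kill after 2 min of silence", which is very likely a major source of consensus flakiness / churn.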
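
The difference between the two kill policies can be sketched as two deadline computations. This is a minimal illustration, not the framework's code: the function names `stall_deadline`, `wall_clock_deadline`, and `should_kill` are hypothetical, and the 120 s / 3600 s figures used in the note below are taken from the "2 min" and the new default mentioned in this PR, assuming the default is in seconds.

```rust
use std::time::{Duration, Instant};

/// Stall-window policy (the behaviour being removed, as a sketch):
/// the deadline slides forward every time the child produces output.
fn stall_deadline(last_output: Instant, stall_window: Duration) -> Instant {
    last_output + stall_window
}

/// Wall-clock policy (the behaviour this PR moves to, as a sketch):
/// the deadline is fixed when the child is spawned and never moves.
fn wall_clock_deadline(started: Instant, timeout: Duration) -> Instant {
    started + timeout
}

/// A child that is still running at `now` is killed once the deadline has passed.
fn should_kill(now: Instant, deadline: Instant) -> bool {
    now >= deadline
}
```

Under these assumptions, a child that stays silent for 10 minutes is killed by a 120 s stall window but survives a 3600 s wall-clock timeout, while a child that keeps printing past the 3600 s mark survives the stall window indefinitely but is killed by the wall-clock timeout.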

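Changes (pure timeout + exit check)

  • sdk_codex.rs: codex is bounded only by an overall wall-clock timeout measured from spawn; output no longer extends its life. The opt stall_window is renamed to timeout (no compat shim), the default goes 900→3600, the kind stall→timeout, the log field STALL_WINDOW→TIMEOUT_SECONDS, and the exit code remains 124.
  • supervise/spawner.rs: removes the framework-child no-output kill along with STALLED/STALL_KILL_PID/LAST_OUTPUT_AGE_MS; a framework child runs until it exits naturally, and crash recovery still relies on the delivery lease.
  • stdin writes are covered by the timeout: the prompt write moves to a dedicated writer thread that runs concurrently with wait; on timeout, SIGKILL breaks the pipe, which unblocks the writer, and the writer is then joined. This fixes a defect where a large prompt (larger than the pipe buffer) plus a child that never reads stdin left write_all blocked forever and leaked the codex permit.
  • consumer.rs: M.spec.stall_window is still used only for reliable-delivery retry_lease/renew_running (the delivery lease) and is untouched.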
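
The writer-thread arrangement described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in sdk_codex.rs: `run_with_timeout` and `Outcome` are hypothetical names, the 10 ms polling loop is a simplification, and `Child::kill` here signals only the direct child, whereas the PR's tests indicate the real code kills the child's process group. The mapping of a timeout to exit code 124 is not modelled.

```rust
use std::io::Write;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Outcome {
    /// The child exited on its own; carries its exit code if it had one.
    Exited(Option<i32>),
    /// The overall wall-clock timeout elapsed and the child was killed.
    Timeout,
}

/// Spawns `cmd`, feeds `prompt` to its stdin from a separate thread, and
/// bounds the whole run by `timeout` measured from spawn.
fn run_with_timeout(
    mut cmd: Command,
    prompt: Vec<u8>,
    timeout: Duration,
) -> std::io::Result<Outcome> {
    let started = Instant::now();
    let mut child = cmd
        .stdin(Stdio::piped())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;

    // Write the prompt concurrently so a child that never reads stdin cannot
    // block the waiting thread once the pipe buffer fills up.
    let mut stdin = child.stdin.take().expect("stdin was configured as piped");
    let writer = thread::spawn(move || {
        // A BrokenPipe error after the child is killed is expected; ignore it.
        let _ = stdin.write_all(&prompt);
        // `stdin` is dropped here, which signals EOF to the child.
    });

    let outcome = loop {
        if let Some(status) = child.try_wait()? {
            break Outcome::Exited(status.code());
        }
        if started.elapsed() >= timeout {
            // Killing the child closes the read end of the pipe, which
            // unblocks a writer stuck in write_all.
            child.kill()?;
            child.wait()?;
            break Outcome::Timeout;
        }
        thread::sleep(Duration::from_millis(10));
    };

    // Join only after the child is gone, so the join cannot hang.
    let _ = writer.join();
    Ok(outcome)
}
```

Note that output activity plays no part in the loop: only the child's exit or the elapsed time since spawn ends the wait. If grandchildren inherited the stdin pipe, killing just the direct child would not unblock the writer, which is one reason to kill the whole process group instead.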

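Test evidence

  • cargo fmt: PASS
  • cargo test -p fkst-framework: 367 passed, 0 failed
  • Key new/changed tests: spawn_codex_sync_stdout/stderr_activity_does_not_extend_timeout (continuous output does not extend the deadline), spawn_codex_sync_silent_child_exits_before_timeout_successfully (a silent but working child is no longer falsely killed), spawn_codex_sync_silent_child_returns_timeout_failure / ..._kills_timed_out_child_process_group (a timed-out child is still killed), a stdin-write-bounded-by-timeout regression test for a large prompt with a child that never reads stdin, and a test that a silent framework child exits naturally.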

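Review

sshx: thinking triplet (minimal/structural/delete converged, including the discovery of the second kill site in spawner.rs) → implement → review triplet (quality reject: stdin writes were not covered by the timeout, so they could hang and leak the permit) → fix (concurrent writer thread) → re-review (concurrency + quality approve).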

Follow-up

fkst-packages: consensus/decide and github-devloop/implement will stop passing M.spec.stall_window (a lease value) as the codex timeout and pass a dedicated timeout instead (separate PR).

⟦AI:FKST⟧

🤖 Generated with Claude Code

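Dogfooding evidence: when codex goes a long time without stdout/stderr output
during deep reasoning or test runs, it is misjudged as hung ("silent output =
stuck") and SIGKILLed with error_kind=stall. The false kill happens in two places:
- sdk_codex.rs wait_codex_with_stall_window: per-codex no-output window kill;
- supervise/spawner.rs wait_with_stall_window: framework-child no-output window
  kill, driven by the Department M.spec.stall_window; consensus decide was in
  practice "kill after 2 min of silence".

Switch to a pure timeout + exit check:
- codex is bounded only by an overall wall-clock timeout measured from spawn;
  output no longer extends its life. opt stall_window is renamed to timeout
  (no compat shim), default 900→3600, kind stall→timeout, log field
  STALL_WINDOW→TIMEOUT_SECONDS, exit still 124.
- spawner.rs removes the framework-child no-output kill and the STALLED/
  STALL_KILL_PID/LAST_OUTPUT_AGE_MS fields; a framework child runs until it
  exits naturally, and crash recovery still relies on the delivery lease.
- stdin writes are also covered by the timeout: the prompt write moves to a
  dedicated writer thread concurrent with wait; on timeout, SIGKILL breaks the
  pipe, unblocking the writer, which is then joined. Fixes the defect where a
  large prompt + a child that never reads stdin left write_all blocked forever
  and leaked the permit.
- M.spec.stall_window in consumer.rs is still used only for reliable-delivery
  retry_lease/renew_running and is untouched.

On the package side, decide/implement will stop passing the lease value as the
codex timeout and use a dedicated timeout instead (separate PR).

sshx: thinking (min/struct/del) → implement → review (qual reject: stdin writes
not covered by the timeout) → fix (writer thread) → re-review (concurrency +
quality approve). cargo fmt + cargo test -p fkst-framework all green
(367 passed, 0 failed).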

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@loning loning merged commit 670d146 into dev Jun 9, 2026
4 checks passed
@loning loning deleted the fix/codex-overall-timeout branch June 9, 2026 08:39