codex: switch to an overall wall-clock timeout and remove two no-output false kills (stall→timeout) #14
Merged
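Dogfooding evidence: when codex goes a long time without stdout/stderr output during deep reasoning or a test run, it is misjudged under the rule "silent output = hung" and SIGKILLed with error_kind=stall. The false kills come from two places:
- sdk_codex.rs wait_codex_with_stall_window: the per-codex no-output window kill;
- supervise/spawner.rs wait_with_stall_window: the framework-child no-output window kill, driven by the Department M.spec.stall_window; for consensus decide this in practice means "kill after 2 min of silence".
Changed to a pure timeout plus exit-status check:
- codex is bounded only by an overall wall-clock timeout measured from launch; output no longer extends its life. The opt stall_window is renamed to timeout (no compatibility shim), the default goes 900→3600, kind stall→timeout, the log field STALL_WINDOW→TIMEOUT_SECONDS, and the exit code is still 124.
- spawner.rs drops the framework-child no-output kill and the STALLED/STALL_KILL_PID/LAST_OUTPUT_AGE_MS fields; a framework child runs until it exits on its own, and crash recovery still relies on the delivery lease.
- stdin writes are also covered by the timeout: the prompt write moves to a separate writer thread that runs concurrently with the wait; on timeout, SIGKILL breaks the pipe, which unblocks the writer, and the thread is then joined. This fixes a defect where, with a large prompt and a child that does not read stdin, write_all blocked forever and leaked a permit.
- In consumer.rs, M.spec.stall_window is still used only for the reliable-delivery retry_lease/renew_running and is unchanged.
On the package side, decide/implement no longer pass the lease value as the codex timeout and use a dedicated timeout instead (separate PR).
sshx: thinking (min/struct/del) → implement → review (qual reject: stdin write not covered by the timeout) → fix (writer thread) → re-review (concurrency + quality approve). cargo fmt and cargo test -p fkst-framework are all green (367 passed, 0 failed).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>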
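To make the difference between the two kill policies concrete, here is a minimal model of when each would kill a child. It is only an illustrative sketch, not the code in sdk_codex.rs or spawner.rs: the function names `stall_kill_at` and `timeout_kill_at` are hypothetical, times are offsets from child launch, and the timeout is assumed to be in seconds (as the TIMEOUT_SECONDS log field suggests).

```rust
use std::time::Duration;

/// Old policy (hypothetical model): kill as soon as the gap since the last
/// output (or since launch) exceeds `window`. `output_times` must be sorted.
/// Returns the offset at which the kill would happen, or None if the child
/// reaches `exit_at` without being killed.
fn stall_kill_at(
    output_times: &[Duration],
    exit_at: Duration,
    window: Duration,
) -> Option<Duration> {
    let mut last = Duration::ZERO;
    for &t in output_times {
        if t > exit_at {
            break; // the child has already exited on its own
        }
        if t.saturating_sub(last) > window {
            return Some(last + window); // silence outlasted the window
        }
        last = t; // every output event extends the child's life
    }
    if exit_at.saturating_sub(last) > window {
        Some(last + window)
    } else {
        None
    }
}

/// New policy (hypothetical model): output is ignored; the child is killed
/// only if it is still running `timeout` after launch.
fn timeout_kill_at(exit_at: Duration, timeout: Duration) -> Option<Duration> {
    if exit_at > timeout {
        Some(timeout)
    } else {
        None
    }
}
```

Under this model, a child that works silently for 10 minutes is killed by a 2-minute stall window but survives a 3600-second timeout, while a child that prints once a minute for two hours is never killed by the stall window but is stopped by the timeout.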
Motivation
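Dogfooding evidence: when the codex child process produces no stdout/stderr output for a long time, for example during deep reasoning or a test run, it is misjudged under the rule "silent output = hung" and SIGKILLed (error_kind="stall"). The false kills come from two places:
- sdk_codex.rs wait_codex_with_stall_window: the per-codex no-output window kill;
- supervise/spawner.rs wait_with_stall_window: the framework-child no-output window kill, driven by the Department M.spec.stall_window. For consensus decide this in practice means "kill after 2 min of silence", which is very likely a major source of consensus flakiness / churn.
Changes (pure timeout + exit-status check)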
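- sdk_codex.rs: codex is bounded only by an overall wall-clock timeout measured from launch; output no longer extends its life. The opt stall_window→timeout (no compat shim), default 900→3600, kind stall→timeout, log field STALL_WINDOW→TIMEOUT_SECONDS, exit code still 124.
- supervise/spawner.rs: removes the framework-child no-output kill along with STALLED/STALL_KILL_PID/LAST_OUTPUT_AGE_MS; a framework child runs until it exits on its own, and crash recovery still relies on the delivery lease.
- stdin writes are also covered by the timeout: the prompt write moves to a separate writer thread that runs concurrently with the wait; on timeout, SIGKILL breaks the pipe, which unblocks the writer before it is joined. This fixes the defect where, with a large prompt and a child that does not read stdin, write_all blocked forever and leaked a codex permit.
- consumer.rs: M.spec.stall_window is still used only for the reliable-delivery retry_lease/renew_running (the delivery lease) and is unchanged.
Test evidence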
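- cargo fmt: PASS
- cargo test -p fkst-framework: 367 passed, 0 failed
- spawn_codex_sync_stdout/stderr_activity_does_not_extend_timeout (continuous output does not extend the deadline), spawn_codex_sync_silent_child_exits_before_timeout_successfully (a silent but working child is no longer killed by mistake), spawn_codex_sync_silent_child_returns_timeout_failure/..._kills_timed_out_child_process_group (a timed-out child is still killed), a stdin-write-bounded-by-timeout regression test for a large prompt with a child that does not read stdin, and a test that a silent framework child runs until it exits on its own.
Review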
sshx: thinking triplet (minimal/structural/delete converged, including the discovery of the second kill in spawner.rs) → implement → review triplet (quality reject: a stdin write not covered by the timeout would hang and leak a permit) → fix (concurrent writer thread) → re-review (concurrency + quality approve).
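The writer-thread fix that came out of the review can be sketched roughly as follows. This is an assumption-laden illustration using only the Rust standard library, not the actual sdk_codex.rs code: `run_with_timeout` and `Outcome` are hypothetical names, the polling interval is arbitrary, and the sketch kills only the direct child, whereas the test names in this PR indicate the real code kills the child's process group.

```rust
use std::io::Write;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Result of a bounded run (hypothetical type for this sketch).
#[derive(Debug)]
enum Outcome {
    Exited(ExitStatus),
    TimedOut,
}

/// Feed `prompt` to the child's stdin and wait for it, bounded by a
/// wall-clock `timeout` measured from the moment this function is called.
/// The child must have been spawned with a piped stdin.
fn run_with_timeout(
    mut child: Child,
    prompt: Vec<u8>,
    timeout: Duration,
) -> std::io::Result<Outcome> {
    let start = Instant::now();
    let mut stdin = child.stdin.take().expect("stdin must be piped");

    // Write on a separate thread: if the child never reads stdin, write_all
    // blocks once the pipe buffer is full, and must not block the wait loop.
    let writer = thread::spawn(move || {
        // After the child is killed the write fails (broken pipe); ignore it.
        let _ = stdin.write_all(&prompt);
        // `stdin` is dropped here, which signals EOF to a child that reads.
    });

    let outcome = loop {
        if let Some(status) = child.try_wait()? {
            break Outcome::Exited(status);
        }
        if start.elapsed() >= timeout {
            // SIGKILL on Unix. Once the child is gone the read end of the
            // pipe closes, which unblocks the writer thread.
            let _ = child.kill();
            child.wait()?;
            break Outcome::TimedOut;
        }
        thread::sleep(Duration::from_millis(20));
    };

    // Join only after the child has exited or been killed, so this cannot hang
    // on a blocked write (assuming no other process holds the pipe open).
    let _ = writer.join();
    Ok(outcome)
}
```

The ordering matters: the join happens after the kill, so a large prompt sent to a child that ignores stdin ends in a timeout result rather than a permanently blocked caller. If the child had spawned grandchildren that inherited the pipe, killing only the direct child would not close the read end, which is one reason to kill the whole process group instead.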
Follow-up
In fkst-packages, consensus/decide and github-devloop/implement will no longer pass M.spec.stall_window (the lease value) as the codex timeout and will pass a dedicated timeout instead (separate PR). ⟦AI:FKST⟧
🤖 Generated with Claude Code