fix(runtime): drain shell pipes while child runs#7935
Conversation
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
PR #7935 β fix(runtime): drain shell pipes while child runs
Reviewer: WareWolf-MoonWall
Head: 09f89f5
CI: all green
Verdict: approve
Milestone: v0.8.3
π’ Drain-while-running architecture addresses both failure modes correctly
Starting stdout/stderr drain tasks immediately after spawn β rather than after child.wait() β means a pipe-buffer-full condition can never stall the child process regardless of output volume. The bounded POST_EXIT_DRAIN window then handles the inherited-handle case without waiting indefinitely. Both problems are solved structurally, not papered over.
π’ DrainOutput.truncated is set correctly in all overflow paths
The flag is set on both mid-read overflow (take < n) and on entry when the cap is already exhausted (remaining == 0). The drain task keeps running after hitting the cap β the child never blocks, and the caller gets accurate truncation reporting in both cases.
π’ Async mutex usage is sound
Arc<std::sync::Mutex<DrainOutput>> is the right choice here: the lock is held only for brief synchronous pushes with no .await inside the critical section, so there is no risk of blocking the executor or deadlocking across await points.
π’ Grandchild regression test closes the loop on inherited handles
shell_keeps_output_when_grandchild_holds_pipe_open is exactly the test that would have caught the original bug and will catch any future regression on that code path. Its inclusion makes the fix self-evidencing.
π΅ POST_EXIT_DRAIN timeout value warrants a doc comment
The constant governs a real operational tradeoff: too short and legitimate grandchildren producing output near exit lose data; too long and hung grandchildren delay tool results. If the value isn't already explained at its definition site, a brief comment stating the rationale (e.g. "N ms is sufficient for typical shell epilogue writes; grandchildren that outlive this window are treated as detached") would help future tuners make an informed adjustment.
Mechanically clean, all four regression tests are load-bearing, group_guard.disarm() placement and child.start_kill() on timeout are both correct. Approving.
There was a problem hiding this comment.
I read the single-file diff at head 09f89f5 and traced the drain lifecycle from spawn through exit, timeout, and the inherited-handle case. @WareWolf-MoonWall approved at this same head; I concur and have no new blockers. CI is green.
π’ What looks good: concurrent draining is the real deadlock fix
Starting spawn_drain for stdout and stderr immediately after spawn, rather than reading only after child.wait() returns, is exactly what #7871 needed. A child that produces more than a pipe buffer of output blocks on the full pipe while the parent blocks on wait(); classic deadlock. Now the drains run concurrently and drain_capped_into keeps reading past the capture cap (it stops storing once cap is hit but continues consuming), so a verbose child can never wedge on a full pipe regardless of how much it writes.
π’ What looks good: bounded post-exit drain handles inherited handles
finish_drain waits on the drain task with POST_EXIT_DRAIN and aborts on timeout, which is the right answer for a descendant that inherited the stdout/stderr fds: the pipe never sees EOF while the grandchild holds it open, so an unbounded wait would hang the shell tool forever. shell_keeps_output_when_grandchild_holds_pipe_open pins that the tool still returns the captured output instead of hanging. Reusing the existing POST_EXIT_DRAIN bound rather than inventing a new timeout keeps the behavior consistent with the prior post-exit path.
π’ What looks good: capture accounting and attribution are correct
DrainOutput.truncated is set both when a chunk is partially taken (take < n) and when the cap is already reached, and the truncation markers are appended on either truncated or a length overflow, so the "[output truncated at 1MB]" / "[stderr truncated at 1MB]" markers are preserved. The drain tasks are launched via zeroclaw_spawn::spawn!, so they inherit the turn's attribution span rather than logging un-attributed. shell_marks_stdout_truncated_after_limit / ..._stderr_... cover both marker paths.
Approving.
tidux
left a comment
There was a problem hiding this comment.
I checked out 09f89f5 locally, walked the new drain lifecycle against the consumers in crates/zeroclaw-runtime/src/tools/shell.rs, and ran the focused shell test slice. @WareWolf-MoonWall approved at this head on 2026-06-19, @singlerider approved on 2026-06-22, and I concur β no new blockers.
β Resolved β drain-while-running closes both failure modes from #7871
spawn_drain returns a DrainHandle that owns a zeroclaw_spawn::spawn!-launched task plus a shared Arc<Mutex<DrainOutput>> (shell.rs:481-499). The task starts reading immediately after spawn, before child.wait().await, so a pipe-buffer-full producer cannot wedge the child regardless of output volume. After exit, finish_drain waits up to POST_EXIT_DRAIN and aborts on timeout, which is exactly the inherited-grandchild-handle answer #7871 acceptance criteria call for. shell_drains_large_stdout_while_child_runs (200_000 bytes) and shell_keeps_output_when_grandchild_holds_pipe_open (printf done; (sleep 1) &) pin both shapes as regressions.
β Resolved β truncation accounting is honest under the new boundary
The cap logic in drain_capped_into (shell.rs:531-557) is a real improvement on the prior cap.saturating_sub(buf.len()).max(1) path, which had a subtle "store one extra byte at cap" overshoot. The new flow stops storing the moment remaining == 0 and flags truncated = true; partial-take on the cap-crossing iteration also sets truncated |= take < n. The combined check in the success branch (if stdout_capture.truncated || stdout.len() > MAX_OUTPUT_BYTES) covers both the drain-side flag and the post-decode length guard, so the truncation markers are preserved in both overflow shapes. shell_marks_stdout_truncated_after_limit / ..._stderr_... pin both marker paths.
π’ Cleanup discipline on every exit branch
Ok(Err(e)) and Err(_) both call tokio::join!(abort_drain(stdout_drain), abort_drain(stderr_drain)) so detached drain tasks cannot outlive the tool invocation. The timeout branch additionally calls child.start_kill() before aborting, and the ChildGroupGuard drop still SIGKILLs the process group on every non-success path because group_guard.disarm() is only called on Ok(Ok(status)). Cancel/timeout/error semantics are unchanged from the prior implementation; only the read ordering moved.
π’ Spawn macro is the sanctioned one
The drain tasks go through zeroclaw_spawn::spawn!, so they inherit the caller's attribution_span and emit runtime.task.spawn lifecycle records exactly like the rest of the workspace. A bare tokio::spawn here would have orphaned the spans for two long-running tasks per shell invocation β easy mistake to make in a high-risk runtime file, didn't happen here.
π’ Test design proves the contract, not the implementation
The four new tests drive the public ShellTool::execute API with real subprocesses and assert on the user-observable shape (result.output.len() == 200_000, marker suffix on result.output / result.error, result.output.contains("done")). They would fail if anyone reverted to read-after-wait or stripped the truncation markers, which is exactly the contract that needs pinning.
π΅ Suggestion β second on @WareWolf-MoonWall's POST_EXIT_DRAIN doc-comment ask
Confirmed at shell.rs:14 the constant is still a bare const POST_EXIT_DRAIN: Duration = Duration::from_millis(250); with no /// comment. Wolf's reasoning holds β the value is a real operational tradeoff between losing late grandchild output and waiting on detached descendants. A one-line /// stating "bounded so descendants that inherit pipe handles cannot stall the tool indefinitely; values larger than ~1s risk noticeable per-invocation latency for normal commands" would help the next tuner. Non-blocking; happy to land it in a follow-up if the author would rather not amend before merge.
Local verification at 09f89f5 (file checked out from pr-7935 over feat/tui-beautification worktree, restored after):
cargo fmt -p zeroclaw-runtime -- --checkcleancargo clippy -p zeroclaw-runtime --all-targets -- -D warningsclean (1m05s)cargo test -p zeroclaw-runtime --lib shell_β 82 passed; 0 failed; 0 ignored; 2221 filtered; finished in 0.43s β matches the PR body's82 passed / 0 failedshapegit merge-tree --write-tree pr-7935 upstream/masterβ clean tree, no conflicts against current master (8df6b8a)- All four new tests (
shell_drains_large_stdout_while_child_runs,shell_marks_stdout_truncated_after_limit,shell_marks_stderr_truncated_after_limit,shell_keeps_output_when_grandchild_holds_pipe_open) pass under#[cfg(unix)]
Cross-checked the consumer audit on the Ok(Err(e)) path: the Failed to execute command shape is preserved byte-for-byte, only the new abort_drain join was added before constructing ToolResult. The with_ephemeral_workspace_warning injection on both result.output and result.error is unchanged. No behavior regression for the unhappy paths.
Single GPG-unsigned commit, matches @Audacity88's existing pattern on this identity (#8068, #8184 same shape this session). Closes #7871 is the right tracker β the issue is itself v0.8.3-scoped with priority:p1, and the acceptance criteria there map 1:1 to what shipped. PR is already listed in #8071 ("[Tracker]: v0.8.3 runtime, agent, tools, and execution stability") at line 62 of the auto-rebuilt body, so no tracker edit needed.
Approving. Ready to merge at maintainer discretion.
Summary
master(all contributions)zeroclaw-runtime; callers that use the shell tool should see fewer timeout/hang failures for large output or descendant process trees that inherit output pipes.bug,risk: high,size: M,runtime,tool,tool:shellValidation Evidence (required)
Local validation is the signal CI cannot replace. Run the full battery and paste literal output (tails, failures, warnings - not "all passed").
cargo fmt --all -- --check cargo clippy --all-targets -- -D warnings cargo testDocs-only changes: replace with markdown lint + link-integrity (
scripts/ci/docs_quality_gate.sh). Bootstrap scripts: addbash -n install.sh.child.wait(), that timeout/error paths abort drain tasks, and that successful exits wait only for the bounded post-exit drain window before decoding captured output. The new tests cover large stdout pressure, stdout truncation, stderr truncation, and a Unix grandchild that keeps the pipe handle open after the main shell process exits.fmtcommand was crate-scoped because onlyzeroclaw-runtimechanged; the workspace-shaped clippy and CI-shaped nextest commands above cover the broader local readiness layer.Security & Privacy Impact (required)
Yes/No for each. Answer any
Yeswith a 1-2 sentence explanation.No)No)No)No)Yes, describe the risk and mitigation: None.Compatibility (required)
Yes)No)NoorYesto either: No upgrade steps required.Rollback (required for
risk: mediumandrisk: high)Low-risk PRs:
git revert <sha>is the plan unless otherwise noted.Medium/high-risk PRs must fill:
git revert <merge-commit-sha>Supersede Attribution (required only when
Supersedes #is used)#<pr> by @<author>, one per line): N/ACo-authored-bytrailers added in commit messages for incorporated contributors? (No)No, why (inspiration-only, no direct code/design carry-over): NoSupersedes #relationship is used. PR fix(shell): prevent hang when grandchild processes inherit pipe handlesΒ #6910 identified the same bug class, but this PR is a fresh narrow implementation.