fix(runtime): drain shell pipes while child runs #7935

Open
Audacity88 wants to merge 1 commit into zeroclaw-labs:master from EyrieCommander:codex/issue-7871-shell-pipe-drain

Conversation

@Audacity88

Collaborator

Summary

  • Base branch: master (all contributions)
  • What changed and why:
    • Start shell stdout/stderr drains immediately after process spawn instead of waiting for the child to exit first.
    • Keep draining after the capture cap so a large-output child cannot block on a full pipe.
    • Bound post-exit drain waiting so descendants that inherit stdout/stderr handles cannot keep the shell tool waiting for EOF indefinitely.
    • Preserve explicit stdout/stderr truncation markers and add regression coverage for large stdout, large stderr, and inherited-handle behavior.
  • Scope boundary: This PR does not change shell command policy, command allowlists, timeout defaults, approval behavior, environment filtering, sandbox selection, or stdin handling.
  • Blast radius: Shell tool execution in zeroclaw-runtime; callers that use the shell tool should see fewer timeout/hang failures for large output or descendant process trees that inherit output pipes.
  • Linked issue(s): Closes #7871 ([Tracker]: shell tool can hang when grandchild processes inherit pipe handles). Related: #6910 (fix(shell): prevent hang when grandchild processes inherit pipe handles).
  • Labels: bug, risk: high, size: M, runtime, tool, tool:shell
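The first two bullets under "What changed and why" can be sketched with the standard library alone. This is not the PR's code: zeroclaw-runtime runs its drains as async tasks, whereas the sketch below uses OS threads, and `CAP`, `Captured`, `drain_capped`, and `run` are hypothetical names chosen for illustration. It assumes a Unix-like host with `sh` and `head` on the PATH.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;

/// Hypothetical capture cap for this sketch (not the runtime's real limit).
const CAP: usize = 64 * 1024;

#[derive(Debug)]
struct Captured {
    bytes: Vec<u8>,
    truncated: bool,
}

/// Reads `reader` until EOF. Stores at most `cap` bytes but keeps consuming
/// past the cap, so the writer never blocks on a full pipe.
fn drain_capped<R: Read>(mut reader: R, cap: usize) -> Captured {
    let mut out = Captured { bytes: Vec::new(), truncated: false };
    let mut chunk = [0u8; 8192];
    loop {
        match reader.read(&mut chunk) {
            // EOF ends the drain; for brevity this sketch also stops on any read error.
            Ok(0) | Err(_) => break,
            Ok(n) => {
                let take = n.min(cap - out.bytes.len());
                out.bytes.extend_from_slice(&chunk[..take]);
                out.truncated |= take < n;
            }
        }
    }
    out
}

/// Spawns `sh -c script` and starts both drains before waiting on the child.
fn run(script: &str) -> std::io::Result<(Captured, Captured)> {
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(script)
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    let stderr = child.stderr.take().expect("stderr was piped");

    // Drains start here, not after wait(): the child can write freely.
    let out = thread::spawn(move || drain_capped(stdout, CAP));
    let err = thread::spawn(move || drain_capped(stderr, CAP));

    child.wait()?;
    // Unbounded joins: this sketch does not handle descendants that keep the pipe open.
    let out = out.join().expect("stdout drain panicked");
    let err = err.join().expect("stderr drain panicked");
    Ok((out, err))
}

fn main() -> std::io::Result<()> {
    let (out, err) = run("head -c 200000 /dev/zero; echo warn >&2")?;
    println!("stdout: {} bytes, truncated: {}", out.bytes.len(), out.truncated);
    println!("stderr: {:?}", String::from_utf8_lossy(&err.bytes));
    Ok(())
}
```

If the drains were started only after `child.wait()`, a child writing more than the pipe buffer holds would block on the write while the parent blocked on the wait.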

Validation Evidence (required)

Local validation is the signal CI cannot replace. Run the full battery and paste literal output (tails, failures, warnings - not "all passed").

cargo fmt --all -- --check
cargo clippy --all-targets -- -D warnings
cargo test

Docs-only changes: replace with markdown lint + link-integrity (scripts/ci/docs_quality_gate.sh). Bootstrap scripts: add bash -n install.sh.

  • Commands run and tail output:
$ git diff --check
# exited 0

$ cargo fmt -p zeroclaw-runtime -- --check
# exited 0

$ cargo test -p zeroclaw-runtime shell_ --lib
test result: ok. 82 passed; 0 failed; 0 ignored; 0 measured; 2207 filtered out; finished in 1.76s

$ cargo clippy -p zeroclaw-runtime --all-targets -- -D warnings
# finished in 15m04s; exited 0

$ cargo clippy --workspace --exclude zeroclaw-desktop --all-targets --features ci-all -- -D warnings
# finished in 20m52s; exited 0

$ cargo nextest run --locked --workspace --exclude zeroclaw-desktop
Finished `test` profile [unoptimized + debuginfo] target(s) in 13m 04s
Nextest run ID f502090d-d58e-4578-94a6-d4e40075f887
Summary [81.264s] 9184 tests run: 9184 passed, 11 skipped
# exited 0
  • Beyond CI - what did you manually verify? Reviewed the shell execution path to confirm stdout and stderr drain tasks are started before child.wait(), that timeout/error paths abort drain tasks, and that successful exits wait only for the bounded post-exit drain window before decoding captured output. The new tests cover large stdout pressure, stdout truncation, stderr truncation, and a Unix grandchild that keeps the pipe handle open after the main shell process exits.
  • If any command was intentionally skipped, why: None for the local high-risk validation plan. The focused fmt command was crate-scoped because only zeroclaw-runtime changed; the workspace-shaped clippy and CI-shaped nextest commands above cover the broader local readiness layer.

Security & Privacy Impact (required)

Yes/No for each. Answer any Yes with a 1-2 sentence explanation.

  • New permissions, capabilities, or file system access scope? (No)
  • New external network calls? (No)
  • Secrets / tokens / credentials handling changed? (No)
  • PII, real identities, or personal data in diff, tests, fixtures, or docs? (No)
  • If any Yes, describe the risk and mitigation: None.

Compatibility (required)

  • Backward compatible? (Yes)
  • Config / env / CLI surface changed? (No)
  • If No or Yes to either: No upgrade steps required.

Rollback (required for risk: medium and risk: high)

Low-risk PRs: git revert <sha> is the plan unless otherwise noted.

Medium/high-risk PRs must fill:

  • Fast rollback command/path: git revert <merge-commit-sha>
  • Feature flags or config toggles: None.
  • Observable failure symptoms: Shell tool invocations with large stdout/stderr may time out or hang again; commands that spawn descendants may return to waiting for inherited stdout/stderr handles to close before the tool result is produced.

Supersede Attribution (required only when Supersedes # is used)

  • Superseded PRs + authors (#<pr> by @<author>, one per line): N/A
  • Scope materially carried forward: N/A
  • Co-authored-by trailers added in commit messages for incorporated contributors? (No)
  • If No, why (inspiration-only, no direct code/design carry-over): No Supersedes # relationship is used. PR #6910 (fix(shell): prevent hang when grandchild processes inherit pipe handles) identified the same bug class, but this PR is a fresh, narrow implementation.

@github-actions (bot) added labels on Jun 18, 2026: runtime (Auto scope: src/runtime/** changed), tool:shell
@Audacity88 added labels on Jun 18, 2026: bug (Something isn't working), risk: high (Auto risk: security/runtime/gateway/tools/workflows), size: M (Auto size: 251-500 non-doc changed lines), tool (Auto scope: src/tools/** changed)
@Audacity88 added this to the v0.8.1 milestone on Jun 18, 2026
@Audacity88 marked this pull request as ready for review on June 18, 2026 22:01
@Audacity88 requested a review from singlerider on June 18, 2026 22:02
@Audacity88 modified the milestones: v0.8.1, v0.8.2, v0.8.3 on Jun 19, 2026

@WareWolf-MoonWall left a comment

Collaborator

PR #7935 - fix(runtime): drain shell pipes while child runs

Reviewer: WareWolf-MoonWall
Head: 09f89f5
CI: all green
Verdict: approve
Milestone: v0.8.3


🟢 Drain-while-running architecture addresses both failure modes correctly

Starting stdout/stderr drain tasks immediately after spawn, rather than after child.wait(), means a pipe-buffer-full condition can never stall the child process regardless of output volume. The bounded POST_EXIT_DRAIN window then handles the inherited-handle case without waiting indefinitely. Both problems are solved structurally, not papered over.

🟢 DrainOutput.truncated is set correctly in all overflow paths

The flag is set on both mid-read overflow (take < n) and on entry when the cap is already exhausted (remaining == 0). The drain task keeps running after hitting the cap, so the child never blocks and the caller gets accurate truncation reporting in both cases.
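A minimal sketch of the two overflow paths described above, assuming a per-chunk helper. Only `DrainOutput`, its `truncated` flag, and the `take < n` / `remaining == 0` conditions come from this review; the `bytes` field, the `store_capped` name, and its signature are hypothetical.

```rust
#[derive(Debug, Default)]
struct DrainOutput {
    bytes: Vec<u8>,   // hypothetical field name
    truncated: bool,
}

/// Stores as much of `chunk` as still fits under `cap` and records whether
/// anything was dropped. The caller keeps reading either way.
fn store_capped(out: &mut DrainOutput, chunk: &[u8], cap: usize) {
    let n = chunk.len();
    if n == 0 {
        return; // nothing was read, so nothing can be dropped
    }
    let remaining = cap.saturating_sub(out.bytes.len());
    if remaining == 0 {
        // Cap already exhausted on entry: drop the whole chunk.
        out.truncated = true;
        return;
    }
    // Mid-read overflow: keep the prefix that fits, flag the rest as dropped.
    let take = remaining.min(n);
    out.bytes.extend_from_slice(&chunk[..take]);
    out.truncated |= take < n;
}

fn main() {
    let mut out = DrainOutput::default();
    for chunk in [&b"abcd"[..], &b"efgh"[..], &b"ijkl"[..]] {
        store_capped(&mut out, chunk, 6);
    }
    println!("{:?}", out);
}
```

A chunk that lands exactly on the cap leaves `truncated` false; the flag flips only when bytes are actually discarded.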

🟢 Async mutex usage is sound

Arc<std::sync::Mutex<DrainOutput>> is the right choice here: the lock is held only for brief synchronous pushes with no .await inside the critical section, so there is no risk of blocking the executor or deadlocking across await points.

🟢 Grandchild regression test closes the loop on inherited handles

shell_keeps_output_when_grandchild_holds_pipe_open is exactly the test that would have caught the original bug and will catch any future regression on that code path. Its inclusion makes the fix self-evidencing.

🔵 POST_EXIT_DRAIN timeout value warrants a doc comment

The constant governs a real operational tradeoff: too short and legitimate grandchildren producing output near exit lose data; too long and hung grandchildren delay tool results. If the value isn't already explained at its definition site, a brief comment stating the rationale (e.g. "N ms is sufficient for typical shell epilogue writes; grandchildren that outlive this window are treated as detached") would help future tuners make an informed adjustment.


Mechanically clean, all four regression tests are load-bearing, group_guard.disarm() placement and child.start_kill() on timeout are both correct. Approving.

@singlerider left a comment

Collaborator

I read the single-file diff at head 09f89f5 and traced the drain lifecycle from spawn through exit, timeout, and the inherited-handle case. @WareWolf-MoonWall approved at this same head; I concur and have no new blockers. CI is green.

🟢 What looks good: concurrent draining is the real deadlock fix

Starting spawn_drain for stdout and stderr immediately after spawn, rather than reading only after child.wait() returns, is exactly what #7871 needed. A child that produces more than a pipe buffer of output blocks on the full pipe while the parent blocks on wait(); classic deadlock. Now the drains run concurrently and drain_capped_into keeps reading past the capture cap (it stops storing once cap is hit but continues consuming), so a verbose child can never wedge on a full pipe regardless of how much it writes.

🟢 What looks good: bounded post-exit drain handles inherited handles

finish_drain waits on the drain task with POST_EXIT_DRAIN and aborts on timeout, which is the right answer for a descendant that inherited the stdout/stderr fds: the pipe never sees EOF while the grandchild holds it open, so an unbounded wait would hang the shell tool forever. shell_keeps_output_when_grandchild_holds_pipe_open pins that the tool still returns the captured output instead of hanging. Reusing the existing POST_EXIT_DRAIN bound rather than inventing a new timeout keeps the behavior consistent with the prior post-exit path.
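The bounded wait can be illustrated with a std-only sketch, under stated assumptions. It is not the PR's `finish_drain`: the runtime awaits an async task and aborts it on timeout, whereas this sketch uses a thread plus `recv_timeout` and simply stops waiting (a std thread cannot be aborted, so it lingers until the descendant closes the pipe). `run_bounded` is a hypothetical name; the 250 ms value matches the constant quoted later in this thread. Requires a Unix-like host with `sh`.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

const POST_EXIT_DRAIN: Duration = Duration::from_millis(250);

/// Runs `sh -c script`, drains stdout while the child runs, and after the
/// child exits waits at most POST_EXIT_DRAIN for the pipe to reach EOF.
fn run_bounded(script: &str) -> std::io::Result<Vec<u8>> {
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(script)
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::null())
        .spawn()?;
    let mut stdout = child.stdout.take().expect("stdout was piped");

    // Shared buffer: whatever was captured stays readable even if the
    // drain never observes EOF.
    let captured = Arc::new(Mutex::new(Vec::<u8>::new()));
    let sink = Arc::clone(&captured);
    let (done_tx, done_rx) = mpsc::channel::<()>();

    thread::spawn(move || {
        let mut chunk = [0u8; 4096];
        while let Ok(n) = stdout.read(&mut chunk) {
            if n == 0 {
                break; // EOF: every holder of the write end has closed it
            }
            // Lock held only for the push, never across a blocking read.
            sink.lock().unwrap().extend_from_slice(&chunk[..n]);
        }
        let _ = done_tx.send(());
    });

    child.wait()?;
    // A grandchild that inherited stdout keeps the pipe open, so EOF may
    // never arrive. Give up after the bound and return what we have.
    let _ = done_rx.recv_timeout(POST_EXIT_DRAIN);
    let bytes = captured.lock().unwrap().clone();
    Ok(bytes)
}

fn main() -> std::io::Result<()> {
    // The backgrounded subshell inherits stdout and outlives `sh`.
    let out = run_bounded("printf done; (sleep 2) &")?;
    println!("{:?}", String::from_utf8_lossy(&out));
    Ok(())
}
```

With an unbounded join in place of `recv_timeout`, the call would not return until the backgrounded `sleep` exits and releases the pipe.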

🟢 What looks good: capture accounting and attribution are correct

DrainOutput.truncated is set both when a chunk is partially taken (take < n) and when the cap is already reached, and the truncation markers are appended on either truncated or a length overflow, so the "[output truncated at 1MB]" / "[stderr truncated at 1MB]" markers are preserved. The drain tasks are launched via zeroclaw_spawn::spawn!, so they inherit the turn's attribution span rather than logging un-attributed. shell_marks_stdout_truncated_after_limit / ..._stderr_... cover both marker paths.

Approving.

@tidux left a comment

Collaborator

I checked out 09f89f5 locally, walked the new drain lifecycle against the consumers in crates/zeroclaw-runtime/src/tools/shell.rs, and ran the focused shell test slice. @WareWolf-MoonWall approved at this head on 2026-06-19, @singlerider approved on 2026-06-22, and I concur: no new blockers.

✅ Resolved: drain-while-running closes both failure modes from #7871

spawn_drain returns a DrainHandle that owns a zeroclaw_spawn::spawn!-launched task plus a shared Arc<Mutex<DrainOutput>> (shell.rs:481-499). The task starts reading immediately after spawn, before child.wait().await, so a pipe-buffer-full producer cannot wedge the child regardless of output volume. After exit, finish_drain waits up to POST_EXIT_DRAIN and aborts on timeout, which is exactly the inherited-grandchild-handle answer #7871 acceptance criteria call for. shell_drains_large_stdout_while_child_runs (200_000 bytes) and shell_keeps_output_when_grandchild_holds_pipe_open (printf done; (sleep 1) &) pin both shapes as regressions.

✅ Resolved: truncation accounting is honest under the new boundary

The cap logic in drain_capped_into (shell.rs:531-557) is a real improvement on the prior cap.saturating_sub(buf.len()).max(1) path, which had a subtle "store one extra byte at cap" overshoot. The new flow stops storing the moment remaining == 0 and flags truncated = true; partial-take on the cap-crossing iteration also sets truncated |= take < n. The combined check in the success branch (if stdout_capture.truncated || stdout.len() > MAX_OUTPUT_BYTES) covers both the drain-side flag and the post-decode length guard, so the truncation markers are preserved in both overflow shapes. shell_marks_stdout_truncated_after_limit / ..._stderr_... pin both marker paths.
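A hedged sketch of the combined check in the success branch. It assumes the captured bytes are decoded lossily, which is one way a decoded string could exceed the byte cap even when the drain stored no more than the cap: each invalid byte becomes a 3-byte U+FFFD. The `render` helper, its parameters, and the newline before the marker are hypothetical; only the marker text and the `truncated || len > MAX_OUTPUT_BYTES` shape come from this thread.

```rust
/// Decodes captured bytes and appends `marker` when either the drain
/// reported truncation or the decoded text exceeds `limit` bytes.
/// This sketch only appends the marker; it does not trim the text.
fn render(bytes: &[u8], drain_truncated: bool, limit: usize, marker: &str) -> String {
    let mut text = String::from_utf8_lossy(bytes).into_owned();
    if drain_truncated || text.len() > limit {
        text.push('\n'); // separator is an assumption of this sketch
        text.push_str(marker);
    }
    text
}

fn main() {
    let marker = "[output truncated at 1MB]";
    // Drain-side flag set: marker appended even though the text is short.
    println!("{}", render(b"partial", true, 1024, marker));
    // Four invalid bytes decode to four U+FFFD characters (12 bytes),
    // which exceeds a limit of 8 without any drain-side truncation.
    println!("{}", render(&[0xFF; 4], false, 8, marker));
}
```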

🟢 Cleanup discipline on every exit branch

Ok(Err(e)) and Err(_) both call tokio::join!(abort_drain(stdout_drain), abort_drain(stderr_drain)) so detached drain tasks cannot outlive the tool invocation. The timeout branch additionally calls child.start_kill() before aborting, and the ChildGroupGuard drop still SIGKILLs the process group on every non-success path because group_guard.disarm() is only called on Ok(Ok(status)). Cancel/timeout/error semantics are unchanged from the prior implementation; only the read ordering moved.
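The "disarm only on success" pattern can be sketched in isolation. This is an illustrative stand-in, not `ChildGroupGuard`: the real guard kills the process group on drop, while the hypothetical `KillOnDrop` below merely records that cleanup fired so the control flow can be observed.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a kill-on-drop guard. Drop fires the cleanup
/// unless `disarm` was called first.
struct KillOnDrop {
    armed: bool,
    killed: Rc<Cell<bool>>,
}

impl KillOnDrop {
    fn new(killed: Rc<Cell<bool>>) -> Self {
        KillOnDrop { armed: true, killed }
    }

    fn disarm(&mut self) {
        self.armed = false;
    }
}

impl Drop for KillOnDrop {
    fn drop(&mut self) {
        if self.armed {
            // The real guard would signal the process group here.
            self.killed.set(true);
        }
    }
}

/// Every early return leaves the guard armed; only the success path disarms.
fn finish(outcome: Result<i32, String>, killed: Rc<Cell<bool>>) -> Result<i32, String> {
    let mut guard = KillOnDrop::new(killed);
    let status = outcome?; // error or timeout: guard drops while still armed
    guard.disarm();
    Ok(status)
}

fn main() {
    let killed = Rc::new(Cell::new(false));
    let _ = finish(Err("timed out".to_string()), Rc::clone(&killed));
    println!("cleanup fired: {}", killed.get());
}
```

Placing `disarm()` after the last fallible step means a newly added error branch gets cleanup by default.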

🟢 Spawn macro is the sanctioned one

The drain tasks go through zeroclaw_spawn::spawn!, so they inherit the caller's attribution_span and emit runtime.task.spawn lifecycle records exactly like the rest of the workspace. A bare tokio::spawn here would have orphaned the spans for two long-running tasks per shell invocation. That is an easy mistake to make in a high-risk runtime file, and it didn't happen here.

🟢 Test design proves the contract, not the implementation

The four new tests drive the public ShellTool::execute API with real subprocesses and assert on the user-observable shape (result.output.len() == 200_000, marker suffix on result.output / result.error, result.output.contains("done")). They would fail if anyone reverted to read-after-wait or stripped the truncation markers, which is exactly the contract that needs pinning.

🔵 Suggestion: second on @WareWolf-MoonWall's POST_EXIT_DRAIN doc-comment ask

Confirmed at shell.rs:14 the constant is still a bare const POST_EXIT_DRAIN: Duration = Duration::from_millis(250); with no /// comment. Wolf's reasoning holds: the value is a real operational tradeoff between losing late grandchild output and waiting on detached descendants. A one-line /// stating "bounded so descendants that inherit pipe handles cannot stall the tool indefinitely; values larger than ~1s risk noticeable per-invocation latency for normal commands" would help the next tuner. Non-blocking; happy to land it in a follow-up if the author would rather not amend before merge.
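One possible shape for that comment, as a sketch only. The wording paraphrases the two reviewers' rationale and is not text from the PR; the constant and its value are as quoted above.

```rust
use std::time::Duration;

/// Upper bound on how long the shell tool waits for stdout/stderr to reach
/// EOF after the child has exited.
///
/// Bounded so descendants that inherit the pipe handles cannot stall the
/// tool indefinitely. Too short loses output written near exit; values
/// larger than about 1s risk noticeable per-invocation latency.
const POST_EXIT_DRAIN: Duration = Duration::from_millis(250);

fn main() {
    println!("post-exit drain window: {:?}", POST_EXIT_DRAIN);
}
```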


Local verification at 09f89f5 (file checked out from pr-7935 over feat/tui-beautification worktree, restored after):

  • cargo fmt -p zeroclaw-runtime -- --check clean
  • cargo clippy -p zeroclaw-runtime --all-targets -- -D warnings clean (1m05s)
  • cargo test -p zeroclaw-runtime --lib shell_ → 82 passed; 0 failed; 0 ignored; 2221 filtered; finished in 0.43s, matching the PR body's 82 passed / 0 failed shape
  • git merge-tree --write-tree pr-7935 upstream/master → clean tree, no conflicts against current master (8df6b8a)
  • All four new tests (shell_drains_large_stdout_while_child_runs, shell_marks_stdout_truncated_after_limit, shell_marks_stderr_truncated_after_limit, shell_keeps_output_when_grandchild_holds_pipe_open) pass under #[cfg(unix)]

Cross-checked the consumer audit on the Ok(Err(e)) path: the Failed to execute command shape is preserved byte-for-byte, only the new abort_drain join was added before constructing ToolResult. The with_ephemeral_workspace_warning injection on both result.output and result.error is unchanged. No behavior regression for the unhappy paths.

Single GPG-unsigned commit, matches @Audacity88's existing pattern on this identity (#8068, #8184 same shape this session). Closes #7871 is the right tracker β€” the issue is itself v0.8.3-scoped with priority:p1, and the acceptance criteria there map 1:1 to what shipped. PR is already listed in #8071 ("[Tracker]: v0.8.3 runtime, agent, tools, and execution stability") at line 62 of the auto-rebuilt body, so no tracker edit needed.

Approving. Ready to merge at maintainer discretion.
