fix(lifecycle): drain StreamManager goroutines in tests#1227

Open
jcfs wants to merge 1 commit into main from feature/fix-lifecycle-goleak-xdu

@jcfs jcfs commented Jun 1, 2026

Summary

  • Fixes CI-only flake where the internal/agent/runtime/lifecycle goleak.VerifyTestMain would fail on slow runners with leaked StreamManager.connectWorkspaceStream + WorkspaceStream.writeLoop / read loop goroutines, even though every individual test passed (see PR feat: full GitLab integration — parity with GitHub #1120 run 26745440163 job 78819616867).
  • Makes agentctl.Client.Close() an absolute drain barrier — tracks every stream goroutine it spawns and waits for them, plus a closed flag that rejects new StreamUpdates / StreamWorkspace dials so a Close racing an in-flight dial cannot strand a fresh WS connection. The restart path in manager_interaction.go switches to per-stream Close helpers so it can keep reusing the client after teardown.
  • Makes StreamManager.Wait() an absolute drain barrier — closes a new internal waitCh that the retry backoff and the connected <-ws.Done() / <-stop> select observe, so drain doesn't depend on the caller closing the external stopCh first. streamContext carries both stops through stopChannelContext so in-flight WS dials cancel on either signal.
  • Local hardware doesn't reliably reproduce the leak. Added a make test-lifecycle-goleak LIFECYCLE_GOLEAK_COUNT=N target (defaults to 20) as the repro hook for the flake, plus a TestClientClose_DrainsStreamGoroutines regression test in the agentctl package.

Test plan

  • make test-lifecycle-goleak LIFECYCLE_GOLEAK_COUNT=20 — 20/20 clean under -race
  • go test -race -count=1 ./internal/agent/runtime/lifecycle/... ./internal/agent/runtime/agentctl/...
  • go vet ./...
  • go build ./...
  • CI Run Backend Tests passes on first attempt

🤖 Generated with Claude Code

Preview Environment

URL https://kandev-pr-1227-bwo7.sprites.app
Commit 0bcdc06
Agent Mock agent

Updates automatically on each push. Destroyed when the PR is closed.

Goroutines spawned by `agentctl.Client.StreamUpdates` /
`StreamWorkspace` and by `lifecycle.StreamManager.connectWorkspaceStream`
could outlive the tests that created them on slow CI runners, causing
intermittent `goleak.VerifyTestMain` failures in
`internal/agent/runtime/lifecycle` even though every individual test
passed. The leak only surfaced under a race: `client.Close()` could
return while a workspace dial was still in flight, leaving the
just-spawned WS read/write loops with nothing to drain them, and
`StreamManager.Wait()` only triggered the drain once the external
`stopCh` had already been closed.

Make the close paths absolute drain barriers:

- `Client.Close()` now tracks every stream goroutine it spawns
  (`streamWG` + a per-stream `WorkspaceStream` reference) and blocks
  until all of them have exited. A `closed` flag flipped under the
  client mutex makes subsequent `StreamUpdates` / `StreamWorkspace`
  calls reject the dial, so a Close that races a dial in flight cannot
  leave behind a stranded WS connection. `manager_interaction.go`
  switches the restart path to per-stream Close helpers so it can keep
  reusing the client after teardown.
- `StreamManager.Wait()` closes a new internal `waitCh` that the retry
  backoff (`sleepOrStop`) and the connected `<-ws.Done() / <-stop>`
  select also observe. This makes `Wait` an absolute drain barrier
  regardless of whether the caller ever closes the external `stopCh`.
  `streamContext` carries both stops through `stopChannelContext` so
  in-flight WebSocket dials cancel on either signal.

Local stress (`make test-lifecycle-goleak LIFECYCLE_GOLEAK_COUNT=20`)
passes 20/20 with `-race` after the fix. The new
`TestClientClose_DrainsStreamGoroutines` covers the drain barrier and
the post-Close rejection path. Added a `test-lifecycle-goleak`
Makefile target as the repro hook for the CI-only flake.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

claude Bot commented Jun 1, 2026

Claude finished @jcfs's task in 6m 43s.


Code Review

Todo:

  • Gather context and understand the PR
  • Review streams.go (lifecycle StreamManager drain barrier)
  • Review client.go + workspace_stream.go (agentctl Client.Close drain)
  • Review agent.go and manager_interaction.go
  • Review client_close_test.go (regression test)
  • Review Makefile changes
  • Post findings

Overall Assessment

Solid, well-reasoned fix for the CI goleak flake. The two-pronged approach — closing waitCh in sm.Wait() so the SM goroutines drain without depending on external stopCh, and adding streamWG + closed guard to Client.Close() so it becomes a true drain barrier — directly addresses the two independent race windows described in the PR. The stopChannelContext rewrite is careful and correct. Most findings below are documentation/minor correctness notes; there are no blocking issues.


Findings

client.go: streamWG comment misrepresents workspace goroutine tracking
Inline comment: streamWG only tracks readUpdatesStream; workspace goroutines are on stream.wg and drained via explicit ws.Wait(), not "transitively" through streamWG.

client.go: double-close path in Client.Close()
Inline comment: CloseWorkspaceStream() closes the raw *websocket.Conn outside closeOnce, then ws.Close() tries to close it again (logged at Debug). Harmless in practice, but the comment "idempotent via closeOnce" is slightly misleading for this call site. Worth either removing CloseWorkspaceStream() from Close() or adding a note explaining both calls serve different purposes (read-loop exit vs. write-loop wakeup).

streams.go: mergeStops goroutine always spawned
Inline comment: Since secondary (= waitCh) is never nil, Done() always spawns mergeStops, including for connectUpdatesStream's context, which returns immediately after the dial. The goroutine is bounded and correct (exits on Wait()), but the primary == nil && secondary == nil early-return in Done() is now dead code. A short note would help future readers not flag this as a leak.

workspace_stream.go: post-dial section doesn't guard concurrent StreamWorkspace callers
Inline comment: Pre-existing gap: the post-dial second-lock check guards only closed, not workspaceStreamConn != nil. Two goroutines that both pass the pre-dial check could both dial and the second would overwrite workspaceStream, orphaning the first conn's goroutines. StreamManager never calls StreamWorkspace concurrently in practice, so this isn't a regression, but adding the check here would seal it for free.

client_close_test.go: mock.connected semantics
Inline comment: m.once fires on the first connection, not both. Both client goroutines are already live before this wait (both dials returned), so the drain works correctly. A clarifying comment would prevent future readers from reading the once as "both connections are live on the server side."


Minor observations (not inline)

  • manager_interaction.go: the switch from execution.agentctl.Close() to CloseUpdatesStream() + CloseWorkspaceStream() is correct and the comment explains the reasoning well.
  • Makefile: test-lifecycle-goleak target is a useful addition. The CGO_PREFIX variable on the go test line is consistent with the rest of the file.
  • The streamContext always returns a *stopChannelContext now (even for test callers that previously got a plain context). This is intentional — secondary (waitCh) provides the drain signal — but note that every dial now spawns a mergeStops goroutine until sm.Wait().


coderabbitai Bot commented Jun 1, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Hardened WebSocket stream connections against shutdown races to prevent goroutine leaks.
    • Improved agent client lifecycle management with coordinated shutdown and drain barriers.
    • Enhanced stream setup to reliably detect and handle client closure during connection operations.
  • Tests

    • Added regression test to verify proper goroutine cleanup during client shutdown.
    • Added CI target for stress-testing lifecycle integrity and resource leak detection.

Walkthrough

This PR hardens goroutine lifecycle management in agent control and stream orchestration to prevent leaks during client shutdown. It adds dual-level drain barriers (Client and StreamManager), detects and handles dial-time races, and introduces stress-testing infrastructure to validate the fixes.

Changes

Goroutine Lifecycle Hardening and Drain Barriers

| Layer / File(s) | Summary |
| --- | --- |
| CI test infrastructure for goroutine leak detection (apps/backend/Makefile) | New test-lifecycle-goleak Make target runs lifecycle test suites with -race and configurable repetition (LIFECYCLE_GOLEAK_COUNT, default 20) to stress-test for goroutine leaks in CI. |
| Client drain-barrier data structures and Close coordination (apps/backend/internal/agent/runtime/agentctl/client.go) | Client struct gains closed flag, workspaceStream reference, and streamWG waitgroup to track spawned goroutines. Close() becomes a coordinated drain flow: mark closed, snapshot workspace stream, close both stream types, wait the workspace stream explicitly, then block on streamWG for full goroutine exit. |
| StreamUpdates hardening with dial-race detection (apps/backend/internal/agent/runtime/agentctl/agent.go) | StreamUpdates acquires lock before dial to fail fast if client closed; after dial re-acquires lock to detect and close the connection if client was closed during the operation. Reader goroutine wraps with defer c.streamWG.Done() to guarantee waitgroup completion. |
| StreamWorkspace hardening with dial-race detection (apps/backend/internal/agent/runtime/agentctl/workspace_stream.go) | StreamWorkspace adds early c.closed check before dial. After dial, re-locks to detect client closure during dial and closes newly-created connection on race. Cleanup deferred function clears both workspaceStreamConn and workspaceStream when they match the current stream. |
| Regression test for Client.Close() goroutine draining (apps/backend/internal/agent/runtime/agentctl/client_close_test.go) | Introduces WebSocket barrier mock server and test that spawns both StreamUpdates and StreamWorkspace goroutines, waits for connectivity, asserts Close() completes within 2 seconds, and verifies subsequent stream calls fail post-Close to prevent goroutine leakage past goleak checks. |
| StreamManager internal drain barrier and context merging (apps/backend/internal/agent/runtime/lifecycle/streams.go) | StreamManager introduces internal waitCh and waitChOnce as a "drain barrier" closed by Wait(). Replaces single-channel stopChannelContext with merged two-stop-channel context (external stopCh + internal waitCh) using sync.Once. Updates Wait() to close waitCh before wg.Wait(), sleepOrStop() to always respect waitCh, and connectWorkspaceStream shutdown select to handle both nil and present external stopCh uniformly. |
| Restart path selective stream closure (apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go) | RestartAgentProcess closes only agentctl streams via CloseUpdatesStream() and CloseWorkspaceStream() instead of calling full agentctl.Close(), avoiding a terminal drain barrier that would block subsequent stream operations within the same restart flow. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • kdlbs/kandev#1139: Modifies the same streams.go lifecycle management and drain-barrier/stop handling logic, with overlapping changes to connectWorkspaceStream and sleepOrStop goroutine-leak prevention.

Poem

🐰 In channels deep where goroutines dance free,
We close the gates with waitgroups so spry,
Race detectors catch leaks for all to see,
Drain barriers ensure no threads slip by,
A safer shutdown makes the codebase sing! 🎵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main objective: fixing goroutine draining in the StreamManager lifecycle tests. It is concise, specific, and clearly conveys the primary change. |
| Description check | ✅ Passed | The description comprehensively covers the problem, solution, and validation. It includes a clear summary explaining the CI flake fix, important changes detailing the drain barrier implementations, and verification steps with test results. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


// streamWG tracks every stream goroutine spawned by this client
// (readUpdatesStream and, transitively via WorkspaceStream.wg, the
// workspace read/write loops). Close waits on it so callers get a true

The comment here says streamWG tracks workspace goroutines "transitively via WorkspaceStream.wg", but that's not accurate. streamWG only tracks readUpdatesStream. Workspace goroutines live on the per-stream stream.wg and are drained in Close() via the explicit ws.Wait() call — not through streamWG at all. Suggest wording like:

// streamWG tracks the updates-stream goroutine (readUpdatesStream) spawned
// by this client. Workspace read/write goroutines are tracked on each
// WorkspaceStream's own wg and drained via ws.Wait() in Close().

c.CloseUpdatesStream()
c.CloseWorkspaceStream()

// Wait for the workspace stream's read/write goroutines (closeOnce makes

CloseWorkspaceStream() (line 686) closes c.workspaceStreamConn directly — outside closeOnce. Then ws.Close() (line 692) enters closeOnce.Do and tries to close the same underlying conn a second time. The closeOnce only guards against multiple ws.Close() calls; it doesn't prevent CloseWorkspaceStream() from pre-closing the raw conn first.

In practice this is harmless — gorilla websocket's Close on an already-closed connection returns an error that is logged at Debug level — but the comment "ws.Close is idempotent via closeOnce" is slightly misleading here because the first close doesn't go through closeOnce. Consider either:

  1. Removing the CloseWorkspaceStream() call from Close() and relying solely on ws.Close() + ws.Wait() to tear down the workspace stream, or
  2. Adding a note that CloseWorkspaceStream triggers the read-loop exit while ws.Close wakes the write-loop, so both are needed.


-func (c stopChannelContext) Done() <-chan struct{} {
-	if c.stopCh == nil {
+func (c *stopChannelContext) Done() <-chan struct{} {

secondary (sm.waitCh) is always non-nil (set unconditionally in NewStreamManager), so the c.primary == nil && c.secondary == nil early-return path is dead code. mergeStops will now be spawned for every WebSocket dial attempt — including connectUpdatesStream's one-shot dial. Each mergeStops goroutine persists until waitCh (or stopCh) fires, which means a goroutine that lives from the moment the dial completes until sm.Wait() is called.

That's fine in production (always a stopCh) and in well-behaved tests (always a Wait()), but it's worth a comment noting that the goroutine is intentional and bounded, so a future reader doesn't flag it as a leak. Also worth noting that the mergeStops goroutine from connectUpdatesStream's ctx outlives connectUpdatesStream itself (which returns immediately after the dial), pinning the stopChannelContext alive until teardown.
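
To make the goroutine lifetime under discussion concrete, here is a hedged sketch of a two-channel merge with a lazily spawned helper. The `Done()` body follows the diff quoted in this review; the `mergedStop` type name and the body of `mergeStops` are assumptions, since the review does not show them.

```go
package main

import (
	"fmt"
	"sync"
)

// mergedStop is a hypothetical reduction of stopChannelContext.
type mergedStop struct {
	primary   <-chan struct{} // external stopCh; may be nil
	secondary <-chan struct{} // internal waitCh
	once      sync.Once
	merged    chan struct{}
}

// Done lazily starts one merge goroutine per context value.
func (c *mergedStop) Done() <-chan struct{} {
	if c.primary == nil && c.secondary == nil {
		return nil // unreachable in practice if secondary is always set
	}
	c.once.Do(func() {
		c.merged = make(chan struct{})
		go c.mergeStops()
	})
	return c.merged
}

// mergeStops (assumed body) blocks until either stop fires, then closes
// merged. The goroutine therefore lives until teardown, not until the dial
// that requested Done() has finished.
func (c *mergedStop) mergeStops() {
	select {
	case <-c.primary: // nil channel: never ready
	case <-c.secondary:
	}
	close(c.merged)
}

func main() {
	waitCh := make(chan struct{})
	ctx := &mergedStop{secondary: waitCh}
	done := ctx.Done() // spawns the merge goroutine
	close(waitCh)      // what Wait() would do
	<-done
	fmt.Println("merged stop fired")
}
```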

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
apps/backend/internal/agent/runtime/lifecycle/streams.go (1)

432-449: 💤 Low value

Consider consolidating the duplicate select branches.

The shutdown select logic duplicates the ws.Done() and waitCh cases. Since stopCh can be selected on even when nil (a receive on nil channel blocks forever), you could simplify to a single select block:

select {
case <-ws.Done():
case <-sm.stopCh:
    shutdown()
case <-sm.waitCh:
    shutdown()
}

A receive on a nil channel never proceeds, so when stopCh is nil, only ws.Done() and waitCh are effective. This removes the conditional branching.

That said, the current explicit nil-check approach is clearer about intent and avoids relying on nil-channel semantics, so this is purely stylistic.
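
The nil-channel behaviour this nitpick relies on can be checked in isolation. The sketch below uses a hypothetical `waitForExit` helper with the same three cases as the suggested select; it is an illustration, not code from streams.go.

```go
package main

import "fmt"

// waitForExit mirrors the consolidated select. When stopCh is nil its case
// can never be chosen, so only done and waitCh are effective.
func waitForExit(done, stopCh, waitCh <-chan struct{}) string {
	select {
	case <-done:
		return "done"
	case <-stopCh:
		return "stop"
	case <-waitCh:
		return "wait"
	}
}

func main() {
	waitCh := make(chan struct{})
	close(waitCh)
	// done is open and stopCh is nil, so waitCh is the only ready case.
	fmt.Println(waitForExit(make(chan struct{}), nil, waitCh)) // prints "wait"
}
```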

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/internal/agent/runtime/lifecycle/streams.go` around lines 432 -
449, The duplicate select branches can be consolidated: remove the if sm.stopCh
== nil conditional and replace both branches with one select that listens for
<-ws.Done(), <-sm.stopCh, and <-sm.waitCh, calling the existing shutdown()
function in the latter two cases; relying on Go's nil-channel semantics (receive
on nil blocks) will make the <-sm.stopCh case inert when sm.stopCh is nil while
preserving behavior for ws.Done() and sm.waitCh.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82a94bf9-322b-4b5b-80e6-124e0f5ada93

📥 Commits

Reviewing files that changed from the base of the PR and between 1fdeda9 and 0bcdc06.

📒 Files selected for processing (7)
  • apps/backend/Makefile
  • apps/backend/internal/agent/runtime/agentctl/agent.go
  • apps/backend/internal/agent/runtime/agentctl/client.go
  • apps/backend/internal/agent/runtime/agentctl/client_close_test.go
  • apps/backend/internal/agent/runtime/agentctl/workspace_stream.go
  • apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go
  • apps/backend/internal/agent/runtime/lifecycle/streams.go

// Race: Close may have fired between the dial returning and us re-acquiring
// the lock. Drop the new conn + stream instead of leaking the read/write
// goroutines past Client.Close's drain barrier.
c.mu.Lock()

Good catch on this race. One minor gap: the post-dial check here guards against closed but not against a second concurrent StreamWorkspace call that also passed the pre-dial workspaceStreamConn != nil check (both raced through the unlock on line 56). If two goroutines both dial and both reach this lock, the second one overwrites workspaceStreamConn / workspaceStream without closing the first connection, orphaning the first conn and its goroutines.

This is a pre-existing issue (the original code didn't check post-dial either), and the StreamManager never calls StreamWorkspace concurrently, so it's not a regression. But since this PR is already touching the post-dial section, adding if c.workspaceStreamConn != nil { conn.Close(); return nil, ... } here would seal the gap completely.
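
A minimal sketch of the post-dial re-check suggested here, assuming a simplified client that releases its lock across the dial. The `client`, `conn`, and `attach` names and the error messages are hypothetical; the point is that the second check covers both `closed` and an already-installed connection.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for *websocket.Conn.
type conn struct{ closed bool }

func (c *conn) Close() error { c.closed = true; return nil }

// client is a hypothetical reduction of agentctl.Client.
type client struct {
	mu     sync.Mutex
	closed bool
	wsConn *conn
}

// attach checks state, dials without the lock, then re-checks before storing.
func (c *client) attach(dial func() *conn) (*conn, error) {
	c.mu.Lock()
	if c.closed || c.wsConn != nil {
		c.mu.Unlock()
		return nil, errors.New("client closed or already streaming")
	}
	c.mu.Unlock()

	nc := dial() // lock is not held here, so state can change underneath us

	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed || c.wsConn != nil {
		_ = nc.Close() // lost the race: drop the fresh conn instead of leaking it
		return nil, errors.New("lost race during dial")
	}
	c.wsConn = nc
	return nc, nil
}

func main() {
	c := &client{}
	_, err := c.attach(func() *conn { return &conn{} })
	fmt.Println("first attach error:", err)
	_, err = c.attach(func() *conn { return &conn{} })
	fmt.Println("second attach error:", err)
}
```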

// goroutine here would block Close forever (or, pre-fix, return early
// and leave the goroutine running past goleak's check).
done := make(chan struct{})
go func() {

m.connected fires on the first server-side WS accept (via sync.Once). Since both StreamUpdates and StreamWorkspace return successfully before this wait, both dials are already complete and both client goroutines are live by the time we reach this select. The wait is a valid sanity check but doesn't actually guarantee the second server-side handler is in its blocking loop yet.

Functionally this is fine — streamWG.Add(1) and stream.wg.Add(2) are registered before the goroutines start, so Close() will drain all of them regardless. But a comment explaining the intent ("both client goroutines are already started; waiting on mock.connected is a belt-and-suspenders check that the read side of the socket is live before we call Close") would prevent future readers from misreading the once as "all connections are live".


greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR fixes a CI-only goleak.VerifyTestMain flake caused by StreamManager.connectWorkspaceStream and WorkspaceStream goroutines outliving test teardown on slow runners. It makes both Client.Close() and StreamManager.Wait() true drain barriers by tracking goroutines in explicit WaitGroups and introducing an internal waitCh that closes when Wait() is called, allowing in-flight retries and the connected-stream select to drain without depending on the external stopCh.

  • Client.Close() now sets a closed flag (with double-check-locking across the dial), adds the updates-stream goroutine to streamWG, and sequentially calls ws.Wait() then streamWG.Wait() before returning.
  • StreamManager.Wait() closes a per-manager waitCh that sleepOrStop, the workspace-stream select, and the new stopChannelContext merge goroutine all observe; RestartAgentProcess switches to per-stream CloseUpdatesStream/CloseWorkspaceStream so the client stays usable across restarts.
  • A TestClientClose_DrainsStreamGoroutines regression test and a make test-lifecycle-goleak Makefile target are added to surface the flake locally.

Confidence Score: 4/5

The drain logic is well-structured and the critical race conditions are correctly handled with double-check locking. The main concern is narrow and bounded by goroutine scheduler timing.

The mergeStops goroutine from stopChannelContext.Done() is not tracked in sm.wg, leaving a narrow window where sm.wg.Wait() can return while a merge goroutine is still finishing in the connectUpdatesStream path. All other drain invariants are correct.

apps/backend/internal/agent/runtime/lifecycle/streams.go deserves a second look around the stopChannelContext.mergeStops goroutine lifetime.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/backend/internal/agent/runtime/agentctl/client.go | Adds workspaceStream, closed, and streamWG fields; makes Close() a true drain barrier that sets the closed flag, waits on ws.Wait() for workspace goroutines, and then streamWG.Wait() for the updates goroutine. Race guards (double-check under lock) for both StreamUpdates and StreamWorkspace are correct. |
| apps/backend/internal/agent/runtime/agentctl/agent.go | Adds pre-dial and post-dial c.closed checks around StreamUpdates; streamWG.Add(1) is done inside the write lock so Close() cannot race it; goroutine wrapper calls streamWG.Done() on exit. Logic is correct. |
| apps/backend/internal/agent/runtime/agentctl/workspace_stream.go | Adds c.closed guard before and after dial; stores c.workspaceStream = stream inside the post-dial lock; readWorkspaceStream defer conditionally clears c.workspaceStream (identity check) to avoid clobbering a newer stream during restart. |
| apps/backend/internal/agent/runtime/lifecycle/streams.go | Adds waitCh/waitChOnce for an internal drain signal; refactors stopChannelContext to merge two stop channels via a lazily-spawned mergeStops goroutine. The mergeStops goroutine is not tracked in sm.wg, creating a narrow theoretical window after sm.wg.Wait() returns for the connectUpdatesStream path. |
| apps/backend/internal/agent/runtime/agentctl/client_close_test.go | New regression test for the Close drain guarantee. mock.connected sync.Once fires only for the first of two WS connections, so the comment overstates what is being synchronized. |
| apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go | Switches RestartAgentProcess from agentctl.Close() to CloseUpdatesStream() + CloseWorkspaceStream() so the client is reusable after teardown. |
| apps/backend/Makefile | Adds test-lifecycle-goleak target with configurable LIFECYCLE_GOLEAK_COUNT and help entry. |

Reviews (1): Last reviewed commit: "fix(lifecycle): drain StreamManager goro..."

Comment on lines +71 to 76
c.once.Do(func() {
c.merged = make(chan struct{})
go c.mergeStops()
})
return c.merged
}

P2 mergeStops goroutine not covered by sm.wg

stopChannelContext.Done() lazily spawns a mergeStops goroutine via sync.Once. For connectWorkspaceStream the wrapper goroutine stays alive long enough (blocked in ws.Wait()) that the merge goroutine always exits before sm.wg.Wait() returns — safe. For connectUpdatesStream though, the wrapper goroutine exits immediately after StreamUpdates returns, so sm.wg.Wait() can return before the merge goroutine finishes closing c.merged. The PR's stated guarantee is that Wait() is an "absolute drain barrier," but this goroutine isn't part of it. goleak's grace period will almost always cover the nanosecond window, but under tight-loop stress tests (-count=20 with -race) a scheduler pause on a slow runner is exactly the scenario this PR is trying to harden against.

Comment on lines +105 to +112
select {
case <-mock.connected:
case <-time.After(2 * time.Second):
t.Fatal("mock server never observed a WS connection")
}

// Close must return promptly and have drained both streams. A hung
// goroutine here would block Close forever (or, pre-fix, return early

P2 mock.connected only fires for the first of two WS connections

The closeBarrierMockServer uses sync.Once to close m.connected, so the channel fires after the very first server-side handler starts — not after both the /api/v1/agent/stream and /api/v1/workspace/stream handlers are running. The comment "Wait for both server-side WS handlers to register" is therefore inaccurate. Both client-side goroutines are already live by the time this select is reached (both dials returned successfully), so the drain assertion is still valid, but using a WaitGroup or counting two connections in the mock would make the test's intent match its implementation and guard against future changes that call these two endpoints in a different sequence.
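
One way to count both connections, as suggested, is a small countdown barrier that each server-side handler calls on accept. This is a sketch under assumptions: `connBarrier`, `arrive`, and `ready` are hypothetical names, and it uses an atomic counter rather than a WaitGroup so that the mock itself spawns no helper goroutine for goleak to notice.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// connBarrier closes all once the expected number of handlers have arrived.
type connBarrier struct {
	remaining atomic.Int32
	all       chan struct{}
}

func newConnBarrier(n int32) *connBarrier {
	b := &connBarrier{all: make(chan struct{})}
	b.remaining.Store(n)
	return b
}

// arrive is what each mock handler would call after accepting a connection.
// Only the call that brings the counter to exactly zero closes the channel.
func (b *connBarrier) arrive() {
	if b.remaining.Add(-1) == 0 {
		close(b.all)
	}
}

// ready reports whether every expected connection has been observed.
func (b *connBarrier) ready() bool {
	select {
	case <-b.all:
		return true
	default:
		return false
	}
}

func main() {
	b := newConnBarrier(2)
	b.arrive()
	fmt.Println(b.ready()) // false: only one of two connections seen
	b.arrive()
	fmt.Println(b.ready()) // true: both handlers have registered
}
```

In a test, the existing `select` on `mock.connected` could wait on `b.all` instead, with the same timeout branch.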

@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/backend/internal/agent/runtime/lifecycle/streams.go">

<violation number="1" location="apps/backend/internal/agent/runtime/lifecycle/streams.go:73">
P2: Per-stream `Done()` now spawns a goroutine that can outlive the stream. Normal `ws.Done()` exits do not close `stopCh`/`waitCh`, so reconnect cycles can pile idle `mergeStops` goroutines until full manager shutdown. Tie cancellation to stream lifetime or reuse a shared manager-level merged channel.</violation>
</file>

<file name="apps/backend/internal/agent/runtime/agentctl/workspace_stream.go">

<violation number="1" location="apps/backend/internal/agent/runtime/agentctl/workspace_stream.go:80">
P2: Re-check `workspaceStreamConn` after the dial and before assigning it. Two concurrent callers can both pass the pre-dial guard, and the later one will overwrite the first stream reference without closing the first connection.</violation>

<violation number="2" location="apps/backend/internal/agent/runtime/agentctl/workspace_stream.go:121">
P1: Clear `workspaceStreamConn` only for the same stream. Old read goroutine can zero out a newer connection pointer. Guard the conn reset like you guarded `workspaceStream`.</violation>
</file>

<file name="apps/backend/internal/agent/runtime/agentctl/client.go">

<violation number="1" location="apps/backend/internal/agent/runtime/agentctl/client.go:54">
P3: This comment overstates what `streamWG` tracks. It only tracks the updates-stream goroutine directly; workspace read/write loops are tracked on `WorkspaceStream.wg` and drained via `ws.Wait()` in `Close()`.</violation>
</file>

defer func() {
c.mu.Lock()
c.workspaceStreamConn = nil
if c.workspaceStream == stream {

P1: Clear workspaceStreamConn only for the same stream. Old read goroutine can zero out a newer connection pointer. Guard the conn reset like you guarded workspaceStream.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/backend/internal/agent/runtime/agentctl/workspace_stream.go, line 121:

<comment>Clear `workspaceStreamConn` only for the same stream. Old read goroutine can zero out a newer connection pointer. Guard the conn reset like you guarded `workspaceStream`.</comment>

<file context>
@@ -105,6 +118,9 @@ func (c *Client) readWorkspaceStream(conn *websocket.Conn, stream *WorkspaceStre
 	defer func() {
 		c.mu.Lock()
 		c.workspaceStreamConn = nil
+		if c.workspaceStream == stream {
+			c.workspaceStream = nil
+		}
</file context>
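
The identity guard requested for `workspaceStreamConn` could look roughly like the sketch below. The `client`, `conn`, `set`, and `clearIfCurrent` names are hypothetical; the sketch only shows that a stale read goroutine's cleanup leaves a newer connection pointer alone.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for *websocket.Conn.
type conn struct{ id int }

// client is a hypothetical reduction of agentctl.Client.
type client struct {
	mu     sync.Mutex
	wsConn *conn
}

func (c *client) set(nc *conn) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.wsConn = nc
}

func (c *client) current() *conn {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.wsConn
}

// clearIfCurrent resets the pointer only if it still refers to the conn
// owned by the exiting read goroutine.
func (c *client) clearIfCurrent(old *conn) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.wsConn == old {
		c.wsConn = nil
	}
}

func main() {
	c := &client{}
	old, next := &conn{id: 1}, &conn{id: 2}
	c.set(old)
	c.set(next)           // restart installed a newer connection
	c.clearIfCurrent(old) // late cleanup from the old read goroutine
	fmt.Println(c.current() == next)
}
```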

-	return c.stopCh
+	c.once.Do(func() {
+		c.merged = make(chan struct{})
+		go c.mergeStops()

P2: Per-stream Done() now spawns a goroutine that can outlive the stream. Normal ws.Done() exits do not close stopCh/waitCh, so reconnect cycles can pile idle mergeStops goroutines until full manager shutdown. Tie cancellation to stream lifetime or reuse a shared manager-level merged channel.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/backend/internal/agent/runtime/lifecycle/streams.go, line 73:

<comment>Per-stream `Done()` now spawns a goroutine that can outlive the stream. Normal `ws.Done()` exits do not close `stopCh`/`waitCh`, so reconnect cycles can pile idle `mergeStops` goroutines until full manager shutdown. Tie cancellation to stream lifetime or reuse a shared manager-level merged channel.</comment>

<file context>
@@ -31,49 +31,104 @@ type StreamManager struct {
-	return c.stopCh
+	c.once.Do(func() {
+		c.merged = make(chan struct{})
+		go c.mergeStops()
+	})
+	return c.merged
</file context>
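
One way to tie the merge goroutine to the stream lifetime could look like the sketch below. It is an illustration under assumptions, not the PR's implementation: the `mergedStop` type, its `streamDone` field, and the `exited` channel (present only so the exit can be observed) are all hypothetical. The merge goroutine selects on a third channel that closes when the stream ends, so it returns on a normal disconnect as well as on manager shutdown.

```go
package main

import (
	"fmt"
	"sync"
)

// mergedStop is a hypothetical stand-in for the per-stream context in the
// diff. Done() returns a channel that closes when any input closes.
type mergedStop struct {
	stopCh     <-chan struct{} // external stop signal
	waitCh     <-chan struct{} // internal drain signal
	streamDone <-chan struct{} // assumption: closes when this stream ends

	once   sync.Once
	merged chan struct{}
	exited chan struct{} // closed when the merge goroutine returns
}

// Done lazily starts one merge goroutine. Because the goroutine also
// watches streamDone, it cannot outlive the stream it belongs to.
func (m *mergedStop) Done() <-chan struct{} {
	m.once.Do(func() {
		m.merged = make(chan struct{})
		m.exited = make(chan struct{})
		go func() {
			defer close(m.exited)
			defer close(m.merged) // runs first: deferred calls are LIFO
			select {
			case <-m.stopCh:
			case <-m.waitCh:
			case <-m.streamDone:
			}
		}()
	})
	return m.merged
}

func main() {
	stop := make(chan struct{}) // never closed in this demo
	wait := make(chan struct{}) // never closed in this demo
	streamDone := make(chan struct{})

	m := &mergedStop{stopCh: stop, waitCh: wait, streamDone: streamDone}
	<-func() <-chan struct{} {
		done := m.Done()
		close(streamDone) // the stream ends normally
		return done
	}()
	<-m.exited
	fmt.Println("merge goroutine exited")
}
```

The alternative named in the comment, a single manager-level merged channel, avoids per-stream goroutines entirely at the cost of not distinguishing individual streams.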

Comment on lines +80 to +81
c.workspaceStreamConn = conn
c.workspaceStream = stream

P2: Re-check workspaceStreamConn after the dial and before assigning it. Two concurrent callers can both pass the pre-dial guard, and the later one will overwrite the first stream reference without closing the first connection.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/backend/internal/agent/runtime/agentctl/workspace_stream.go, line 80:

<comment>Re-check `workspaceStreamConn` after the dial and before assigning it. Two concurrent callers can both pass the pre-dial guard, and the later one will overwrite the first stream reference without closing the first connection.</comment>

<file context>
@@ -57,19 +61,28 @@ func (c *Client) StreamWorkspace(ctx context.Context, callbacks WorkspaceStreamC
+		_ = conn.Close()
+		return nil, fmt.Errorf("agentctl client closed during workspace stream dial")
+	}
+	c.workspaceStreamConn = conn
+	c.workspaceStream = stream
+	c.mu.Unlock()
</file context>
Suggested change
-	c.workspaceStreamConn = conn
-	c.workspaceStream = stream
+	if c.workspaceStreamConn != nil {
+		c.mu.Unlock()
+		_ = conn.Close()
+		return nil, fmt.Errorf("workspace stream already connected")
+	}
+	c.workspaceStreamConn = conn
+	c.workspaceStream = stream
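
The check, dial, re-check pattern behind this suggestion can be exercised in isolation. The sketch below is a simplified model, assuming a hypothetical `client` with a single `conn` field, a `fakeConn` type, and a `dial` callback in place of the real WebSocket dial. The nested call inside the callback stands in for a second caller that finishes its dial while the first is still in flight.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fakeConn is a hypothetical connection that records whether it was closed.
type fakeConn struct{ closed bool }

func (f *fakeConn) Close() error { f.closed = true; return nil }

var errAlreadyConnected = errors.New("workspace stream already connected")

type client struct {
	mu   sync.Mutex
	conn *fakeConn
}

// connect checks the guard, dials without holding the lock, then checks
// the guard again before publishing the connection.
func (c *client) connect(dial func() *fakeConn) (*fakeConn, error) {
	c.mu.Lock()
	if c.conn != nil {
		c.mu.Unlock()
		return nil, errAlreadyConnected
	}
	c.mu.Unlock()

	conn := dial() // slow step; the lock is intentionally not held

	c.mu.Lock()
	defer c.mu.Unlock()
	if c.conn != nil {
		// Another caller won the race: close our conn instead of
		// overwriting the published one.
		_ = conn.Close()
		return nil, errAlreadyConnected
	}
	c.conn = conn
	return conn, nil
}

func main() {
	c := &client{}
	var winner *fakeConn
	loser := &fakeConn{}

	_, err := c.connect(func() *fakeConn {
		// Simulate a second caller completing while this dial is in flight.
		winner, _ = c.connect(func() *fakeConn { return &fakeConn{} })
		return loser
	})

	fmt.Println(err, loser.closed, c.conn == winner)
	// prints "workspace stream already connected true true"
}
```

Without the second check, the outer call would overwrite `c.conn` and the first connection would never be closed.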

Comment on lines +54 to +57
// streamWG tracks every stream goroutine spawned by this client
// (readUpdatesStream and, transitively via WorkspaceStream.wg, the
// workspace read/write loops). Close waits on it so callers get a true
// drain barrier instead of a fire-and-forget conn.Close.

P3: This comment overstates what streamWG tracks. It only tracks the updates-stream goroutine directly; workspace read/write loops are tracked on WorkspaceStream.wg and drained via ws.Wait() in Close().

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/backend/internal/agent/runtime/agentctl/client.go, line 54:

<comment>This comment overstates what `streamWG` tracks. It only tracks the updates-stream goroutine directly; workspace read/write loops are tracked on `WorkspaceStream.wg` and drained via `ws.Wait()` in `Close()`.</comment>

<file context>
@@ -38,11 +38,25 @@ type Client struct {
 	// Shared write mutex for agent stream (used by StreamUpdates and sendStreamRequest)
 	streamWriteMu sync.Mutex
 
+	// streamWG tracks every stream goroutine spawned by this client
+	// (readUpdatesStream and, transitively via WorkspaceStream.wg, the
+	// workspace read/write loops). Close waits on it so callers get a true
</file context>
Suggested change
-	// streamWG tracks every stream goroutine spawned by this client
-	// (readUpdatesStream and, transitively via WorkspaceStream.wg, the
-	// workspace read/write loops). Close waits on it so callers get a true
-	// drain barrier instead of a fire-and-forget conn.Close.
+	// streamWG tracks the updates-stream goroutine (readUpdatesStream)
+	// spawned by this client. Workspace read/write loops are tracked on
+	// each WorkspaceStream's own wg and drained via ws.Wait() in Close().
