fix(agentctl): order workspace stream wg.Add before publish to fix Close race #1272
…ose race

Client.StreamWorkspace published the new WorkspaceStream pointer to c.workspaceStream under the lock, then released the lock and called stream.wg.Add(2) afterwards. A concurrent Client.Close captured c.workspaceStream under the same lock and then called ws.Wait() outside of it, racing the subsequent Add(2) on a WaitGroup whose counter was still 0.

The race detector flagged this as Add-after-Wait on the same counter address. It reproduced in CI in TestInitializeAndPrompt_StreamBeforeInitialize and TestManager_RestartAgentProcess_StopErrorIsNonFatal, both of which use a deferred client.Close after the StreamManager spawns the workspace stream goroutine.

Move stream.wg.Add(2) under the same lock that publishes the pointer. The unlock/lock pair now carries the Add into Close's view, so any caller that observes c.workspaceStream is guaranteed to see the counter already bumped before invoking Wait.

Adds TestClientClose_StreamWorkspaceAddWaitRace as a deterministic regression: it waits for the server to observe the WS upgrade, then fires Close in parallel with StreamWorkspace. Pre-fix, this triggers the race within a handful of iterations under -race.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
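The ordering described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the agentctl code: the `stream` and `client` types, the `done` channel, and clearing the pointer inside `Close` are all assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// stream is a hypothetical stand-in for WorkspaceStream.
type stream struct {
	wg   sync.WaitGroup
	done chan struct{}
}

// client is a hypothetical stand-in for Client.
type client struct {
	mu sync.Mutex
	ws *stream
}

// StreamWorkspace bumps the counter and publishes the pointer in the same
// critical section, so a Close that observes the pointer also observes the Add.
func (c *client) StreamWorkspace() *stream {
	s := &stream{done: make(chan struct{})}

	c.mu.Lock()
	s.wg.Add(2) // post-fix position: before publication, under the lock
	c.ws = s
	c.mu.Unlock()
	// Pre-fix, the Add(2) sat here, after the unlock, where it could race Wait.

	// Two workers stand in for the read and write loops.
	for i := 0; i < 2; i++ {
		go func() {
			defer s.wg.Done()
			<-s.done
		}()
	}
	return s
}

// Close captures the pointer under the lock and waits outside it.
// Clearing c.ws here is an assumption that makes Close idempotent in this sketch.
func (c *client) Close() {
	c.mu.Lock()
	s := c.ws
	c.ws = nil
	c.mu.Unlock()
	if s == nil {
		return
	}
	close(s.done)
	s.wg.Wait()
}

func main() {
	c := &client{}
	c.StreamWorkspace()
	c.Close()
	fmt.Println("closed cleanly")
}
```

The mutex is what makes this safe: an unlock happens before any later lock of the same mutex, so everything written before `c.mu.Unlock()` in `StreamWorkspace`, including the counter bump, is visible to a `Close` that acquires the lock afterwards.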
Claude finished @jcfs's task in 3m 6s

Code Review

Summary

The fix is correct and well-reasoned. Moving …

Correctness walkthrough:
Two minor test observations (see inline comments):
Verdict: Approve. The race fix is minimal, correct, and explained clearly. The regression test directly targets the failing CI window and is a good addition.
Walkthrough

This PR fixes a race condition in the …

Changes

StreamWorkspace synchronization race fix

Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 5 passed
| Filename | Overview |
|---|---|
| apps/backend/internal/agent/runtime/agentctl/workspace_stream.go | Moves wg.Add(2) under the publication lock to fix Add-vs-Wait data race; the fix is minimal, correct, and well-commented. |
| apps/backend/internal/agent/runtime/agentctl/client_close_test.go | Adds a targeted regression test that reproduces the exact CI race window; test structure, timeouts, and iteration count are well-suited for the -race detector. |
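The shape of the regression test described in the table above can be approximated with a small stress loop. This is a sketch under stated assumptions: `wsStream`, `wsClient`, and the `upgraded` channel are hypothetical stand-ins (the real test uses a mock server that observes the WebSocket upgrade), and the loop only shows how `Close` is fired into the window between dial and publication.

```go
package main

import (
	"fmt"
	"sync"
)

// wsStream and wsClient are hypothetical stand-ins, not the agentctl types.
type wsStream struct {
	wg   sync.WaitGroup
	done chan struct{}
}

type wsClient struct {
	mu sync.Mutex
	ws *wsStream
}

// streamWorkspace closes `upgraded` at the point where the real test's server
// would observe the upgrade, then publishes with Add(2) already applied.
func (c *wsClient) streamWorkspace(upgraded chan<- struct{}) {
	s := &wsStream{done: make(chan struct{})}
	close(upgraded)

	c.mu.Lock()
	s.wg.Add(2)
	c.ws = s
	c.mu.Unlock()

	for i := 0; i < 2; i++ {
		go func() {
			defer s.wg.Done()
			<-s.done
		}()
	}
}

// closeClient reports whether it observed a published stream.
func (c *wsClient) closeClient() bool {
	c.mu.Lock()
	s := c.ws
	c.ws = nil
	c.mu.Unlock()
	if s == nil {
		return false
	}
	close(s.done)
	s.wg.Wait()
	return true
}

// stress runs the window `iterations` times and returns how often the
// concurrent Close landed after publication.
func stress(iterations int) int {
	observed := 0
	for i := 0; i < iterations; i++ {
		c := &wsClient{}
		upgraded := make(chan struct{})
		finished := make(chan struct{})
		go func() {
			defer close(finished)
			c.streamWorkspace(upgraded)
		}()
		<-upgraded // the race window opens here
		if c.closeClient() {
			observed++
		}
		<-finished
		c.closeClient() // clean up if the first Close ran before publication
	}
	return observed
}

func main() {
	n := stress(200)
	fmt.Printf("Close observed the stream in %d of 200 iterations\n", n)
}
```

Run under `go run -race`, this version should stay quiet because the `Add(2)` is inside the critical section. Moving the `Add(2)` below `c.mu.Unlock()` would reintroduce the Add-vs-Wait ordering that the PR removes.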
Sequence Diagram
sequenceDiagram
participant SW as StreamWorkspace goroutine
participant Lock as c.mu (lock)
participant Close as Client.Close goroutine
Note over SW: Dial WebSocket (outside lock)
SW->>Lock: c.mu.Lock()
SW->>SW: stream.wg.Add(2) ← now under lock
SW->>SW: "c.workspaceStream = stream"
SW->>Lock: c.mu.Unlock()
par Goroutines launched
SW->>SW: go readWorkspaceStream → defer wg.Done()
SW->>SW: go writeLoop → defer wg.Done()
and Close fires
Close->>Lock: c.mu.Lock()
Close->>Close: "ws = c.workspaceStream (sees counter=2)"
Close->>Lock: c.mu.Unlock()
Close->>Close: CloseWorkspaceStream() / ws.Close()
Close->>Close: ws.Wait() — blocks until both Done()
end
SW-->>Close: wg.Done() x2 → Wait() returns
Reviews (2): Last reviewed commit: "test(agentctl): defer ctx cancel in Stre..."
Wrap the per-iteration body in a runIter closure so `defer cancel()` covers every exit path (timeout fatals as well as the normal flow), replacing the explicit cancel() calls. Pure test hygiene — addresses Claude review comment on PR #1272. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude finished @jcfs's task in 1m 30s

Code Review

Summary

The PR is correct and ready to merge. The previous review comment about …

Race fix verified correct (…

Regression test verified (…
No new findings. The fix is minimal, targeted, and the regression test directly covers the CI failure window shown in the race trace.
Summary
- Fixes the `WARNING: DATA RACE` on `WorkspaceStream.wg` (`Add(2)` vs `Wait()`) that has been failing Backend Tests on `main` for the last several pushes (`TestInitializeAndPrompt_StreamBeforeInitialize`, `TestManager_RestartAgentProcess_StopErrorIsNonFatal`).
- `Client.StreamWorkspace` published `c.workspaceStream = stream` under the lock, then released the lock and ran `stream.wg.Add(2)` afterwards. A concurrent `Client.Close` could capture the new pointer under the same lock and call `ws.Wait()` outside it, racing the not-yet-executed `Add(2)` on a counter still at 0.
- Moves `stream.wg.Add(2)` under the same lock that publishes the pointer. The unlock/lock pair now carries the Add into `Close`'s view, so any caller that observes `c.workspaceStream` already sees the counter bumped before invoking `Wait`.
- Adds `TestClientClose_StreamWorkspaceAddWaitRace`, which waits for the mock server to observe the WS upgrade, then fires `Close` in parallel with `StreamWorkspace`. Pre-fix it deterministically triggers the same race in under 10 iterations; post-fix it stays quiet across 200+ iterations under `-race`.

Race trace from CI (run 26925964563)
Test plan
- `go test -race -count=1 ./internal/agent/runtime/...`: green
- `go test -race -run TestClientClose_StreamWorkspaceAddWaitRace -count=20 -cpu=8 ./internal/agent/runtime/agentctl/`: green `-race`
- `make fmt lint`: clean

🤖 Generated with Claude Code