
fix(server): clear _task_control on stop_task to prevent terminate#733

Open
dsapandora wants to merge 6 commits into develop from fix/stop-task-cleanup-race

Conversation

@dsapandora
Collaborator

@dsapandora dsapandora commented Apr 30, 2026

Summary

TaskServer.stop_task() did not remove the token from _task_control after stopping the task, leaving phantom registry entries. Any subsequent use() reusing the same token races against the orphaned entry — manifesting as either a hang/timeout or a "Task has already completed" error from the server.

To fix it, I route stop-time cleanup through the existing remove_task() helper so the registry pop, monitor subscription pruning, and task_removed dashboard broadcast all happen consistently — instead of calling control.task.stop_task() directly and leaking the slot.

This should resolve intermittent CI failures in TS/Python integration tests that reuse tokens across terminate → use sequences (most visible on Windows/Linux runners under load).

Also aligned the "should get pipeline status" test timeout (90s → TEST_CONFIG.timeout, 120s) to match the rest of the suite.

Type

Testing

  • Tests added or updated
  • Tested locally
  • ./builder test passes

Checklist

  • Commit messages follow conventional commits
  • No secrets or credentials included
  • Wiki updated (if applicable)
  • Breaking changes documented (if applicable)

Linked Issue

Fixes #

Summary by CodeRabbit

  • Bug Fixes

    • Task termination now fully removes stopped tasks from the server registry immediately, avoiding phantom entries and race conditions and ensuring monitors/cleanup run as expected.
  • Tests

    • Integration test timeout switched to the shared test configuration for more consistent, reliable timing across environments.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

📝 Walkthrough

Walkthrough

stop_task() for LAUNCH/EXECUTE tasks now awaits control.task.stop_task() and then explicitly removes the task’s control record from the server registry via self.remove_task(token), ensuring the registry entry is cleared. A TypeScript test timeout was switched to use TEST_CONFIG.timeout instead of a hardcoded value.

Changes

Cohort / File(s) Summary
Task Server change
packages/ai/src/ai/modules/task/task_server.py
For LAUNCH/EXECUTE tasks, stop_task() now calls await control.task.stop_task() then self.remove_task(token), removing the control record from the server registry (unregisters task, unsubscribes monitors, emits task_removed).
Test timeout refactor
packages/client-typescript/tests/RocketRideClient.test.ts
Replaced hardcoded Jest timeout (90000) with shared TEST_CONFIG.timeout; test behavior unchanged.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant TaskServer
    participant InnerTask
    participant TaskRegistry
    participant MonitorService
    participant Dashboard

    Note over Client,TaskServer: Client requests task stop
    Client->>TaskServer: stop_task(token)
    TaskServer->>InnerTask: await control.task.stop_task()
    InnerTask-->>TaskServer: stopped
    TaskServer->>TaskRegistry: remove_task(token)
    TaskRegistry-->>TaskServer: removal confirmed
    TaskServer->>MonitorService: unsubscribe(token)
    MonitorService-->>TaskServer: unsubscribed
    TaskServer->>Dashboard: emit event: task_removed(token)
    Dashboard-->>TaskServer: ack
    TaskServer-->>Client: stop acknowledged

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

module:client-ts

Suggested reviewers

  • jmaionchi
  • stepmikhaylov

Poem

🐇 I stopped the inner hare with care,

then nudged its record from the lair.
Monitors quiet, dashboard sings,
I hop off light on nimble springs. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'fix(server): clear _task_control on stop_task to prevent terminate' directly describes the main change: removing task control records during stop_task to prevent issues.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues Check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check (✅ Passed): Check skipped because no linked issues were found for this pull request.


@github-actions

No description provided.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/ai/src/ai/modules/task/task_server.py`:
- Around line 1318-1320: Instead of directly popping the registry entry with
self._task_control.pop(token, None) in the stop path, call the existing cleanup
routine remove_task(token) (or extract/pop logic into the shared helper used by
remove_task) so that monitor subscription pruning and the task_removed dashboard
event still execute; locate the stop code that currently uses
self._task_control.pop and replace it with a call to remove_task(token) (or the
new shared helper) to ensure consistent connection-monitor state and lifecycle
signaling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: eadbc6bd-4c68-4665-8adf-30a7e06faafc

📥 Commits

Reviewing files that changed from the base of the PR and between a665f46 and 68a86f5.

📒 Files selected for processing (2)
  • packages/ai/src/ai/modules/task/task_server.py
  • packages/client-typescript/tests/RocketRideClient.test.ts

Comment thread packages/ai/src/ai/modules/task/task_server.py Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/client-typescript/tests/RocketRideClient.test.ts (1)

153-177: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ensure pipeline termination runs even if status polling/assertions fail.

await client.terminate(result.token) is currently after the polling + assertions. If the retry loop throws (server never returns state) or an assertion fails, cleanup won’t run, leaving the task/token around and potentially making subsequent integration tests flaky.

Suggested change
-		it('should get pipeline status', async () => {
-			const result = await client.use({
-				pipeline: getEchoPipeline(),
-				token: PIPELINE_TOKEN,
-			});
-
-			// Retry a few times in case server is busy (tests may run in parallel)
-			const maxAttempts = 5;
-			const delayMs = 2000;
-			let status: Awaited<ReturnType<typeof client.getTaskStatus>> | null = null;
-			for (let attempt = 1; attempt <= maxAttempts; attempt++) {
-				try {
-					status = await client.getTaskStatus(result.token);
-					break;
-				} catch (e) {
-					if (attempt === maxAttempts) throw e;
-					await new Promise((r) => setTimeout(r, delayMs));
-				}
-			}
-
-			expect(status).toHaveProperty('state');
-			expect(Object.values(TASK_STATE)).toContain(status!.state);
-
-			await client.terminate(result.token);
-		}, TEST_CONFIG.timeout);
+		it(
+			'should get pipeline status',
+			async () => {
+				const result = await client.use({
+					pipeline: getEchoPipeline(),
+					token: PIPELINE_TOKEN,
+				});
+
+				try {
+					// Retry a few times in case server is busy (tests may run in parallel)
+					const maxAttempts = 5;
+					const delayMs = 2000;
+					let status: Awaited<ReturnType<typeof client.getTaskStatus>> | null = null;
+					for (let attempt = 1; attempt <= maxAttempts; attempt++) {
+						try {
+							status = await client.getTaskStatus(result.token);
+							break;
+						} catch (e) {
+							if (attempt === maxAttempts) throw e;
+							await new Promise((r) => setTimeout(r, delayMs));
+						}
+					}
+
+					expect(status).not.toBeNull();
+					expect(status).toHaveProperty('state');
+					expect(Object.values(TASK_STATE)).toContain(status!.state);
+				} finally {
+					// Best-effort cleanup so we don't leak tasks/tokens on failure
+					await client.terminate(result.token).catch(() => {});
+				}
+			},
+			TEST_CONFIG.timeout
+		);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/client-typescript/tests/RocketRideClient.test.ts` around lines 153 -
177, Wrap the polling/assertion block in a try/finally so
client.terminate(result.token) always runs: call client.use(...) to get result,
then perform the retry loop and asserts inside try, and invoke
client.terminate(result.token) in the finally block (ensuring result and
result.token are in scope before the try). This guarantees termination even if
client.getTaskStatus throws or assertions fail while still using the same
result.token.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@packages/client-typescript/tests/RocketRideClient.test.ts`:
- Around line 153-177: Wrap the polling/assertion block in a try/finally so
client.terminate(result.token) always runs: call client.use(...) to get result,
then perform the retry loop and asserts inside try, and invoke
client.terminate(result.token) in the finally block (ensuring result and
result.token are in scope before the try). This guarantees termination even if
client.getTaskStatus throws or assertions fail while still using the same
result.token.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0a6c76c3-8ff9-4c80-85c8-5e6fd2dcfb62

📥 Commits

Reviewing files that changed from the base of the PR and between 093decd and 0279d0f.

📒 Files selected for processing (1)
  • packages/client-typescript/tests/RocketRideClient.test.ts

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
packages/ai/src/ai/modules/task/task_server.py (1)

1321-1323: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Route stop-time removal through shared cleanup (or equivalent helper), not raw registry pop.

Line 1323 directly pops _task_control, which skips remove_task() side effects (monitor subscription pruning and task_removed dashboard broadcast). That can leave stale monitor state and inconsistent lifecycle signaling after terminate.

Suggested fix
-                # Free the slot so a subsequent use() with the same token does
-                # not race against a phantom registry entry.
-                self._task_control.pop(token, None)
+                # Free the slot using shared cleanup to keep monitor state and
+                # dashboard lifecycle events consistent.
+                await self.remove_task(token)

If double-stop is a concern here, extract a shared “post-stop cleanup” helper from remove_task() and call that from both paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/ai/src/ai/modules/task/task_server.py` around lines 1321 - 1323,
Instead of directly popping self._task_control in the terminate path, invoke the
shared post-stop cleanup used by remove_task() (or extract such a helper from
remove_task()) so that monitor subscription pruning and the task_removed
dashboard broadcast still run; update the terminate/stop code that currently
calls self._task_control.pop(token, None) to call the new helper (or
remove_task(token) if appropriate) and ensure it is idempotent to handle
double-stop calls safely while preserving monitor/state cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@packages/ai/src/ai/modules/task/task_server.py`:
- Around line 1321-1323: Instead of directly popping self._task_control in the
terminate path, invoke the shared post-stop cleanup used by remove_task() (or
extract such a helper from remove_task()) so that monitor subscription pruning
and the task_removed dashboard broadcast still run; update the terminate/stop
code that currently calls self._task_control.pop(token, None) to call the new
helper (or remove_task(token) if appropriate) and ensure it is idempotent to
handle double-stop calls safely while preserving monitor/state cleanup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dd53a7da-6820-41cd-8848-37b32bc7d58c

📥 Commits

Reviewing files that changed from the base of the PR and between 0279d0f and 4c7798e.

📒 Files selected for processing (1)
  • packages/ai/src/ai/modules/task/task_server.py

Collaborator

@Rod-Christensen Rod-Christensen left a comment


This is by design. Any task can restart, but it is removed 60 seconds after termination.

