Skip to content

fix: improve ACP connection reliability with spawn retry and auto-reconnect#2804

Merged
tanzhenxin merged 1 commit intoQwenLM:mainfrom
qqqys:feat/vscode-acp
Apr 5, 2026
Merged

fix: improve ACP connection reliability with spawn retry and auto-reconnect#2804
tanzhenxin merged 1 commit intoQwenLM:mainfrom
qqqys:feat/vscode-acp

Conversation

@qqqys
Copy link
Copy Markdown
Collaborator

@qqqys qqqys commented Apr 1, 2026

TLDR

Improve ACP connection reliability by adding spawn retry logic and automatic reconnection when the ACP process dies unexpectedly.

Screenshots / Video Demo

N/A — no user-facing UI change. This is a resilience improvement for the ACP connection layer. When the ACP process crashes, users will now see an automatic reconnection attempt instead of a silent failure, and a clear error message if reconnection fails.

Dive Deeper

This PR addresses connection reliability issues where the ACP process could fail to start (e.g., SIGTERM during the 1-second startup grace period) or die unexpectedly during a session.

Three key changes:

  1. Spawn retry with exponential backoff (qwenConnectionHandler.ts): connection.connect() now retries up to 3 times with exponential backoff (1s, 2s, 4s) on transient spawn failures, instead of failing immediately.

  2. Disconnect event propagation (qwenAgentManager.ts + chatTypes.ts): Added onDisconnected callback that fires when the ACP child process exits unexpectedly. This event propagates from the connection layer through QwenAgentManager to WebViewProvider.

  3. Auto-reconnect on unexpected disconnect (WebViewProvider.ts): When an initialized agent disconnects unexpectedly, WebViewProvider automatically attempts to re-establish the connection (up to 3 attempts with exponential backoff). If all attempts fail, a user-facing error message is shown prompting them to use the refresh button.

Reviewer Test Plan

  1. Pull the branch and run npm run test in packages/vscode-ide-companion — new tests cover the retry logic.
  2. To manually test spawn retry: temporarily break the CLI entry path or add a short delay/kill in the spawn process, and observe retry logs in the Output panel.
  3. To manually test auto-reconnect: start a session, then kill the ACP child process (kill <pid>), and verify:
    • Console logs show reconnect attempts
    • Connection re-establishes automatically
    • If you kill it repeatedly to exhaust retries, the agentConnectionError message appears in the webview

Testing Matrix

🍏 🪟 🐧
npm run
npx
Docker
Podman - -
Seatbelt - -

Linked issues / bugs

Reverts the revert in #2792 (which reverted #2666), reimplementing the reconnect logic with a more robust approach that includes spawn-level retry.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

📋 Review Summary

This PR improves ACP connection reliability by implementing a two-layer retry mechanism: spawn-level retry with exponential backoff in QwenConnectionHandler and automatic reconnection at the WebViewProvider level when the ACP process dies unexpectedly. The implementation is well-tested, follows existing code patterns, and addresses a real reliability issue where the ACP process could fail during startup or crash during a session.

🔍 General Feedback

  • Positive aspects:

    • Clear separation of concerns: spawn retry (connection layer) vs. runtime reconnection (UI layer)
    • Comprehensive test coverage for the retry logic with fake timers
    • Good logging throughout for debugging connection issues
    • Proper event propagation from connection → manager → webview provider
    • Exponential backoff with reasonable caps (4s for spawn, 5s for reconnect)
  • Architecture:

    • The three-layer approach (ConnectionHandler → QwenAgentManager → WebViewProvider) maintains clean abstraction boundaries
    • Event-driven design for disconnect notifications is appropriate for this use case
  • Code quality:

    • Consistent with existing code style and patterns
    • Good use of TypeScript types
    • Clear comments explaining the retry logic

🎯 Specific Feedback

🟡 High

  • File: packages/vscode-ide-companion/src/webview/providers/WebViewProvider.ts:987-1025 - The attemptAutoReconnect method calls doInitializeAgentConnection() but this method is not visible in the diff. Ensure this method properly handles the cliEntryPath parameter since reconnect() in QwenAgentManager requires it. If doInitializeAgentConnection doesn't have access to the CLI path, reconnection may fail.

  • File: packages/vscode-ide-companion/src/services/qwenAgentManager.ts:363-375 - The reconnect() method catches disconnect errors silently. While this is reasonable for idempotency, consider logging when an exception occurs during disconnect to help debug edge cases where the connection state is inconsistent.

🟢 Medium

  • File: packages/vscode-ide-companion/src/services/qwenConnectionHandler.ts:92-112 - The retry loop uses Math.pow(2, attempt - 1) for exponential backoff, but the delay calculation could be extracted to a helper function for reusability and testability. This pattern appears in multiple places (also in WebViewProvider.ts:997).

  • File: packages/vscode-ide-companion/src/webview/providers/WebViewProvider.ts:442-448 - The disconnect callback checks this.agentInitialized && !this.isReconnecting before attempting reconnect. Consider also checking if the disconnect was "expected" (e.g., user-initiated disconnect vs. crash). Currently, any disconnect triggers reconnection logic, which might not be desired if the user explicitly disconnected.

  • File: packages/vscode-ide-companion/src/services/qwenConnectionHandler.test.ts:154-187 - The test "retries connect on spawn failure and succeeds on second attempt" uses fake timers but doesn't explicitly advance time. While shouldAdvanceTime: true handles this, consider adding explicit vi.advanceTimersByTime() calls for clarity and to ensure the delay logic is actually being tested.

🔵 Low

  • File: packages/vscode-ide-companion/src/types/chatTypes.ts:81 - The onDisconnected callback is added to QwenAgentCallbacks but lacks JSDoc documentation. Consider adding a brief comment explaining when this callback fires (unexpected disconnects only).

  • File: packages/vscode-ide-companion/src/services/qwenAgentManager.ts:363-375 - The reconnect() method's JSDoc says "Re-spawns the ACP process and creates a new session" but the implementation calls disconnect() first. Consider clarifying that it performs a clean disconnect before reconnecting.

  • File: packages/vscode-ide-companion/src/webview/providers/WebViewProvider.ts:1017-1022 - The error message "Lost connection to Qwen agent and auto-reconnect failed. Please use the refresh button to try again." is user-facing but hardcoded. Consider if this should be localized or follow a standard error message pattern used elsewhere in the extension.

  • File: packages/vscode-ide-companion/src/services/qwenConnectionHandler.ts:96 - The log message uses an emoji (🚀 CONNECT() CALLED) which is inconsistent with the more professional logging style elsewhere. Consider standardizing the log format.

✅ Highlights

  • Excellent test coverage: The new test cases in qwenConnectionHandler.test.ts thoroughly cover the retry logic including success on first attempt, retry success, and exhaustion scenarios.

  • Clean event propagation: The onDisconnected callback chain from AcpConnectionQwenAgentManagerWebViewProvider is well-implemented and maintains separation of concerns.

  • Thoughtful backoff strategy: Using exponential backoff with caps (4s for spawn retry, 5s for reconnect) shows consideration for both quick recovery and avoiding resource exhaustion.

  • Good logging: Comprehensive logging at each stage will help debug connection issues in production.

  • Proper state management: The isReconnecting flag prevents concurrent reconnection attempts, avoiding race conditions.

@tanzhenxin tanzhenxin merged commit c9a64da into QwenLM:main Apr 5, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants