Add blue-green AgentUpgrade for host daemon binary #127

Merged: bcho merged 48 commits into main from copilot/agent-upgrade-blue-green-process on May 8, 2026

Contributor

Copilot AI commented May 6, 2026

Adds host-side AgentUpgrade execution through MachineOperation, using spec.parameters.downloadURL as the source for the replacement agent binary. The daemon now keeps a last-known-good binary and installs systemd recovery wiring for rollback if the upgraded daemon fails to stay healthy.

  • MachineOperation handling

    • Executes OperationAgentUpgrade in the agent daemon instead of only queueing it.
    • Requires spec.parameters["downloadURL"].
    • Logs the AgentUpgrade download URL before staging.
    • Marks operations failed on invalid parameters, staging errors, staged binary preflight failures, restart command failures, or post-switch daemon rollback signals.
    • Records successful staging in a JSON signal before restarting unbounded-agent-daemon.service.
    • Completes successful operations from the restarted daemon by reading and clearing the pending JSON signal.
    • Publishes daemon rollback failure signals back to the related MachineOperation by marking it Failed with reason DaemonFailed.
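The parameter check described above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper name downloadURLFromParameters, the map[string]string parameter type, and the restriction to absolute http(s) URLs are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// downloadURLKey is the parameter name required by the PR description.
const downloadURLKey = "downloadURL"

// downloadURLFromParameters is a hypothetical helper. It returns the
// downloadURL parameter, or an error the caller can use to mark the
// operation failed before any staging starts.
func downloadURLFromParameters(params map[string]string) (string, error) {
	raw, ok := params[downloadURLKey]
	if !ok || raw == "" {
		return "", errors.New(`spec.parameters["downloadURL"] is required`)
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse downloadURL: %w", err)
	}
	// Assumption for this sketch: only absolute http(s) URLs are accepted.
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return "", fmt.Errorf("downloadURL %q must be an absolute http(s) URL", raw)
	}
	return raw, nil
}

func main() {
	params := map[string]string{
		downloadURLKey: "https://example.com/releases/unbounded-agent-linux-amd64.tar.gz",
	}
	got, err := downloadURLFromParameters(params)
	fmt.Println(got, err)
}
```

Validating before download keeps invalid-parameter failures separate from staging errors, which the description lists as distinct failure causes.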
  • Blue-green daemon binary staging

    • Describes the current and next desired binary state through goalstates.AgentUpgradePaths.
    • Resolves environment overrides and the current symlink target through goalstates.ResolvedAgentUpgradePaths(), then computes the inactive blue/green target through AgentUpgradePaths.NextTargetPath().
    • Downloads an unbounded-agent release tarball using the shared agentbinary and utilio helpers.
    • Extracts and stages the new binary into the inactive blue/green slot.
    • Rejects empty agent binary archive entries.
    • Preflights the staged binary with unbounded-agent version from the shared agentbinary install path immediately after download/install and before switching symlinks.
    • Installs the staged binary and switches blue-green symlinks through agentbinary.InstallAndSwitchFromTarGz.
    • Updates after preflight succeeds:
      • unbounded-agent-current -> newly staged binary
      • unbounded-agent-last-good -> previous current binary
    • Uses renameio.Symlink through utilio for atomic symlink replacement.
    • Keeps current target resolution and initial daemon binary target selection on the goalstates.AgentUpgradePaths struct to avoid leaking goal-state resolution details into daemon logic.
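The slot selection and symlink switch above can be sketched with the standard library alone. The slots type, promote, and the file names are hypothetical, and replaceSymlink is a stdlib stand-in for renameio.Symlink rather than the PR's utilio wrapper. Updating last-good before current is a choice made in this sketch; the description does not state the order.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// slots holds the two binary paths; the type and names are hypothetical.
type slots struct{ Blue, Green string }

// nextTarget returns the slot that current does not point at.
func (s slots) nextTarget(current string) string {
	if current == s.Blue {
		return s.Green
	}
	return s.Blue
}

// replaceSymlink creates a temporary link and renames it over link, so the
// path always resolves. It stands in for renameio.Symlink.
func replaceSymlink(target, link string) error {
	tmp := link + ".tmp"
	_ = os.Remove(tmp) // drop leftovers from an interrupted run
	if err := os.Symlink(target, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, link)
}

// promote records the previous binary as last-good, then flips current to
// the inactive slot. Call it only after preflight has passed.
func promote(s slots, currentLink, lastGoodLink string) error {
	previous, err := os.Readlink(currentLink)
	if err != nil {
		return err
	}
	if err := replaceSymlink(previous, lastGoodLink); err != nil {
		return err
	}
	return replaceSymlink(s.nextTarget(previous), currentLink)
}

// promoteDemo runs one switch in a temp directory and returns the base
// names that current and last-good point at afterwards.
func promoteDemo() (string, string, error) {
	dir, err := os.MkdirTemp("", "bluegreen")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	s := slots{Blue: filepath.Join(dir, "agent-blue"), Green: filepath.Join(dir, "agent-green")}
	cur, lg := filepath.Join(dir, "current"), filepath.Join(dir, "last-good")
	for _, link := range []string{cur, lg} {
		if err := os.Symlink(s.Blue, link); err != nil {
			return "", "", err
		}
	}
	if err := promote(s, cur, lg); err != nil {
		return "", "", err
	}
	c, err := os.Readlink(cur)
	if err != nil {
		return "", "", err
	}
	l, err := os.Readlink(lg)
	if err != nil {
		return "", "", err
	}
	return filepath.Base(c), filepath.Base(l), nil
}

func main() {
	fmt.Println(promoteDemo())
}
```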
  • Systemd recovery path

    • Runs the daemon through unbounded-agent-current.
    • Adds unbounded-agent-daemon-recovery.service.
    • Adds recovery script that switches current back to last-good and restarts the daemon.
    • Uses the last-known-good agent binary to run a hidden record-agent-upgrade-failure-signal subcommand when recovery is triggered for an in-flight AgentUpgrade.
    • Keeps AgentUpgrade recovery signal JSON serialization in Go instead of duplicating signal structure in the shell recovery script.
    • The recovered daemon publishes and clears AgentUpgrade success or failure signals in a single startup path.
    • Renders unit/script paths with Go templates from goalstates constants to avoid path drift.
    • Defines daemon binary override and recovery signal path environment variable names in goalstates.
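Rendering unit files from shared constants, as described above, might look like the sketch below. The unit name unbounded-agent-daemon-recovery.service comes from the description; the current-link path, the template text, and the use of systemd's OnFailure= directive to trigger recovery are assumptions, and the daemon's real command-line arguments are omitted.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

const (
	// Assumed location of the current symlink; the PR defines the real
	// value in goalstates.
	currentLink = "/usr/local/bin/unbounded-agent-current"
	// Unit name taken from the PR description.
	recoveryUnit = "unbounded-agent-daemon-recovery.service"
)

// unitTemplate is a hypothetical, heavily trimmed daemon unit.
var unitTemplate = template.Must(template.New("unit").Parse(`[Unit]
OnFailure={{.RecoveryUnit}}

[Service]
ExecStart={{.CurrentLink}}
`))

// unitData carries the values substituted into the template.
type unitData struct {
	CurrentLink  string
	RecoveryUnit string
}

// renderUnit fills the template so that paths come from one place only.
func renderUnit(d unitData) (string, error) {
	var b strings.Builder
	if err := unitTemplate.Execute(&b, d); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	out, err := renderUnit(unitData{CurrentLink: currentLink, RecoveryUnit: recoveryUnit})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because the unit and the recovery script would both be rendered from the same constants, renaming a path in one place cannot leave the other pointing at a stale location.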
  • AgentUpgrade signal management

    • Uses a single JSON signal payload for both pending operation tracking and daemon failure reporting.
    • Uses one goalstate-resolved signal file at AgentUpgradePaths.SignalPath for both pending and failure states.
    • Uses failureMessage only for daemon rollback failure signals, while pending success signals omit it.
    • Manages the signal file through an agentUpgradeSignalOperator interface with a filesystem-backed implementation.
    • Treats AgentUpgrade signal files as JSON-only.
    • Removes hidden recovery command support for overriding signal paths; the command uses the goalstate-resolved path.
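A minimal sketch of the single-payload signal and its operator follows. Only the failureMessage field, the DaemonFailed reason, and the Failed outcome come from the description; the operationName field, the signalOperator method set, the in-memory implementation, and the "Succeeded" phase string are hypothetical stand-ins for the PR's agentUpgradeSignalOperator and its filesystem-backed implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// upgradeSignal is one payload for both states. failureMessage is omitted
// for pending signals and set only for daemon rollback failures.
type upgradeSignal struct {
	OperationName  string `json:"operationName"` // assumed field
	FailureMessage string `json:"failureMessage,omitempty"`
}

func encodeSignal(s upgradeSignal) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

// signalOperator is a hypothetical method set for managing the signal.
type signalOperator interface {
	Write(upgradeSignal) error
	Read() (upgradeSignal, bool, error)
	Clear() error
}

// memOperator keeps the JSON in memory; a real one would use a file.
type memOperator struct{ raw string }

func (m *memOperator) Write(s upgradeSignal) error {
	raw, err := encodeSignal(s)
	if err == nil {
		m.raw = raw
	}
	return err
}

func (m *memOperator) Read() (upgradeSignal, bool, error) {
	var s upgradeSignal
	if m.raw == "" {
		return s, false, nil
	}
	err := json.Unmarshal([]byte(m.raw), &s)
	return s, err == nil, err
}

func (m *memOperator) Clear() error { m.raw = ""; return nil }

// publishOnStartup reads the signal, derives the outcome, and clears it,
// mirroring the single startup path described above.
func publishOnStartup(op signalOperator) (phase, reason string, found bool, err error) {
	s, ok, err := op.Read()
	if err != nil || !ok {
		return "", "", false, err
	}
	phase = "Succeeded" // assumed phase name
	if s.FailureMessage != "" {
		phase, reason = "Failed", "DaemonFailed"
	}
	return phase, reason, true, op.Clear()
}

func main() {
	op := &memOperator{}
	_ = op.Write(upgradeSignal{OperationName: "upgrade-agent-on-worker-1"})
	fmt.Println(publishOnStartup(op))
}
```

Keeping the operator behind an interface lets unit tests swap in a fake like memOperator, so the startup logic can be tested without touching the filesystem.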
  • Install/bootstrap alignment

    • The install script downloads the first agent binary into /usr/local/bin/unbounded-agent and runs it directly.
    • The agent initializes current, last-good, blue slot, and compatibility symlinks during daemon enablement.
    • When only the compatibility binary exists, Go-side daemon bootstrap copies it into the blue slot before linking current and last-good.
    • Preserves /usr/local/bin/unbounded-agent as a compatibility symlink to the current agent.
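The bootstrap case where only the compatibility binary exists can be sketched as below. Function names and file names are hypothetical, and the final replacement of the compatibility binary is done with a plain remove-then-symlink for brevity, which is not atomic; per the description, the PR itself replaces symlinks atomically.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyExecutable copies src to dst and requests executable permissions.
func copyExecutable(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// bootstrapLinks seeds the blue slot from a regular compat binary, links
// current and last-good to it, and turns compat into a link to current.
func bootstrapLinks(compat, blue, current, lastGood string) error {
	info, err := os.Lstat(compat)
	if err != nil {
		return err
	}
	if !info.Mode().IsRegular() {
		return nil // compat is already a symlink: nothing to seed
	}
	if err := copyExecutable(compat, blue); err != nil {
		return err
	}
	for _, link := range []string{current, lastGood} {
		if err := os.Symlink(blue, link); err != nil {
			return err
		}
	}
	// Simplification: remove-then-link leaves a brief gap at compat.
	if err := os.Remove(compat); err != nil {
		return err
	}
	return os.Symlink(current, compat)
}

// bootstrapDemo runs the bootstrap in a temp directory and returns the base
// name of the file that the compat path finally resolves to.
func bootstrapDemo() (string, error) {
	dir, err := os.MkdirTemp("", "bootstrap")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	compat := filepath.Join(dir, "unbounded-agent")
	if err := os.WriteFile(compat, []byte("#!/bin/sh\n"), 0o755); err != nil {
		return "", err
	}
	err = bootstrapLinks(compat,
		filepath.Join(dir, "unbounded-agent-blue"),
		filepath.Join(dir, "unbounded-agent-current"),
		filepath.Join(dir, "unbounded-agent-last-good"))
	if err != nil {
		return "", err
	}
	resolved, err := filepath.EvalSymlinks(compat)
	if err != nil {
		return "", err
	}
	return filepath.Base(resolved), nil
}

func main() {
	fmt.Println(bootstrapDemo())
}
```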
  • Design documentation

    • Adds designs/agent-upgrade.md describing the AgentUpgrade state machine, host paths, staging and symlink switching, signal file, successful startup publication, recovery path, failure cases, and sequential blue-green upgrades.
  • Validation coverage

    • Extends the agent Kind e2e workflow to validate a successful AgentUpgrade.
    • Adds e2e validation that an always-failing upgraded binary fails the operation without changing the current symlink.
    • Adds rollback validation with a daemon-failing but preflight-valid upgraded binary to verify recovery restores the last-known-good binary and publishes the operation failure signal.
    • Dumps AgentUpgrade failure reason/message details during e2e rollback diagnostics.
    • Adds sequential upgrade unit coverage for success/success, success/failure, failure/failure, and failure/success operation sequences.
    • Adds unit coverage for initial daemon binary link bootstrap, executable binary target selection, blue-slot seeding from the initial compatibility binary, broken staged binary rejection, JSON signal publication/cleanup, recovery signal subcommand behavior, signal operator behavior, JSON-only signal parsing, current target resolution, next target selection, and publishing daemon rollback failure signals.
    • Adds shared agentbinary coverage for validating installed binaries immediately after archive extraction.
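The sequential upgrade cases listed above can be modeled with a small state machine. This is a simplified sketch, not the PR's unit tests: it assumes a failed operation is one where the upgraded daemon passes preflight but fails after the switch, so recovery moves current back to last-good. All names are hypothetical.

```go
package main

import "fmt"

type slot string

const (
	blue  slot = "blue"
	green slot = "green"
)

// state tracks which slot each symlink points at.
type state struct {
	current  slot
	lastGood slot
}

// inactive returns the slot that is not currently active.
func inactive(s slot) slot {
	if s == blue {
		return green
	}
	return blue
}

// apply models one AgentUpgrade. A healthy upgrade flips current to the
// inactive slot and records the old slot as last-good. An unhealthy one is
// rolled back, so current returns to the previous slot.
func apply(st state, healthy bool) state {
	switched := state{current: inactive(st.current), lastGood: st.current}
	if healthy {
		return switched
	}
	return state{current: switched.lastGood, lastGood: switched.lastGood}
}

// run applies the outcomes in order, starting from a freshly bootstrapped
// host where both links point at the blue slot.
func run(outcomes ...bool) state {
	st := state{current: blue, lastGood: blue}
	for _, healthy := range outcomes {
		st = apply(st, healthy)
	}
	return st
}

func main() {
	fmt.Println("success/success:", run(true, true).current)
	fmt.Println("success/failure:", run(true, false).current)
	fmt.Println("failure/failure:", run(false, false).current)
	fmt.Println("failure/success:", run(false, true).current)
}
```

In this model every sequence ends with current pointing at the slot of the last healthy binary, which is the invariant the sequential coverage is meant to protect.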

Example operation:

apiVersion: unbounded-kube.io/v1alpha3
kind: MachineOperation
metadata:
  name: upgrade-agent-on-worker-1
spec:
  machineRef: worker-1
  operationKind: AgentUpgrade
  parameters:
    downloadURL: https://example.com/releases/unbounded-agent-linux-amd64.tar.gz

Copilot AI assigned Copilot and bcho May 6, 2026
Copilot AI linked an issue May 6, 2026 that may be closed by this pull request
Copilot AI and others added 2 commits May 6, 2026 19:56
Copilot AI changed the title [WIP] Implement blue-green update for the host daemon systemd unit Add blue-green AgentUpgrade for host daemon binary May 6, 2026
Copilot AI requested a review from bcho May 6, 2026 20:02
Copilot finished work on behalf of bcho May 6, 2026 20:02
Member

@bcho bcho left a comment

also update e2e test to validate the upgrade and rollback scenarios

Comment thread cmd/agent/internal/daemon/agentupgrade.go Outdated
Comment thread cmd/agent/internal/daemon/agentupgrade.go Outdated
Copilot AI and others added 2 commits May 6, 2026 20:16
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/c95ee777-56f9-4c29-9f15-326834a9cec4

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
Contributor Author

Copilot AI commented May 6, 2026

also update e2e test to validate the upgrade and rollback scenarios

Added agent Kind e2e workflow steps and harness commands for successful AgentUpgrade and rollback from a failing upgraded daemon in 1b4ac89.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 192.0.2.1
    • Triggering command: /tmp/go-build4056271506/b1336/controller.test /tmp/go-build4056271506/b1336/controller.test -test.testlogfile=/tmp/go-build4056271506/b1336/testlog.txt -test.paniconexit0 -test.timeout=10m0s hTk3cTrSl83re-wIrJGT/hTk3cTrSl83re-wIrJGT -goversion go1.26.2 -c=4 -race -nolocalimports -importcfg /tmp/go-build4056271506/b1340/importcfg -o 6271506/b986/_pk-p mpile ux-amd64/pkg/tool/linux_amd64/vet -p s/go-mtree/xattr-o t ux-amd64/pkg/too-trimpath (packet block)
  • invalid-ip
    • Triggering command: /tmp/go-build4056271506/b1336/controller.test /tmp/go-build4056271506/b1336/controller.test -test.testlogfile=/tmp/go-build4056271506/b1336/testlog.txt -test.paniconexit0 -test.timeout=10m0s hTk3cTrSl83re-wIrJGT/hTk3cTrSl83re-wIrJGT -goversion go1.26.2 -c=4 -race -nolocalimports -importcfg /tmp/go-build4056271506/b1340/importcfg -o 6271506/b986/_pk-p mpile ux-amd64/pkg/tool/linux_amd64/vet -p s/go-mtree/xattr-o t ux-amd64/pkg/too-trimpath (dns block)

Copilot finished work on behalf of bcho May 6, 2026 20:21
Copilot AI requested a review from bcho May 6, 2026 20:21
Member

@bcho bcho left a comment

Can we confirm whether there is a race issue when two agent upgrade operations run in sequence? Consider the following scenarios for ops A and B:

  • A: succeeded, B: succeeded
  • A: succeeded, B: failed
  • A: failed, B: failed/succeeded

Comment thread cmd/agent/internal/daemon/lifecycle.go Outdated
Comment thread pkg/agent/agentbinary/agentbinary.go Outdated
Comment thread pkg/agent/goalstates/constants.go Outdated
Member

@bcho bcho left a comment

e2e test is failing, please investigate

Comment thread cmd/agent/internal/daemon/agentupgrade.go
Contributor Author

Copilot AI commented May 6, 2026

Can we confirm whether there is a race issue when two agent upgrade operations run in sequence? Consider the following scenarios for ops A and B:

  • A: succeeded, B: succeeded...

Confirmed with sequential unit coverage in 424c334 for success/success, success/failure, failure/failure, and failure/success. The controller is also configured with MaxConcurrentReconciles: 1, and aee80f3 records completion before restarting the daemon to avoid the e2e timeout race.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 192.0.2.1
    • Triggering command: /tmp/go-build3516031582/b1336/controller.test /tmp/go-build3516031582/b1336/controller.test -test.testlogfile=/tmp/go-build3516031582/b1336/testlog.txt -test.paniconexit0 -test.timeout=10m0s dPTM2tKMthq5xRDm2aBf/dPTM2tKMthq5xRDm2aBf -goversion go1.26.2 -c=4 -race -nolocalimports -importcfg /tmp/go-build3516031582/b1376/importcfg -o 6031582/b992/_pk-p -tools@v0.4.8/simain ux-amd64/pkg/tool/linux_amd64/vet -p oss/blackfriday/-o -tools@v0.4.8/si/tmp/go-build3516031582/b1377/_pkg_.a ux-amd64/pkg/too-trimpath (packet block)
  • invalid-ip
    • Triggering command: /tmp/go-build3516031582/b1336/controller.test /tmp/go-build3516031582/b1336/controller.test -test.testlogfile=/tmp/go-build3516031582/b1336/testlog.txt -test.paniconexit0 -test.timeout=10m0s dPTM2tKMthq5xRDm2aBf/dPTM2tKMthq5xRDm2aBf -goversion go1.26.2 -c=4 -race -nolocalimports -importcfg /tmp/go-build3516031582/b1376/importcfg -o 6031582/b992/_pk-p -tools@v0.4.8/simain ux-amd64/pkg/tool/linux_amd64/vet -p oss/blackfriday/-o -tools@v0.4.8/si/tmp/go-build3516031582/b1377/_pkg_.a ux-amd64/pkg/too-trimpath (dns block)

Copilot finished work on behalf of bcho May 6, 2026 20:53
Copilot AI requested a review from bcho May 6, 2026 20:53
Copilot AI and others added 2 commits May 6, 2026 20:57
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/112cde94-4091-4914-b11f-82988ba56a14

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
Contributor Author

Copilot AI commented May 6, 2026

e2e test is failing, please investigate

Investigated in 74db774. The prior failure was the AgentUpgrade completion timeout; the latest runnable agent e2e job I checked completed successfully, including both Validate agent upgrade operation and Validate agent upgrade rollback. Newer workflow runs are action_required with no jobs yet, so there are no new failure logs to debug.

Copilot AI requested a review from bcho May 7, 2026 05:20
Copilot finished work on behalf of bcho May 7, 2026 05:21
Comment thread internal/provision/script_test.go Outdated
Comment thread cmd/agent/internal/daemon/agentupgrade.go Outdated
Copilot finished work on behalf of bcho May 7, 2026 17:17
Copilot AI requested a review from bcho May 7, 2026 17:17
Copilot AI and others added 2 commits May 7, 2026 17:22
Contributor Author

Copilot AI commented May 7, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 192.0.2.1
    • Triggering command: /tmp/go-build2108564134/b1336/controller.test /tmp/go-build2108564134/b1336/controller.test -test.testlogfile=/tmp/go-build2108564134/b1336/testlog.txt -test.paniconexit0 -test.timeout=10m0s aEajS2A7ckdYPf1vn6sy/aEajS2A7ckdYPf1vn6sy -goversion go1.26.2 -c=4 -race -nolocalimports -importcfg /tmp/go-build2108564134/b1385/importcfg -o 8564134/b1006/_p-p mpile ux-amd64/pkg/tool/linux_amd64/vet -p /util/retry t ux-amd64/pkg/too-trimpath (packet block)
  • invalid-ip
    • Triggering command: /tmp/go-build2108564134/b1336/controller.test /tmp/go-build2108564134/b1336/controller.test -test.testlogfile=/tmp/go-build2108564134/b1336/testlog.txt -test.paniconexit0 -test.timeout=10m0s aEajS2A7ckdYPf1vn6sy/aEajS2A7ckdYPf1vn6sy -goversion go1.26.2 -c=4 -race -nolocalimports -importcfg /tmp/go-build2108564134/b1385/importcfg -o 8564134/b1006/_p-p mpile ux-amd64/pkg/tool/linux_amd64/vet -p /util/retry t ux-amd64/pkg/too-trimpath (dns block)

Copilot finished work on behalf of bcho May 7, 2026 17:27
Comment thread cmd/agent/internal/daemon/controller_machineoperation.go Outdated
Comment thread pkg/agent/agentbinary/agentbinary.go Outdated
@bcho bcho marked this pull request as ready for review May 7, 2026 17:33
@bcho bcho requested a review from a team May 7, 2026 17:33
Contributor Author

Copilot AI commented May 7, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 192.0.2.1
    • Triggering command: /tmp/go-build2868148237/b1336/controller.test /tmp/go-build2868148237/b1336/controller.test -test.testlogfile=/tmp/go-build2868148237/b1336/testlog.txt -test.paniconexit0 -test.timeout=10m0s N6GPtrHm_IYr2TJzuGiu/N6GPtrHm_IYr2TJzuGiu -goversion go1.26.2 -c=4 -race -nolocalimports -importcfg /tmp/go-build2868148237/b1368/importcfg -o 8148237/b995/_pk-p mpile ux-amd64/pkg/tool/linux_amd64/vet -p o/v2/internal/co-o t ux-amd64/pkg/too-trimpath (packet block)
  • invalid-ip
    • Triggering command: /tmp/go-build2868148237/b1336/controller.test /tmp/go-build2868148237/b1336/controller.test -test.testlogfile=/tmp/go-build2868148237/b1336/testlog.txt -test.paniconexit0 -test.timeout=10m0s N6GPtrHm_IYr2TJzuGiu/N6GPtrHm_IYr2TJzuGiu -goversion go1.26.2 -c=4 -race -nolocalimports -importcfg /tmp/go-build2868148237/b1368/importcfg -o 8148237/b995/_pk-p mpile ux-amd64/pkg/tool/linux_amd64/vet -p o/v2/internal/co-o t ux-amd64/pkg/too-trimpath (dns block)

Copilot finished work on behalf of bcho May 7, 2026 17:39
Copilot AI requested a review from bcho May 7, 2026 17:39
@bcho bcho enabled auto-merge (squash) May 7, 2026 23:59
@bcho bcho merged commit e6d7b4d into main May 8, 2026
21 checks passed
@bcho bcho deleted the copilot/agent-upgrade-blue-green-process branch May 8, 2026 00:02
Development

Successfully merging this pull request may close these issues.

[agent] blue-green update for the host daemon systemd unit

3 participants