Add blue-green AgentUpgrade for host daemon binary #127
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/09199dde-d498-4fd1-a4d0-d878fdf19099 Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
bcho
left a comment
Also update the e2e test to validate the upgrade and rollback scenarios.
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/c95ee777-56f9-4c29-9f15-326834a9cec4 Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
In 1b4ac89, added agent Kind e2e workflow steps and harness commands covering a successful AgentUpgrade and a rollback from a failing upgraded daemon.
bcho
left a comment
Can we confirm whether there is a race condition when two agent upgrade operations run in sequence? Consider the following scenarios for operations A and B:
- A: succeeded, B: succeeded
- A: succeeded, B: failed
- A: failed, B: failed/succeeded
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/7a8119d9-3cfb-4aa3-821d-921a7ea952cd Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
bcho
left a comment
The e2e test is failing; please investigate.
Confirmed with sequential unit coverage in 424c334 for success/success, success/failure, failure/failure, and failure/success. The controller is also configured with
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/112cde94-4091-4914-b11f-82988ba56a14 Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
Investigated in 74db774. The prior failure was the AgentUpgrade completion timeout; the latest runnable agent e2e job I checked completed successfully, including both
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/2ad3ccca-4671-4562-a84a-19317ddd0fbe Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/920acaad-c356-47e2-ba72-5f74a6615c80 Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
Just as a heads up, I was blocked by some firewall rules while working on your feedback.
Agent-Logs-Url: https://github.com/Azure/unbounded/sessions/103e2b54-b156-48a4-879b-c4dafbee3331 Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
Adds host-side `AgentUpgrade` execution through `MachineOperation`, using `spec.parameters.downloadURL` as the source for the replacement agent binary. The daemon now keeps a last-known-good binary and installs systemd recovery wiring for rollback if the upgraded daemon fails to stay healthy.

MachineOperation handling
- `OperationAgentUpgrade` in the agent daemon instead of only queueing it.
- `spec.parameters["downloadURL"]`.
- `unbounded-agent-daemon.service`.
- `MachineOperation` by marking it `Failed` with reason `DaemonFailed`.

Blue-green daemon binary staging
- `goalstates.AgentUpgradePaths`.
- `goalstates.ResolvedAgentUpgradePaths()`, then computes the inactive blue/green target through `AgentUpgradePaths.NextTargetPath()`.
- `unbounded-agent` release tarball using shared agent binary/utilio helpers.
- `unbounded-agent version` from the shared `agentbinary` install path immediately after download/install and before switching symlinks.
- `agentbinary.InstallAndSwitchFromTarGz`.
  - `unbounded-agent-current` -> newly staged binary
  - `unbounded-agent-last-good` -> previous current binary
- `renameio.Symlink` through utilio for atomic symlink replacement.
- `goalstates.AgentUpgradePaths` struct to avoid leaking goal-state resolution details into daemon logic.

Systemd recovery path
- `unbounded-agent-current`.
- `unbounded-agent-daemon-recovery.service`.
- `current` back to `last-good` and restarts the daemon.
- `record-agent-upgrade-failure-signal` subcommand when recovery is triggered for an in-flight `AgentUpgrade`.
- `goalstates` constants to avoid path drift.
- `goalstates`.

AgentUpgrade signal management
- `AgentUpgradePaths.SignalPath` for both pending and failure states.
- `failureMessage` only for daemon rollback failure signals, while pending success signals omit it.
- `agentUpgradeSignalOperator` interface with a filesystem-backed implementation.

Install/bootstrap alignment
- `/usr/local/bin/unbounded-agent` and runs it directly.
- `current`, `last-good`, blue slot, and compatibility symlinks during daemon enablement.
- `current` and `last-good`.
- `/usr/local/bin/unbounded-agent` as a compatibility symlink to the current agent.

Design documentation
- `designs/agent-upgrade.md` describing the AgentUpgrade state machine, host paths, staging and symlink switching, signal file, successful startup publication, recovery path, failure cases, and sequential blue-green upgrades.

Validation coverage
- `AgentUpgrade`.
- `agentbinary` coverage for validating installed binaries immediately after archive extraction.

Example operation:
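
As a rough illustration of how an `AgentUpgrade` operation could move through the blue-green switch and rollback described above, here is a minimal in-memory Go sketch. It is not the PR's code: the `operation` and `slots` types, the slot names, the version strings, and the download URL are all hypothetical stand-ins, and no real `MachineOperation` schema, `goalstates` API, or filesystem symlink is involved. It only models the stated rules: stage into the inactive slot, point `current` at the new slot and `last-good` at the previous one, and switch `current` back to `last-good` if the new daemon is unhealthy.

```go
package main

import (
	"errors"
	"fmt"
)

// operation is a hypothetical stand-in for an AgentUpgrade MachineOperation.
type operation struct {
	Name       string
	Parameters map[string]string // expected to carry "downloadURL"
}

// slots models the blue/green binary slots plus the current and last-good links.
type slots struct {
	Versions map[string]string // slot name -> version staged in that slot
	Current  string            // slot the "current" link points at
	LastGood string            // slot the "last-good" link points at
}

// newSlots starts with a single known-good binary in the blue slot.
func newSlots(initial string) *slots {
	return &slots{
		Versions: map[string]string{"blue": initial},
		Current:  "blue",
		LastGood: "blue",
	}
}

// nextTarget returns the inactive slot, i.e. the one current does not point at.
func (s *slots) nextTarget() string {
	if s.Current == "blue" {
		return "green"
	}
	return "blue"
}

// apply stages version into the inactive slot and switches the links.
// When healthy is false it simulates recovery by pointing current back at last-good.
func (s *slots) apply(op operation, version string, healthy bool) error {
	if op.Parameters["downloadURL"] == "" {
		return errors.New("missing downloadURL parameter")
	}
	target := s.nextTarget()
	s.Versions[target] = version // stage the new binary
	s.LastGood = s.Current       // last-good -> previous current
	s.Current = target           // current -> newly staged binary
	if !healthy {
		s.Current = s.LastGood // recovery: roll back to last-good
		return fmt.Errorf("%s failed: rolled back to %s", op.Name, s.Versions[s.Current])
	}
	return nil
}

// runSequence applies operations A then B and reports the resulting versions.
func runSequence(aHealthy, bHealthy bool) (running, lastGood string) {
	s := newSlots("v1")
	// Placeholder URL for illustration only.
	params := map[string]string{"downloadURL": "https://example.invalid/unbounded-agent.tar.gz"}
	_ = s.apply(operation{Name: "A", Parameters: params}, "v2", aHealthy)
	_ = s.apply(operation{Name: "B", Parameters: params}, "v3", bHealthy)
	return s.Versions[s.Current], s.Versions[s.LastGood]
}

func main() {
	for _, c := range [][2]bool{{true, true}, {true, false}, {false, true}, {false, false}} {
		running, lastGood := runSequence(c[0], c[1])
		fmt.Printf("A healthy=%v, B healthy=%v: running=%s last-good=%s\n", c[0], c[1], running, lastGood)
	}
}
```

The four combinations mirror the sequential scenarios raised in review (A and B each succeeding or failing). In this model a failed upgrade always leaves the previously running version active, which is the property the rollback path is meant to preserve.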