Argo Workflows-based lifecycle management for Kata Containers.
This chart installs a namespace-scoped WorkflowTemplate that performs controlled,
node-by-node upgrades of kata-deploy with verification and automatic rollback on failure.
- Kubernetes cluster with kata-deploy installed via Helm (chart 3.27.0 or higher required; the workflow uses
  `helm upgrade --install` and relies on the kata-deploy chart to set the DaemonSet `updateStrategy.type=OnDelete`)
- Argo Workflows v3.4+ installed before installing kata-lifecycle-manager (this chart only installs the
  WorkflowTemplate; it does not install Argo). Installation guide: Argo Workflows releases (not Argo CD)
- `helm` CLI and `argo` CLI (the Argo Workflows CLI, not `argocd`)
- Verification pod spec (see Verification Pod)
1. Install Argo Workflows first (if not already installed). See the Argo Workflows installation guide.
2. Install the chart from the OCI registry (published on GitHub Releases):
```sh
# Install latest (or pin a version with --version $version)
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo
```

For development from a local clone: `helm install kata-lifecycle-manager . --namespace argo`
A verification pod is required to validate each node after upgrade. The chart will fail to install without one.
Provide the verification pod when installing the chart:
```sh
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo \
  --set-file defaults.verificationPod=./my-verification-pod.yaml
```

This verification pod is baked into the WorkflowTemplate and used for all upgrades.
One-off override for a specific upgrade run. The pod spec must be base64-encoded because Argo workflow parameters don't handle multi-line YAML reliably:
```sh
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p verification-pod="$(base64 -w0 < ./my-verification-pod.yaml)"
```

Note: During `helm upgrade`, kata-deploy's own verification is disabled
(`--set verification.pod=""`). This is because kata-deploy's verification is
cluster-wide (designed for initial install), while kata-lifecycle-manager performs
per-node verification with proper placeholder substitution.
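Before submitting, the base64 round-trip can be sanity-checked locally. This is a sketch: the inline spec below stands in for your real `./my-verification-pod.yaml`, and `-w0` (disable line wrapping) assumes GNU coreutils `base64`.

```shell
# Stand-in for the contents of ./my-verification-pod.yaml
spec='apiVersion: v1
kind: Pod
metadata:
  name: ${TEST_POD}'

# Encode without line wrapping, as the workflow parameter expects (GNU base64).
encoded=$(printf '%s' "$spec" | base64 -w0)

# Decoding must reproduce the original spec byte-for-byte.
decoded=$(printf '%s' "$encoded" | base64 -d)
printf '%s\n' "$decoded"
```

If the decoded output does not match your file exactly, check for shell quoting issues before passing the value to `-p verification-pod=`.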
Create a pod spec that validates your Kata deployment. The pod should exit 0 on success, non-zero on failure.
Example (my-verification-pod.yaml):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ${TEST_POD}
spec:
  runtimeClassName: kata-qemu
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: ${NODE}
  tolerations:
    - operator: Exists
  containers:
    - name: verify
      image: quay.io/kata-containers/alpine-bash-curl:latest
      command:
        - sh
        - -c
        - |
          echo "=== Kata Verification ==="
          echo "Node: ${NODE}"
          echo "Kernel: $(uname -r)"
          echo "SUCCESS: Pod running with Kata runtime"
```

| Placeholder | Description |
|---|---|
| `${NODE}` | Node hostname being upgraded/verified |
| `${TEST_POD}` | Generated unique pod name |
You are responsible for:
- Setting the `runtimeClassName` in your pod spec
- Defining the verification logic in your container
- Using the exit code to indicate success (0) or failure (non-zero)
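The `${NODE}` and `${TEST_POD}` placeholders are substituted into the spec before the pod is created. The effect can be approximated locally with `sed` (a sketch only — the workflow's actual substitution mechanism may differ, and the node/pod names here are illustrative):

```shell
# Fragment of a verification pod spec containing the two placeholders.
template='metadata:
  name: ${TEST_POD}
spec:
  nodeSelector:
    kubernetes.io/hostname: ${NODE}'

node="worker-1"
test_pod="kata-verify-worker-1-abc12"

# Replace both placeholders; \$ keeps the literal "$" from being expanded by the shell.
rendered=$(printf '%s' "$template" \
  | sed -e "s|\${NODE}|$node|g" -e "s|\${TEST_POD}|$test_pod|g")
printf '%s\n' "$rendered"
```

After rendering, no `${...}` placeholders should remain in the spec.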
Failure modes detected:
- Pod stuck in `Pending`/`ContainerCreating` (runtime can't start VM)
- Pod crashes immediately (containerd/CRI-O configuration issues)
- Pod times out (resource issues, image pull failures)
- Pod exits with non-zero code (verification logic failed)
All of these trigger automatic rollback.
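A slightly richer verification body that follows the exit-code contract above. The kernel comparison is one possible heuristic (a Kata pod's guest kernel normally differs from the node's host kernel); both kernel strings and the function name are illustrative, not part of the chart:

```shell
# Exit-code contract: return 0 = verified, non-zero = failure (triggers rollback).
# guest_kernel would come from `uname -r` inside the pod; host_kernel is the
# node kernel you expect. Both values below are examples.
verify_kata() {
  guest_kernel="$1"
  host_kernel="$2"
  if [ -z "$guest_kernel" ]; then
    echo "FAIL: could not read guest kernel"
    return 1
  fi
  if [ "$guest_kernel" = "$host_kernel" ]; then
    echo "FAIL: guest kernel matches host; pod may not be running in a VM"
    return 1
  fi
  echo "SUCCESS: guest kernel ($guest_kernel) differs from host ($host_kernel)"
  return 0
}

verify_kata "6.1.62-kata" "5.15.0-generic" && result=ok || result=fail
echo "$result"
```

Whatever logic you choose, make the container's last command propagate the check's exit code, since that exit code is what drives success or rollback.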
Nodes can be selected using labels, taints, or both.
Option A: Label-based selection (default)
```sh
# Label nodes for upgrade
kubectl label node worker-1 katacontainers.io/kata-lifecycle-manager-window=true

# Trigger upgrade
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p node-selector="katacontainers.io/kata-lifecycle-manager-window=true"
```

Option B: Taint-based selection
```sh
# Taint nodes for upgrade
kubectl taint nodes worker-1 kata-lifecycle-manager=pending:NoSchedule

# Trigger upgrade using taint selector
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p node-taint-key=kata-lifecycle-manager \
  -p node-taint-value=pending
```

Option C: Combined selection
```sh
# Use both labels and taints for precise targeting
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p node-selector="node-pool=kata-pool" \
  -p node-taint-key=kata-lifecycle-manager
```

To run with the chart defaults and follow progress:

```sh
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0

# Watch progress
argo watch @latest
```

Nodes are upgraded sequentially (one at a time) to ensure fleet consistency. If any node fails verification, the workflow stops immediately and that node is rolled back. This prevents ending up with a mixed fleet where some nodes have the new version and others have the old version.
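The sequential flow can be pictured as a simple loop. The functions below are stubs standing in for the workflow's real kubectl/helm steps (the names are illustrative, not the WorkflowTemplate's actual step names):

```shell
# Stub steps; the real workflow runs kubectl/helm for each of these.
cordon()       { echo "cordon $1"; }
helm_upgrade() { echo "helm upgrade for $1"; }
restart_pod()  { echo "delete kata-deploy pod on $1"; }
verify()       { echo "verify $1"; }   # a non-zero exit here would mean failure
rollback()     { echo "rollback $1"; }
uncordon()     { echo "uncordon $1"; }

upgraded=""
for node in worker-1 worker-2 worker-3; do
  cordon "$node"
  helm_upgrade "$node"
  restart_pod "$node"
  if verify "$node"; then
    uncordon "$node"
    upgraded="$upgraded $node"
  else
    # Failure: roll back this node only and stop.
    # Previously verified nodes keep the new version.
    rollback "$node"
    uncordon "$node"
    break
  fi
done
echo "upgraded:$upgraded"
```

The `break` on failure is the key design point: no further nodes are touched, so a bad release can affect at most one node at a time.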
| Parameter | Description | Default |
|---|---|---|
| `argoNamespace` | Namespace for Argo resources | `argo` |
| `defaults.helmRelease` | kata-deploy Helm release name | `kata-deploy` |
| `defaults.helmNamespace` | kata-deploy namespace | `kube-system` |
| `defaults.nodeSelector` | Node label selector (optional if using taints) | `""` |
| `defaults.nodeTaintKey` | Taint key for node selection | `""` |
| `defaults.nodeTaintValue` | Taint value filter (optional) | `""` |
| `defaults.verificationNamespace` | Namespace for verification pods | `default` |
| `defaults.verificationPod` | Pod YAML for verification (required) | `""` |
| `defaults.drainEnabled` | Enable node drain before upgrade | `false` |
| `defaults.drainTimeout` | Timeout for drain operation | `300s` |
| `defaults.helmSetValues` | Extra `--set` values for `helm upgrade` (see Custom Image) | `""` |
| `images.utils` | Image with Helm 4 and kubectl (multi-arch) | `ghcr.io/kata-containers/lifecycle-manager-utils:latest` |
When submitting a workflow, you can override:
| Parameter | Description |
|---|---|
| `target-version` | **Required** - Target Kata version |
| `helm-release` | Helm release name |
| `helm-namespace` | Namespace of kata-deploy |
| `node-selector` | Label selector for nodes |
| `node-taint-key` | Taint key for node selection |
| `node-taint-value` | Taint value filter |
| `verification-namespace` | Namespace for verification pods |
| `verification-pod` | Pod YAML with placeholders |
| `drain-enabled` | Whether to drain nodes before upgrade |
| `drain-timeout` | Timeout for drain operation |
| `helm-set-values` | Extra `--set` values for `helm upgrade` (see Custom Image) |
For each node selected (via labels and/or taints):
- Prepare: Annotate the node with deploy status
- Cordon: Mark the node as unschedulable
- Drain (optional): Evict pods if `drain-enabled=true`
- Helm Upgrade: Run `helm upgrade --install` with `updateStrategy.type=OnDelete` (kata-deploy chart 3.27.0+ applies this)
  - This updates the DaemonSet spec but does NOT restart pods automatically
- Trigger Pod Restart: Delete the kata-deploy pod on THIS node only
  - This triggers recreation with the new image on just this node
- Wait: Wait for the new kata-deploy pod to be ready
- Verify: Run the verification pod and check its exit code
- On Success: Uncordon the node, proceed to the next node
- On Failure: Automatic rollback (`helm rollback` + pod restart), uncordon, workflow stops
True node-by-node control: By using updateStrategy: OnDelete, the workflow
ensures that only the current node's pod restarts. Other nodes continue running
the previous version until explicitly upgraded.
Nodes are processed sequentially (one at a time). If verification fails on any node, the workflow stops immediately, preventing a mixed-version fleet.
Default (drain disabled): Drain is not required for Kata upgrades. Running Kata VMs continue using the in-memory binaries. Only new workloads use the upgraded binaries.
Optional drain: Enable drain if you prefer to evict all workloads before any maintenance operation, or if your organization's operational policies require it:
```sh
# Enable drain when installing the chart
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo \
  --set defaults.drainEnabled=true \
  --set defaults.drainTimeout=600s \
  --set-file defaults.verificationPod=./my-verification-pod.yaml

# Or override at workflow submission time
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p drain-enabled=true \
  -p drain-timeout=600s
```

By default, the workflow upgrades kata-deploy using the official chart images for the
specified target-version. To deploy from a custom image (e.g., your own registry or
a custom build), pass extra --set values that override the kata-deploy chart's image
settings.
At workflow submission (one-off):
```sh
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p helm-set-values="image.repository=myregistry.io/kata-deploy,image.tag=my-custom-tag"
```

Baked into the chart (persistent default):

```sh
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo \
  --set-file defaults.verificationPod=./my-verification-pod.yaml \
  --set 'defaults.helmSetValues=image.repository=myregistry.io/kata-deploy\,image.tag=my-custom-tag'
```

The value uses standard Helm `--set` comma-separated syntax (`key1=val1,key2=val2`).
Any kata-deploy chart value can be overridden this way, not just the image.
Automatic rollback on verification failure: If the verification pod fails (non-zero exit), kata-lifecycle-manager automatically:
- Runs `helm rollback` to revert to the previous Helm release
- Waits for the kata-deploy DaemonSet to be ready with the previous version
- Uncordons the node
- Annotates the node with `rolled-back` status
This ensures nodes are never left in a broken state.
Manual rollback: For cases where you need to rollback a successfully upgraded node:
```sh
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  --entrypoint rollback-node \
  -p node-name=worker-1
```

Check node annotations to monitor upgrade progress:

```sh
kubectl get nodes \
  -L katacontainers.io/kata-lifecycle-manager-status \
  -L katacontainers.io/kata-current-version
```

| Annotation | Description |
|---|---|
| `katacontainers.io/kata-lifecycle-manager-status` | Current upgrade phase |
| `katacontainers.io/kata-current-version` | Version after successful upgrade |
Status values:
- `preparing` - Upgrade starting
- `cordoned` - Node marked unschedulable
- `draining` - Draining pods (only if `drain-enabled=true`)
- `upgrading` - Helm upgrade in progress
- `verifying` - Verification pod running
- `completed` - Upgrade successful
- `rolling-back` - Rollback in progress (automatic on verification failure)
- `rolled-back` - Rollback completed
The workflow uses `updateStrategy.type=OnDelete` to achieve true node-by-node control:
- Helm upgrade updates the DaemonSet spec but pods don't restart automatically
- The workflow explicitly deletes the kata-deploy pod on the current node
- Kubernetes recreates the pod with the new image on just that node
- Other nodes continue running the previous version until their turn
This ensures that if verification fails on Node B, Node A is still running the new version (verified working) while the workflow stops. No automatic cluster-wide rollback occurs unless explicitly triggered.
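The relevant piece of the DaemonSet manifest is small. This is a sketch of what the kata-deploy chart is described as setting; the field path follows the standard `apps/v1` DaemonSet schema:

```yaml
# apps/v1 DaemonSet excerpt: with OnDelete, updating spec.template does not
# restart running pods; each pod picks up the new template only when deleted.
spec:
  updateStrategy:
    type: OnDelete
```

Compare this with the default `RollingUpdate` strategy, under which a `helm upgrade` would restart kata-deploy pods on every node at once.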
Rollback behavior:
- On verification failure, `helm rollback` reverts the DaemonSet spec
- The pod on the failed node is deleted to restart with the previous version
- Already-verified nodes continue running the new version (their pods weren't touched)
Any project that uses the kata-deploy Helm chart can install this companion chart to get upgrade orchestration:
```sh
# Install kata-deploy
helm install kata-deploy oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy \
  --namespace kube-system

# Install kata-lifecycle-manager from the published chart (see GitHub Releases)
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo \
  --set-file defaults.verificationPod=./my-verification-pod.yaml

# Trigger upgrade
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0
```

Note: `target-version` must be 3.27.0 or higher; the workflow will fail at prerequisites otherwise.
Workflows use the repository owner for GHCR paths so you can test from a fork (e.g. fidencio/kata-lifecycle-manager):
1. Build the workflow image
   In your fork: Actions → "Build workflow image" → "Run workflow".
   This pushes `ghcr.io/<your-username>/lifecycle-manager-utils:latest`.

2. Release the chart (optional)
   Actions → "Release Helm Chart" → "Run workflow", set the version (e.g. `0.1.0-dev`).
   This pushes the chart to `ghcr.io/<your-username>/kata-lifecycle-manager-charts`.

3. Install from your fork

   ```sh
   helm install kata-lifecycle-manager \
     oci://ghcr.io/<your-username>/kata-lifecycle-manager-charts/kata-lifecycle-manager \
     --version 0.1.0-dev \
     --set-file defaults.verificationPod=./verification-pod.yaml \
     --set images.utils=ghcr.io/<your-username>/lifecycle-manager-utils:latest \
     --namespace argo
   ```
- Design Document - Architecture and design decisions
Apache License 2.0