Cold Start Time Analysis

Summary Comparison (2026-03-03)

Stage	Kata cold	Kata warm	runc cold	runc warm
Node	c6i.metal	c6i.metal	r6g.12xlarge	r6g.12xlarge
Arch	x86_64	x86_64	arm64	arm64
EC2 → Node Ready	55s	0s	41s	0s
kata-deploy	60s	0s	—	—
S3 restore	3s	3s	1s	1s
Image pull	41s	0s	29s	0s
OpenClaw Gateway	33s	32s	50s	50s
TOTAL	~5m	43s	2m 14s	54s

Note: OpenClaw Gateway starts ~18s slower on arm64 (Graviton) than x86_64 consistently across all tests. On an x86 node, runc warm pool would be ~35s.

Kata Cold Start (New Metal Node)

Test Environment

Date: 2026-03-03
Node: c6i.metal / us-west-2b / on-demand
S3 Data: state 37 files / 1.9 MB, workspace 20 files / 51 KB

Timeline

07:04:57  Karpenter: NodeClaim created
07:05:00  EC2 launched (c6i.metal)
07:05:35  Node registered                              +35s
07:05:54  Pod scheduled → FailedCreatePodSandBox
07:05:55  Node Ready                                   +55s
07:06:53  kata-deploy: start install
07:06:55  kata-deploy: containerd restarted, done      +60s from Node Ready
07:07:00  Sandbox retry → aws-cli pull (3s)
07:07:03  S3 restore done                               3s
07:09:15  OpenClaw image pull start              (132s gap)
07:09:53  OpenClaw image pulled                        38s
07:09:53  Containers started

Note: First pod was killed by idle timeout before reaching Ready (wake timed out at 4m, idle check terminated pod). Reconciler recreated on the same node with cached images — see Kata warm pool below.

Stage Breakdown (serial path)

#	Stage	Start → End	Duration	Notes
1	EC2 Launch → Node Ready	04:57 → 05:55	58s	Karpenter + kubelet bootstrap (metal)
2	kata-deploy install	06:53 → 06:55	2s	Artifacts copy + containerd restart
3	kata-deploy image pull	05:55 → 06:53	58s	DaemonSet container pull
4	Sandbox retries	05:56 → 07:00	64s	K8s backoff retry until kata ready
5	Init image pull + S3 restore	07:00 → 07:03	3s	aws-cli cached from DaemonSet pull
6	OpenClaw image pull	09:15 → 09:53	38s	1.13 GB, first pull on metal
7	Unexplained gap	07:03 → 09:15	132s	Between init done and main image pull

132s gap: Occurs between S3 restore completion and OpenClaw image pull start. Observed in both kata cold start tests. May be containerd settling after restart, or K8s image pull queue scheduling. Needs further investigation with kubelet/containerd debug logs.

Kata Warm Pool (Existing Metal Node)

Test Environment

Date: 2026-03-03
Node: c6i.metal (same node as cold start, images cached)

Timeline

07:18:15  Pod created + scheduled (instant)
07:18:18  S3 restore start
07:18:21  S3 restore done                               3s
07:18:22  OpenClaw started (pull 135ms, cached)
07:18:54  OpenClaw Gateway listening                    32s
07:18:58  Pod Ready ✅

TOTAL: 43s

#	Stage	Duration
1	Pod scheduling	0s
2	Kata VM boot + S3 restore	6s
3	OpenClaw Gateway startup	32s
4	Webhook + readiness probe	4s

runc Cold Start (New Node)

Test Environment

Date: 2026-03-03
Node: r6g.12xlarge / us-west-2c / spot

Timeline

07:43:01  Pod created
07:43:03  Karpenter: NodeClaim created
07:43:05  EC2 launched (r6g.12xlarge)
07:43:23  Node registered                              +20s
07:43:41  Pod scheduled
07:43:42  Node Ready                                   +39s
07:43:42  Init image pull start (aws-cli)
07:43:54  Init image pulled                            12s
07:43:54  S3 restore start
07:43:55  S3 restore done                               1s
07:44:06  OpenClaw image pull start
07:44:23  OpenClaw image pulled                        17s
07:44:23  Containers started
07:45:13  OpenClaw Gateway listening                   50s
07:45:15  Pod Ready ✅

TOTAL: 2m 14s

#	Stage	Start → End	Duration	Notes
1	EC2 Launch → Node Ready	43:01 → 43:42	41s	r6g.12xlarge spot
2	Init image pull (aws-cli)	43:42 → 43:54	12s	126 MB, first pull
3	S3 Restore	43:54 → 43:55	1s	37 files, 1.9 MB
4	OpenClaw image pull	44:06 → 44:23	17s	432 MB (arm64, smaller than x86)
5	OpenClaw Gateway startup	44:23 → 45:13	50s	arm64 Graviton
6	Webhook + Readiness probe	45:13 → 45:15	2s

Note: arm64 OpenClaw image is 432 MB vs 1.13 GB for x86_64 — pulls 2x faster.

runc Warm Pool (Existing Node)

Test Environment

Date: 2026-03-03
Node: r6g.12xlarge (same node as cold start, images cached)

Timeline

07:46:14  Pod created + scheduled (instant)
07:46:15  S3 restore start
07:46:16  S3 restore done                               1s
07:46:17  OpenClaw started (pull 123ms, cached)
07:47:07  OpenClaw Gateway listening                   50s
07:47:08  Pod Ready ✅

TOTAL: 54s

#	Stage	Duration
1	Pod scheduling	0s
2	S3 restore	1s
3	OpenClaw Gateway startup	50s
4	Webhook + readiness probe	1s

Key Findings

OpenClaw Gateway Startup: arm64 vs x86_64

Arch	Gateway Startup	Samples
x86_64 (c6i.metal)	32-34s	4 runs
arm64 (r6g.12xlarge)	50-54s	4 runs
Delta	~18s

This is the single largest performance difference between runc and Kata warm pool scenarios. The gateway startup is a black box — zero log output between container start and "listening" message.

Image Size: arm64 vs x86_64

Image	x86_64	arm64
OpenClaw	1.13 GB	432 MB
aws-cli	125 MB	126 MB

arm64 OpenClaw image is 2.6x smaller, resulting in faster pulls on cold start.

132s Gap in Kata Cold Start

Observed in both kata cold start tests: after S3 restore completes and before OpenClaw image pull begins, there is an unexplained ~130s gap. This does not occur in runc cold starts (only 11s gap). Likely related to containerd state after kata-deploy restarts it.

Bug Fixes Applied

The cold start tests exposed several issues:

Handler: Set tenant status to provisioning at wake start (not just for new tenants)
Reconciler: Skip orphan cleanup for pods belonging to provisioning tenants
Reconciler: syncProvisioningTenants promotes to running when pod becomes ready
Karpenter: consolidateAfter increased from 60s to 300s for kata-metal NodePool
IAM: Added table/tenant-registry/index/* to orchestrator role for GSI Query access

Optimization Opportunities

Optimization	Estimated Saving	Effort	Applies To
x86 nodes for runc	~18s (warm)	Low	runc
`ImagePullPolicy=IfNotPresent`	~30-60s (cold)	Low	Both
kata-deploy image pre-pull	~58s (kata cold)	Medium	Kata
Parallel S3 restore	~1.5s	Low	Both
Investigate OpenClaw Gateway startup	~20s?	Unknown	Both
Investigate 132s gap	~130s (kata cold)	Medium	Kata
runc mode (skip Kata entirely)	~3m 46s (cold)	Medium	New option
Warm pool (current)	eliminates EC2+pull	Done ✅	Both

Measurement Commands

# Kata cold start (no warm pool)
kubectl cordon <existing-kata-node>
kubectl run wake --rm -i --restart=Never --image=curlimages/curl -- \
  curl -s -X POST http://orchestrator.tenants.svc.cluster.local:8080/wake/<tenant>

# runc cold start (specific instance type)
# Create pod with nodeSelector: {node.kubernetes.io/instance-type: r6g.12xlarge}
# Karpenter provisions the node automatically

# Timestamps
kubectl get pod <pod> -n tenants -o json | python3 -c "
import json,sys; d=json.load(sys.stdin)
print('Created:', d['metadata']['creationTimestamp'])
[print(f'{c[\"type\"]:15s}', c['lastTransitionTime']) for c in d['status'].get('conditions',[])]
[print(f'Init: {s[\"startedAt\"]} → {s[\"finishedAt\"]}') for ic in d['status'].get('initContainerStatuses',[]) for s in [ic.get('state',{}).get('terminated',{})] if s]
[print(f'{cs[\"name\"]}: {s[\"startedAt\"]}') for cs in d['status'].get('containerStatuses',[]) for s in [cs.get('state',{}).get('running',{})] if s]
"

History

Date	Scenario	Node	Total
2026-03-02	Kata warm (before cleanup)	c8i.2xlarge	~45s
2026-03-02	Kata warm (after cleanup)	c8i.2xlarge	~38s
2026-03-03	Kata cold (new metal)	c6i.metal	~6m
2026-03-03	Kata warm	c6i.metal	43s
2026-03-03	runc cold (new node)	r6g.12xlarge	2m 14s
2026-03-03	runc warm	r6g.12xlarge	54s

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cold Start Time Analysis

Summary Comparison (2026-03-03)

Kata Cold Start (New Metal Node)

Test Environment

Timeline

Stage Breakdown (serial path)

Kata Warm Pool (Existing Metal Node)

Test Environment

Timeline

runc Cold Start (New Node)

Test Environment

Timeline

runc Warm Pool (Existing Node)

Test Environment

Timeline

Key Findings

OpenClaw Gateway Startup: arm64 vs x86_64

Image Size: arm64 vs x86_64

132s Gap in Kata Cold Start

Bug Fixes Applied

Optimization Opportunities

Measurement Commands

History

FilesExpand file tree

cold-start-analysis.md

Latest commit

History

cold-start-analysis.md

File metadata and controls

Cold Start Time Analysis

Summary Comparison (2026-03-03)

Kata Cold Start (New Metal Node)

Test Environment

Timeline

Stage Breakdown (serial path)

Kata Warm Pool (Existing Metal Node)

Test Environment

Timeline

runc Cold Start (New Node)

Test Environment

Timeline

runc Warm Pool (Existing Node)

Test Environment

Timeline

Key Findings

OpenClaw Gateway Startup: arm64 vs x86_64

Image Size: arm64 vs x86_64

132s Gap in Kata Cold Start

Bug Fixes Applied

Optimization Opportunities

Measurement Commands

History