feat: use ASG lifecycle hook to prevent 502s during instance startup #2058

## Problem

New CE instances consistently serve 502s during startup because nginx starts in ~8 seconds but CE (Node.js) takes ~70 seconds to fully initialise. During this window, 5 ALB health-check nodes hit the instance every 10 seconds. After 3 consecutive failures (30s), the ALB marks the instance unhealthy and removes it from the pool.
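The arithmetic behind that window can be checked with a few lines of shell. This is only a worked example using the figures quoted above; the variable names are illustrative and nothing here is measured.

```shell
#!/usr/bin/env bash
# Figures quoted in the Problem section; variable names are illustrative.
nginx_ready_s=8        # nginx starts accepting connections
ce_ready_s=70          # CE (Node.js) finishes initialising
check_interval_s=10    # ALB health-check interval
unhealthy_threshold=3  # consecutive failures before "unhealthy"
alb_nodes=5            # ALB nodes probing the instance

# Seconds during which nginx answers but CE does not.
window_s=$(( ce_ready_s - nginx_ready_s ))
# Failed checks one ALB node fits into that window (integer division).
fails_per_node=$(( window_s / check_interval_s ))
# Time one node needs to reach the unhealthy threshold.
time_to_unhealthy_s=$(( unhealthy_threshold * check_interval_s ))
# Failed probes across all nodes during the window.
total_failed_probes=$(( fails_per_node * alb_nodes ))

echo "502 window: ${window_s}s"
echo "failed checks per node: ${fails_per_node} (threshold ${unhealthy_threshold})"
echo "time to unhealthy: ${time_to_unhealthy_s}s"
echo "failed probes across nodes: ${total_failed_probes}"
```

With these inputs the window is 62 seconds and each node fits six failing checks into it, twice the threshold of three, so the instance is reliably marked unhealthy before CE is ready.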

This happens on **every** new instance launch (spot replacement, scale-out, deploy). The instance recovers once CE is ready, but:
- Users may see 502s during the window
- The ALB removes the instance from pool before it can serve traffic
- It makes startup metrics misleading (instance appears unhealthy briefly on every launch)

Investigated via SolarWinds logs — example timeline from 2026-04-09T12:44Z (instance ip-172-30-1-83 / i-01d77072fb0c93f40):


The existing `health_check_grace_period` only prevents the **ASG from terminating** the instance; it doesn't prevent the **ALB from marking it unhealthy** and dropping it from the pool. That's a commonly misunderstood distinction.

## Solution: ASG Launch Lifecycle Hook

This is the AWS-recommended pattern for this exact problem. An `autoscaling:EC2_INSTANCE_LAUNCHING` lifecycle hook holds the instance in the `Pending:Wait` state until the application signals it's ready. While in `Pending:Wait`, the instance is **not registered to the ALB target group**: no health checks, no traffic, no 502s.

### Flow

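**Happy path:**

1. The ASG launches the instance and the hook holds it in `Pending:Wait`
2. nginx and CE start while the instance is still unregistered from the ALB target group
3. Once CE is listening on port 10240, the instance calls `complete-lifecycle-action`
4. The instance moves to `InService` and joins the target group, already able to serve traffic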

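**Failure path (CE never starts):**

1. The ASG launches the instance and the hook holds it in `Pending:Wait`
2. CE never comes up, so `complete-lifecycle-action` is never called
3. The hook times out (3 minutes) and the instance is terminated without ever joining the pool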

### Changes required

**1. infra (Terraform) — add lifecycle hook to each ASG:**


**2. CE software (compiler-explorer or infra/start.sh) — signal readiness:**

In `start.sh`, after CE is confirmed listening on port 10240:
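A minimal sketch of what that signalling step could look like, written as functions that `start.sh` could define and call. Only port 10240 comes from this issue: the hook name, the health URL, the retry budget and every function name are assumptions for illustration. The AWS CLI must also be able to resolve a region (for example via `AWS_DEFAULT_REGION`).

```shell
#!/usr/bin/env bash
# Sketch for start.sh. HOOK_NAME and HEALTH_URL are assumptions, not values
# taken from the existing infra; only port 10240 comes from the issue text.
CE_PORT=10240
HOOK_NAME="${HOOK_NAME:-ce-launch-hook}"                   # hypothetical name
HEALTH_URL="${HEALTH_URL:-http://localhost:${CE_PORT}/}"   # assumed URL
IMDS="http://169.254.169.254/latest"

# Poll CE until it answers; $1 = attempts, $2 = seconds between attempts.
wait_for_ce() {
    local attempts=$1 delay=$2 i
    for (( i = 0; i < attempts; i++ )); do
        if curl --silent --fail --max-time 2 --output /dev/null "$HEALTH_URL"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Look up this instance's ID via IMDSv2 (token first, then the metadata call).
instance_id() {
    local token
    token=$(curl --silent --fail --max-time 2 -X PUT "${IMDS}/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
    curl --silent --fail --max-time 2 \
        -H "X-aws-ec2-metadata-token: ${token}" "${IMDS}/meta-data/instance-id"
}

# Complete the launch hook so the ASG can move the instance to InService.
signal_ready() {
    local id=$1 asg=$2
    aws autoscaling complete-lifecycle-action \
        --lifecycle-hook-name "$HOOK_NAME" \
        --auto-scaling-group-name "$asg" \
        --instance-id "$id" \
        --lifecycle-action-result CONTINUE
}

# Wait for CE, then signal. Returns non-zero if CE never comes up; nothing is
# signalled in that case and the hook's timeout decides the outcome.
signal_when_ready() {
    local id asg
    id=$(instance_id) || return 1
    asg=$(aws autoscaling describe-auto-scaling-instances \
        --instance-ids "$id" \
        --query 'AutoScalingInstances[0].AutoScalingGroupName' \
        --output text) || return 1
    wait_for_ce 90 2 || return 1
    signal_ready "$id" "$asg"
}
```

`start.sh` would then call `signal_when_ready` in the background once CE has been launched. The sketch deliberately sends nothing on failure, so a broken instance falls through to the hook timeout described in the failure path.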


The instance IAM role needs  and .

Alternatively, `CompleteLifecycleAction` can be called from CE's Node.js startup code via the AWS SDK, once the server is listening.

### ⚠️ Rollout order (critical)

**The Terraform and software changes must be deployed in order:**

1. **First: deploy the software change** (start.sh or CE code that calls `complete-lifecycle-action`)
   - If the hook exists before this is deployed, new instances hang in `Pending:Wait` until the 3-minute timeout and are then terminated, breaking all deployments
   - This must be deployed to production and confirmed working before step 2

2. **Then: add the lifecycle hook in Terraform**
   - Once the software knows how to signal readiness, adding the hook is safe

### Testing on staging

Staging should work fine as a test environment — the same infrastructure pattern applies. Suggested test sequence:

1. Add the lifecycle hook to the staging ASGs only
2. Deploy the software change to staging
3. Trigger a new instance launch (scale up/down or terminate an instance)
4. Verify: new instance enters `Pending:Wait`, CE starts, `complete-lifecycle-action` fires, instance joins pool with no 502s
5. Verify failure case: temporarily break CE startup (e.g. point at wrong tarball), confirm the instance times out and is terminated rather than joining the pool
6. Roll out to production once staging validates
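Steps 4 and 5 can be observed from a workstation by polling the instance's lifecycle state. The following is a sketch under assumptions: the function names, the polling budget and the example instance ID are illustrative, and the caller needs credentials that allow `autoscaling:DescribeAutoScalingInstances`.

```shell
#!/usr/bin/env bash
# Print the ASG lifecycle state of one instance (e.g. Pending:Wait, InService).
lifecycle_state() {
    aws autoscaling describe-auto-scaling-instances \
        --instance-ids "$1" \
        --query 'AutoScalingInstances[0].LifecycleState' \
        --output text
}

# Poll until the instance reaches the wanted state.
# $1 = instance ID, $2 = wanted state, $3 = attempts, $4 = seconds between polls.
wait_for_state() {
    local id=$1 wanted=$2 attempts=${3:-60} delay=${4:-5} state i
    for (( i = 0; i < attempts; i++ )); do
        state=$(lifecycle_state "$id") || state="unknown"
        echo "$(date -u +%H:%M:%SZ) ${id}: ${state}"
        if [[ "$state" == "$wanted" ]]; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}
```

For the happy path, `wait_for_state i-0123456789abcdef0 InService` (a placeholder ID) should log `Pending:Wait` for roughly the CE startup time before succeeding. For the failure case the same call should never succeed, because the instance is terminated before reaching `InService`.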

### Notes

- Windows prod would need the same treatment, but the signal mechanism will differ (PowerShell)
- The `heartbeat_timeout` should be tuned based on observed startup times (currently ~70s p50; recommend 2.5× headroom)
- The `health_check_grace_period` can remain as-is or be reduced once the hook is in place; it becomes redundant but harmless
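As a worked example of the headroom rule in the notes above, the timeout can be derived from the observed p50. The function name is illustrative, and rounding up to a whole minute is an assumption made here rather than something the issue specifies.

```shell
#!/usr/bin/env bash
# Heartbeat timeout from observed startup time.
# $1 = p50 startup in seconds
# $2 = headroom multiplier times 10 (25 means 2.5x), since bash arithmetic
#      is integer-only.
# The result is rounded up to a whole minute (an assumption of this sketch).
heartbeat_timeout_s() {
    local p50_s=$1 headroom_x10=$2 raw_s
    raw_s=$(( p50_s * headroom_x10 / 10 ))
    echo $(( (raw_s + 59) / 60 * 60 ))
}

heartbeat_timeout_s 70 25
```

With a 70-second p50 and 2.5× headroom the raw value is 175 seconds, which rounds up to 180 and lines up with the 3-minute timeout mentioned in the rollout section.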

*(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)*
