Problem
New CE instances consistently serve 502s during startup because nginx starts in ~8 seconds but CE (Node.js) takes ~70 seconds to fully initialise. During this window, 5 ALB health-check nodes hit the instance every 10 seconds. After 3 consecutive failures (30s), the ALB marks the instance unhealthy and removes it from the pool.
This happens on every new instance launch (spot replacement, scale-out, deploy). The instance recovers once CE is ready, but:
- Users may see 502s during the window
- The ALB removes the instance from pool before it can serve traffic
- It makes startup metrics misleading (instance appears unhealthy briefly on every launch)
Investigated via SolarWinds logs — example timeline from 2026-04-09T12:44Z (instance ip-172-30-1-83 / i-01d77072fb0c93f40):
The existing ASG `health_check_grace_period` only prevents the ASG from terminating the instance; it doesn't prevent the ALB from marking it unhealthy and dropping it from the pool. That's a commonly misunderstood distinction.
Solution: ASG Launch Lifecycle Hook
This is the AWS-recommended pattern for this exact problem. A lifecycle hook holds the instance in the `Pending:Wait` state until the application signals it's ready. While in `Pending:Wait`, the instance is not registered with the ALB target group: no health checks, no traffic, no 502s.
Flow
Happy path: instance launches → enters `Pending:Wait` → nginx and CE start → CE signals readiness via `complete-lifecycle-action` with result `CONTINUE` → instance moves to `InService`, is registered with the ALB target group, and serves traffic with no 502s.
Failure path (CE never starts): no readiness signal arrives → the hook's 3-minute timeout expires → the default result (`ABANDON`) fires → the ASG terminates the instance and launches a replacement → the broken instance never joins the ALB pool.
Changes required
1. infra (Terraform) — add lifecycle hook to each ASG:
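A sketch of what the hook could look like in Terraform. The resource name, hook name, and ASG reference are placeholders; the 180-second timeout matches the 3-minute figure used elsewhere in this write-up:

```hcl
# Hypothetical example -- names are placeholders, adapt per ASG.
resource "aws_autoscaling_lifecycle_hook" "ce_ready" {
  name                   = "ce-ready-hook"
  autoscaling_group_name = aws_autoscaling_group.ce.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_LAUNCHING"
  heartbeat_timeout      = 180       # seconds; ~2.5x the ~70s p50 startup time
  default_result         = "ABANDON" # terminate if CE never signals ready
}
```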
2. CE software (compiler-explorer or infra/start.sh) — signal readiness:
In `infra/start.sh`, after CE is confirmed listening on port 10240:
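A minimal sketch of the readiness signal. The `ce-ready-hook` name, the `ASG_NAME` environment variable, and the `/healthcheck` path are assumptions, not existing names:

```shell
#!/bin/bash
set -euo pipefail

# Block until CE answers on its listen port (10240).
# The /healthcheck path is an assumption about CE's HTTP endpoints.
wait_for_ce() {
  local port="${1:-10240}" tries="${2:-90}"
  for ((i = 0; i < tries; i++)); do
    if curl -fsS "http://localhost:${port}/healthcheck" >/dev/null 2>&1; then
      return 0
    fi
    sleep 2
  done
  return 1
}

# Tell the ASG this instance is ready; it then moves to InService and is
# registered with the ALB. Hook name and ASG_NAME are placeholders.
signal_ready() {
  local token instance_id region
  # IMDSv2: fetch a session token, then the instance identity.
  token=$(curl -fsS -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  instance_id=$(curl -fsS -H "X-aws-ec2-metadata-token: ${token}" \
    "http://169.254.169.254/latest/meta-data/instance-id")
  region=$(curl -fsS -H "X-aws-ec2-metadata-token: ${token}" \
    "http://169.254.169.254/latest/meta-data/placement/region")
  aws autoscaling complete-lifecycle-action \
    --lifecycle-action-result CONTINUE \
    --lifecycle-hook-name "ce-ready-hook" \
    --auto-scaling-group-name "${ASG_NAME}" \
    --instance-id "${instance_id}" \
    --region "${region}"
}
```

`wait_for_ce && signal_ready` would run as the last step of start.sh; if CE never comes up, nothing is signalled and the hook's timeout fires instead.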
The instance IAM role needs `autoscaling:CompleteLifecycleAction` (and `autoscaling:RecordLifecycleActionHeartbeat`, if heartbeats are used to extend the timeout).
Alternatively, `CompleteLifecycleAction` can be called from CE's Node.js startup code via the AWS SDK, once the server is listening.
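If done from Node.js instead, a sketch using AWS SDK v3. The hook name, `ASG_NAME` variable, and where the instance ID comes from are all assumptions:

```javascript
// Hypothetical sketch: complete the launch lifecycle hook from CE's
// startup code once the HTTP server is listening. Names are placeholders.
const {
  AutoScalingClient,
  CompleteLifecycleActionCommand,
} = require("@aws-sdk/client-auto-scaling");

async function signalReady(instanceId, region) {
  const client = new AutoScalingClient({ region });
  await client.send(
    new CompleteLifecycleActionCommand({
      AutoScalingGroupName: process.env.ASG_NAME,
      LifecycleHookName: "ce-ready-hook",
      InstanceId: instanceId,
      LifecycleActionResult: "CONTINUE",
    })
  );
}

// e.g. server.listen(10240, () => signalReady(instanceId, region));
```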
⚠️ Rollout order (critical)
The Terraform and software changes must be deployed in order:
1. First: deploy the software change (start.sh or CE code that calls `complete-lifecycle-action`)
   - Until this is deployed, the lifecycle hook would cause new instances to hang in `Pending:Wait` until the 3-minute timeout, then get terminated, breaking all deployments
   - This must be deployed to production and confirmed working before step 2
2. Then: add the lifecycle hook in Terraform
   - Once the software knows how to signal readiness, adding the hook is safe
Testing on staging
Staging should work fine as a test environment — the same infrastructure pattern applies. Suggested test sequence:
- Add the lifecycle hook to the staging ASGs only
- Deploy the software change to staging
- Trigger a new instance launch (scale up/down or terminate an instance)
- Verify: new instance enters `Pending:Wait`, CE starts, `complete-lifecycle-action` fires, instance joins the pool with no 502s
- Verify the failure case: temporarily break CE startup (e.g. point at the wrong tarball), confirm the instance times out and is `ABANDON`ed (terminated) rather than joining the pool
- Roll out to production once staging validates
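The lifecycle state transitions above can be watched from the CLI while testing. The ASG name here is a placeholder:

```shell
# Watch the new instance's lifecycle state
# (expect Pending:Wait, then InService once CE signals).
aws autoscaling describe-auto-scaling-instances \
  --query 'AutoScalingInstances[].[InstanceId,AutoScalingGroupName,LifecycleState]' \
  --output table

# Confirm the hook exists on the staging ASG.
aws autoscaling describe-lifecycle-hooks \
  --auto-scaling-group-name staging
```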
Notes
- Windows prod would need the same treatment, but the signal mechanism will differ (PowerShell rather than start.sh)
- The hook's heartbeat timeout should be tuned based on observed startup times (currently ~70s p50; recommend 2.5× headroom, which is roughly the 3-minute timeout)
- The `health_check_grace_period` can remain as-is or be reduced once the hook is in place; it's now redundant but harmless
(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)