feat: use ASG lifecycle hook to prevent 502s during instance startup #2058

@mattgodbolt-molty

Problem

New CE instances consistently serve 502s during startup because nginx starts in ~8 seconds but CE (Node.js) takes ~70 seconds to fully initialise. During this window, 5 ALB health-check nodes hit the instance every 10 seconds. After 3 consecutive failures (30s), the ALB marks the instance unhealthy and removes it from the pool.
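The timing above can be worked through as a small calculation. This is only a sketch using the approximate figures quoted in this issue, and it assumes health checks begin failing as soon as nginx is up; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Approximate figures from the issue text.
nginx_ready_s=8        # nginx accepts connections
ce_ready_s=70          # CE (Node.js) fully initialised
check_interval_s=10    # ALB health-check interval
unhealthy_threshold=3  # consecutive failures before "unhealthy"
alb_nodes=5            # ALB nodes probing the instance

# Window in which nginx answers but CE behind it is not up yet.
window_s=$(( ce_ready_s - nginx_ready_s ))
# Time from the first failing probe to the unhealthy verdict (3 x 10s).
to_unhealthy_s=$(( unhealthy_threshold * check_interval_s ))
# How long before CE is ready the instance has already been marked unhealthy.
early_by_s=$(( window_s - to_unhealthy_s ))
# Rough count of failing probes in the window (integer division rounds down).
failed_probes=$(( window_s / check_interval_s * alb_nodes ))

echo "window with nginx up but CE down: ${window_s}s"
echo "unhealthy after: ${to_unhealthy_s}s of failures, ${early_by_s}s before CE is ready"
echo "failing probes in the window (approx): ${failed_probes}"
```

With these inputs the window is 62 seconds, so the 30-second unhealthy threshold is crossed roughly 32 seconds before CE can answer.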

This happens on every new instance launch (spot replacement, scale-out, deploy). The instance recovers once CE is ready, but:

  • Users may see 502s during the window
  • The ALB removes the instance from pool before it can serve traffic
  • It makes startup metrics misleading (instance appears unhealthy briefly on every launch)

Investigated via SolarWinds logs, using a timeline from 2026-04-09T12:44Z (instance ip-172-30-1-83 / i-01d77072fb0c93f40).

The existing health check grace period only prevents the ASG from terminating the instance; it doesn't prevent the ALB from marking it unhealthy and dropping it from the pool. That's a commonly misunderstood distinction.

Solution: ASG Launch Lifecycle Hook

This is the AWS-recommended pattern for this exact problem. A launch lifecycle hook holds the instance in the Pending:Wait state until the application signals it's ready. While in Pending:Wait, the instance is not registered with the ALB target group: no health checks, no traffic, no 502s.

Flow

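Happy path: the instance launches and enters Pending:Wait (not yet registered with the ALB), CE starts, the readiness signal completes the hook, and the instance moves to InService and joins the target group.

Failure path (CE never starts): the instance launches and enters Pending:Wait, no readiness signal arrives, the hook times out, and the instance is terminated without ever joining the pool.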

Changes required

1. infra (Terraform): add a launch lifecycle hook to each ASG.

2. CE software (compiler-explorer or infra/start.sh) — signal readiness:

In start.sh, after CE is confirmed listening on port 10240, call complete-lifecycle-action with a CONTINUE result.

The instance IAM role needs the permissions this call requires, including autoscaling:CompleteLifecycleAction.

Alternatively, CompleteLifecycleAction can be called from CE's Node.js startup code via the AWS SDK, once the server is listening.
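The shell variant of the readiness signal could look roughly like the following. It is a minimal sketch, not the actual start.sh change: the function names, the ASG and hook names passed in, and the timeout are all hypothetical, and it assumes the AWS CLI is installed with a region configured (for example via `AWS_DEFAULT_REGION`) and that the instance metadata service (IMDSv2) is reachable.

```shell
#!/usr/bin/env bash
# Sketch only: function names and arguments are illustrative assumptions.

# Poll until something accepts TCP connections on host:port, or give up.
# Uses bash's /dev/tcp pseudo-device, so no extra tools are needed.
wait_for_port() {
  local host=$1 port=$2 timeout_s=$3 waited=0
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    [ "$waited" -ge "$timeout_s" ] && return 1
    sleep 1
    waited=$(( waited + 1 ))
  done
}

# Fetch this instance's ID from the EC2 instance metadata service (IMDSv2).
instance_id() {
  local token
  token=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  curl -sf -H "X-aws-ec2-metadata-token: ${token}" \
    "http://169.254.169.254/latest/meta-data/instance-id"
}

# Tell the ASG that the launch hook may proceed for this instance.
signal_ready() {
  local asg_name=$1 hook_name=$2 id
  id=$(instance_id) || return 1
  aws autoscaling complete-lifecycle-action \
    --auto-scaling-group-name "${asg_name}" \
    --lifecycle-hook-name "${hook_name}" \
    --instance-id "${id}" \
    --lifecycle-action-result CONTINUE
}

# Possible use near the end of start.sh (names and timeout are placeholders):
#   wait_for_port 127.0.0.1 10240 180 && signal_ready "$ASG_NAME" "$HOOK_NAME"
```

If CE never starts listening, nothing is sent and the hook's own timeout takes care of the instance. While the software is deployed but the hook does not exist yet, the call is expected to return an error, so a caller would likely want to log that and carry on rather than abort startup.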

⚠️ Rollout order (critical)

The Terraform and software changes must be deployed in order:

  1. First: deploy the software change (start.sh or CE code that calls complete-lifecycle-action)

    • Until this is deployed, the lifecycle hook would cause new instances to hang in Pending:Wait until the 3-minute timeout, then get terminated, breaking all deployments
    • This must be deployed to production and confirmed working before step 2
  2. Then: add the lifecycle hook in Terraform

    • Once the software knows how to signal readiness, adding the hook is safe

Testing on staging

Staging should work fine as a test environment — the same infrastructure pattern applies. Suggested test sequence:

  1. Add lifecycle hook to the staging ASGs only
  2. Deploy the software change to staging
  3. Trigger a new instance launch (scale up/down or terminate an instance)
  4. Verify: new instance enters Pending:Wait, CE starts, the readiness signal fires, instance joins pool with no 502s
  5. Verify failure case: temporarily break CE startup (e.g. point at wrong tarball), confirm instance times out and is terminated rather than joining the pool
  6. Roll out to production once staging validates
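The verification steps above could be watched from a workstation with a small polling helper. This is a sketch under assumptions: the function names are hypothetical, the instance ID would be whatever instance the test launches, and the AWS CLI is assumed to be configured for the right account and region.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# Print the ASG lifecycle state of one instance, e.g. Pending:Wait or InService.
lifecycle_state() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$1" \
    --query 'AutoScalingInstances[0].LifecycleState' \
    --output text
}

# Poll until the instance reports the wanted state; fail after the given
# number of tries. Each observed state is printed for the test log.
wait_for_state() {
  local id=$1 want=$2 tries=$3 delay_s=${4:-5} state
  while [ "$tries" -gt 0 ]; do
    state=$(lifecycle_state "$id")
    echo "${id}: ${state}"
    [ "$state" = "$want" ] && return 0
    tries=$(( tries - 1 ))
    [ "$tries" -gt 0 ] && sleep "$delay_s"
  done
  return 1
}

# Possible use during the staging test (instance ID is a placeholder):
#   wait_for_state "$NEW_INSTANCE_ID" "Pending:Wait" 12
#   wait_for_state "$NEW_INSTANCE_ID" "InService" 36
```

For the failure case, the second call would be expected to return non-zero because the instance never reaches InService.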

Notes

  • The Windows prod ASG would need the same treatment, but the signal mechanism in its startup script will differ (PowerShell)
  • The hook's heartbeat timeout should be tuned based on observed startup times (currently ~70s p50, recommend 2.5× headroom)
  • The health check grace period can remain as-is or be reduced once the hook is in place; it's now redundant but harmless
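As a quick check of the headroom recommendation in the notes above, the arithmetic can be written out as follows. The variable names are illustrative, and the inputs are the approximate figures from this issue.

```shell
#!/usr/bin/env bash
p50_startup_s=70   # observed median CE startup, from the issue
headroom_x10=25    # 2.5x headroom, scaled by 10 to stay in integer arithmetic

# Suggested hook timeout: median startup time multiplied by the headroom factor.
heartbeat_timeout_s=$(( p50_startup_s * headroom_x10 / 10 ))
echo "heartbeat timeout: ${heartbeat_timeout_s}s"
```

This gives 175 seconds, which is in line with the roughly 3-minute timeout mentioned in the rollout section.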

(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)
