fix(step-ca): prevent PID exhaustion from zombie health check processes #120

Open
rdemeritt wants to merge 1 commit into main from fix/step-ca-zombie-pids

@rdemeritt
Member

Summary

  • Root cause: the step-ca container accumulates ~14,758 zombie PIDs over ~25 hours of uptime. The health check uses CMD-SHELL, which forks /bin/sh, which in turn forks wget. Because the step-ca Go binary runs as PID 1 and does not reap children, each health check cycle leaves zombie processes behind. Eventually fork() returns EAGAIN and the container goes permanently unhealthy.
  • Fix 1 (init: true): Docker injects tini as PID 1, which reaps the zombie children left by health check forks. This is the permanent fix.
  • Fix 2 (CMD exec form): Switches the healthcheck from CMD-SHELL to the CMD exec form, eliminating the intermediate /bin/sh process and halving the number of forks per health check cycle.

Option A (container restart) was already applied directly on macbeth to restore health immediately.
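
For readers without the diff at hand, the two fixes above might look roughly like the following Compose fragment. This is a sketch under assumptions, not the PR's actual change: the service name `step-ca` is inferred from the container name `copper-rabbit-step-ca-1`, while the image name, the wget flags, and the health URL are illustrative guesses that do not appear in this description.

```yaml
services:
  step-ca:
    image: smallstep/step-ca   # assumed image; the PR description does not name it
    # Fix 1: Docker starts an init process (tini) as PID 1, which reaps
    # orphaned children so they do not linger as zombies.
    init: true
    healthcheck:
      # Fix 2: exec form. Docker runs wget directly, with no /bin/sh wrapper.
      # The earlier shell form would have looked like:
      #   test: ["CMD-SHELL", "wget ... || exit 1"]
      # The flags and URL below are assumptions for illustration only.
      test: ["CMD", "wget", "-q", "--spider", "--no-check-certificate", "https://localhost:9000/health"]
```

With the exec form there is no shell to interpret operators such as `||` or `&&`, so the check relies on wget's own exit status to signal failure.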

Test plan

  • Deploy to copper-rabbit stack on macbeth: ./scripts/start.sh -d --build
  • Confirm copper-rabbit-step-ca-1 shows (healthy) in docker ps
  • After 30+ minutes, verify that pids.current in the step-ca container's cgroup stays low (< 100)
  • Confirm Traefik ACME cert renewal still works (step-ca endpoint reachable)

Add init: true so tini reaps zombie children spawned by Docker's health
check mechanism. Switch healthcheck from CMD-SHELL to CMD (exec) form to
eliminate the intermediate /bin/sh child, halving the fork rate.

Without these fixes, step-ca accumulates ~14k zombie PIDs over ~25 hours
until fork() returns EAGAIN and the container goes permanently unhealthy.