Add instance restart policy #233

Merged
sjmiller609 merged 13 commits into main from hypeship/restart-policy
May 18, 2026

@sjmiller609 (Collaborator) commented May 16, 2026

Summary

  • add restart_policy and restart_status API fields with generated bindings
  • add a restart-policy controller for whole-instance restarts with backoff, max attempts, stable reset, and manual-stop suppression
  • stack on the healthcheck branch and restart unhealthy running instances through restart policy when policy is on_failure or always
  • document the healthcheck/restart-policy interaction
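
The controller rules listed above (backoff, max attempts, stable reset, manual-stop suppression) can be sketched as a single decision function. This is a minimal illustration, not the code in lib/restart-policy: the `Config` and `Status` types, their field names, the `never` policy value, and the doubling backoff are all assumptions made for the example. Only `on_failure` and `always` come from the PR text.

```go
package main

import "time"

// Policy holds the restart policy value. "on_failure" and "always" are named
// in the PR; "never" is an assumed name for the disabled case.
type Policy string

const (
	PolicyNever     Policy = "never" // assumption
	PolicyOnFailure Policy = "on_failure"
	PolicyAlways    Policy = "always"
)

// Config is a hypothetical shape for the restart_policy settings.
type Config struct {
	Policy       Policy
	MaxAttempts  int           // 0 means unlimited in this sketch
	BaseBackoff  time.Duration // delay before the first retry
	MaxBackoff   time.Duration // cap on the delay; 0 disables growth in this sketch
	StableWindow time.Duration // uptime after which the attempt counter resets
}

// Status is a hypothetical shape for the persisted restart_status.
type Status struct {
	Attempts   int
	ManualStop bool
}

// Decide reports whether an exited instance should be restarted, how long to
// wait first, and the status to persist. uptime is how long the instance ran
// before it exited.
func Decide(cfg Config, st Status, exitCode int, uptime time.Duration) (restart bool, delay time.Duration, next Status) {
	next = st
	// Manual-stop suppression: an explicit stop always wins.
	if st.ManualStop || cfg.Policy == PolicyNever || cfg.Policy == "" {
		return false, 0, next
	}
	// on_failure only reacts to non-zero exits.
	if cfg.Policy == PolicyOnFailure && exitCode == 0 {
		return false, 0, next
	}
	// Stable reset: a long enough run clears earlier failures.
	if cfg.StableWindow > 0 && uptime >= cfg.StableWindow {
		next.Attempts = 0
	}
	// Max attempts: give up once the budget is spent.
	if cfg.MaxAttempts > 0 && next.Attempts >= cfg.MaxAttempts {
		return false, 0, next
	}
	// Backoff: double the delay per prior attempt, capped at MaxBackoff.
	delay = cfg.BaseBackoff
	for i := 0; i < next.Attempts && delay < cfg.MaxBackoff; i++ {
		delay *= 2
	}
	if cfg.MaxBackoff > 0 && delay > cfg.MaxBackoff {
		delay = cfg.MaxBackoff
	}
	next.Attempts++
	return true, delay, next
}
```

Keeping the decision pure (no clocks, no I/O) makes each rule easy to test in isolation; a controller would call it with the persisted status and write back the returned one.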

Testing

  • go test ./lib/restart-policy ./lib/healthcheck ./lib/providers
  • go test -tags containers_image_openpgp ./lib/instances -run 'Test(ValidateCreateRequestHealthCheck|ValidateUpdateInstanceRequestAllowsRestartPolicyOnly|NormalizeRestartPolicyWrapsInvalidRequest|RestartStatusAfterPolicyUpdatePreservesManualStop|RestartStatusAfterPolicyUpdateClearsRetryState)$'
  • go test -tags containers_image_openpgp ./cmd/api/api -run 'Test(CreateInstance_MapsHealthCheckPolicy|CreateInstance_MapsRestartPolicy|UpdateInstance_MapsHealthCheckPatch|UpdateInstance_MapsRestartPolicyPatch|UpdateInstance_RejectsInvalidRestartPolicy)$'
  • go test -tags containers_image_openpgp ./cmd/api -run 'Test(StartImageRetentionControllerSkipsNilController|StartImageRetentionControllerStartsRunner|ConfigureOCICacheGCSkipsDisabled|ConfigureOCICacheGCRejectsInvalidInterval)$'

Not run: the full ./cmd/api/api package; it exercises broader lifecycle coverage that is outside the scope of this focused change.


Note

Medium Risk
Adds new whole-instance restart supervision that can stop/start instances automatically based on exit/health signals, touching lifecycle state transitions and background controllers; misconfiguration or logic bugs could cause unexpected restart loops or suppressed restarts.

Overview
Adds instance restart supervision via new restart_policy (config) and restart_status (runtime) fields across the OpenAPI spec/generated oapi types and the instances API (CreateInstance, UpdateInstance, GetInstance mapping/validation).

Implements a new lib/restart-policy package plus an instances restart-policy controller that persists retry state, applies backoff/max-attempt/stable-window rules, suppresses restarts after manual StopInstance, and can treat health_check=unhealthy as a restart trigger for on_failure/always.

Wires the controller into cmd/api/main.go, extends health check controller to optionally call HandleHealthCheckUnhealthy, resets restart status on forks/snapshot restores, and adds unit + integration coverage for request mapping, validation, controller behavior, and metrics labels.
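
One way to make the health check hook optional, as described above, is a nil-able handler field. The sketch below is an assumption about the wiring, not the PR's code: only the method name `HandleHealthCheckUnhealthy` comes from the description, while its signature, the `HealthController` type, and the `recorder` test double are hypothetical.

```go
package main

// UnhealthyHandler is an assumed shape for the optional hook. Only the method
// name is taken from the PR description; the signature is hypothetical.
type UnhealthyHandler interface {
	HandleHealthCheckUnhealthy(instanceID string) error
}

// HealthController is a hypothetical health check controller.
type HealthController struct {
	// onUnhealthy is nil when no restart-policy controller is wired in.
	onUnhealthy UnhealthyHandler
}

// Observe handles one probe result. It forwards unhealthy results to the
// handler when one is configured and reports whether it did so.
func (c *HealthController) Observe(instanceID string, healthy bool) (forwarded bool, err error) {
	if healthy || c.onUnhealthy == nil {
		return false, nil
	}
	return true, c.onUnhealthy.HandleHealthCheckUnhealthy(instanceID)
}

// recorder is a test double that remembers which instances were reported.
type recorder struct{ seen []string }

func (r *recorder) HandleHealthCheckUnhealthy(instanceID string) error {
	r.seen = append(r.seen, instanceID)
	return nil
}
```

With this shape the health check controller keeps working unchanged when restart supervision is not configured, and the restart-policy side decides whether the policy (on_failure or always) warrants a restart.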

Reviewed by Cursor Bugbot for commit bb9b4f0. Bugbot is set up for automated code reviews on this repo.

github-actions Bot commented May 16, 2026

✱ Stainless preview builds for hypeman

This PR will update the hypeman SDKs with the following commit message.

feat: Add instance restart policy
hypeman-openapi studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅

⚠️ hypeman-typescript studio · code

Your SDK build had a failure in the lint CI job, which is a regression from the base state.
generate ✅ build ✅ lint ❗ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/adb5d8c01a4fc8a651846220783a2d18cfab155b/dist.tar.gz
hypeman-go studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅ build ⏭️ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@cfdcca8b51e20d5b0be4e99abc96a99b10adfd24

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-05-18 15:18:16 UTC

@sjmiller609 sjmiller609 force-pushed the hypeship/restart-policy branch from 383ad8b to 365fa7d Compare May 16, 2026 03:00
@sjmiller609 sjmiller609 changed the base branch from main to hypeship/add-healthcheck-policy May 16, 2026 03:01
@sjmiller609 sjmiller609 marked this pull request as ready for review May 17, 2026 17:35
@firetiger-agent

Monitoring Plan: Restart Policy

This PR introduces a new restart_policy feature on Hypeman instances along with a background RestartPolicyController goroutine that reconciles instance state every 5 seconds and handles health-check-driven stop/start cycles. The API gains new restart_policy / restart_status fields on Create and Update, and StopInstance has a behavior change: it now writes manual_stop to restart status even for already-stopped instances (previously a no-op).

The primary risks to watch are: stop/start loops if the reconciler misidentifies instances, a regression in StopInstance due to the new metadata write on the already-stopped path, and controller-level panics since this is a new background goroutine with no production history. Blast radius is limited — only instances with an explicit restart_policy set will be automatically restarted; browser sessions and other standard instances are unaffected. The plan checks against the 24h baseline of 0.069–0.096% 5xx error rate and monitors for restart-policy-specific WARN/ERROR logs. Status updates will be posted automatically on this PR as monitoring progresses.
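
The panic risk named above can be contained by recovering inside each reconcile pass, so one bad pass does not take down the background goroutine. The following is a sketch of that pattern, not the controller in this PR: `ReconcileFunc`, `safeReconcile`, `Run`, and `demoPanicErr` are hypothetical names, and whether the real controller recovers from panics is not stated in the PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ReconcileFunc is a hypothetical signature for one reconcile pass.
type ReconcileFunc func(ctx context.Context) error

// safeReconcile runs one pass and converts a panic into an error so the
// calling loop keeps running.
func safeReconcile(ctx context.Context, fn ReconcileFunc) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("reconcile panic: %v", r)
		}
	}()
	return fn(ctx)
}

// Run calls fn once per interval until ctx is cancelled. Failures, including
// recovered panics, are passed to onErr when it is non-nil.
func Run(ctx context.Context, interval time.Duration, fn ReconcileFunc, onErr func(error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := safeReconcile(ctx, fn); err != nil && onErr != nil {
				onErr(err)
			}
		}
	}
}

// demoPanicErr shows that a panicking pass surfaces as an ordinary error.
func demoPanicErr() error {
	return safeReconcile(context.Background(), func(context.Context) error {
		panic("boom")
	})
}
```

A caller would start it with something like `go Run(ctx, 5*time.Second, reconcile, logErr)`, matching the 5-second interval mentioned above, and route `onErr` to the WARN/ERROR logs the plan monitors.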

Comment thread lib/instances/restart_policy.go
Comment thread lib/instances/manager.go
Comment thread lib/instances/restart_policy.go
Comment thread cmd/api/api/restart_policy.go Outdated
Comment thread lib/instances/restart_policy.go
Comment thread lib/instances/restart_policy.go
Comment thread lib/instances/manager.go Outdated
Comment thread lib/instances/restart_policy.go
@sjmiller609 sjmiller609 requested a review from yummybomb May 18, 2026 14:39
Base automatically changed from hypeship/add-healthcheck-policy to main May 18, 2026 15:03
@sjmiller609 sjmiller609 force-pushed the hypeship/restart-policy branch from 4f6aa80 to bb9b4f0 Compare May 18, 2026 15:08
@sjmiller609 sjmiller609 merged commit 4acbfdc into main May 18, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/restart-policy branch May 18, 2026 15:15
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit bb9b4f0.

m.notifyLifecycleEvent(ctx, LifecycleEventStart, inst)
}
return inst, err
}
Restart ignores manual stop set between status write and start

Medium Severity

RestartInstance acquires the instance lock and calls startInstance without checking if manual_stop was set on the restart status. In startInstanceForRestartPolicy, setRestartStatusIfStopped acquires the lock, verifies no manual_stop, writes the attempt status, and releases the lock. Then RestartInstance separately acquires the lock to start. If a user calls StopInstance in the gap between these two lock acquisitions, markRestartManualStopLocked writes manual_stop to metadata, but RestartInstance never checks it—startInstance only validates instance state is Stopped, not the restart status. This TOCTOU race allows one unwanted automatic restart after an explicit manual stop.
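
A common way to close this kind of check-then-act gap is to do the manual_stop check and the start inside the same critical section. The sketch below illustrates the idea with a toy model; it is not the fix applied in the repository, and the `instance` type, its fields, `Stop`, `RestartByPolicy`, and `ErrManualStop` are all hypothetical.

```go
package main

import (
	"errors"
	"sync"
)

// ErrManualStop is a hypothetical sentinel returned when a policy restart is
// suppressed by an explicit stop.
var ErrManualStop = errors.New("restart suppressed: manual stop")

// instance is a hypothetical stand-in for the manager's per-instance state.
type instance struct {
	mu         sync.Mutex
	running    bool
	manualStop bool
	attempts   int
}

// Stop models a user-initiated stop: it records manual_stop under the lock,
// even if the instance is already stopped.
func (i *instance) Stop() {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.running = false
	i.manualStop = true
}

// RestartByPolicy checks manual_stop, records the attempt, and starts the
// instance while holding the lock once, so a concurrent Stop either happens
// entirely before (and suppresses the restart) or entirely after (and stops
// the freshly started instance).
func (i *instance) RestartByPolicy() error {
	i.mu.Lock()
	defer i.mu.Unlock()
	if i.manualStop {
		return ErrManualStop
	}
	if i.running {
		return errors.New("instance is not stopped")
	}
	i.attempts++
	i.running = true
	return nil
}
```

If holding the lock across the whole start is too expensive, an alternative is to re-check the restart status inside the start path after re-acquiring the lock; either way the check and the state transition must not be separated by an unlock.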
