feat(runtime): /runtimes/* HTTP surface + RuntimeStatusBar/ControlPanel UI #971
Dani Akash (DaniAkash) wants to merge 16 commits into
Conversation
Uniform HTTP surface backed by AgentRuntimeRegistry + runtime.executeAction:
- GET /runtimes — list all registered runtimes (descriptor + status + capabilities)
- GET /runtimes/:adapter/status — single status snapshot
- GET /runtimes/:adapter/status/stream — SSE: snapshot on connect + every state transition
- POST /runtimes/:adapter/actions/:action — capability-gated dispatch through executeAction
- GET /runtimes/:adapter/logs — container-runtime logs (405 for host-process)

Routes use zValidator for path/query/body so the typed RPC client picks up the schemas; mounted with the same requireTrustedAppOrigin middleware as /claw/*, /terminal, /acl-rules, and /monitoring.
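The gating behavior the actions route describes (dispatch only when the capability is present, reject unknown actions) can be sketched as a pure decision function. This is illustrative, not the PR's code — `gateAction`, `KNOWN_ACTIONS`, and `RuntimeLike` are assumed names; the status codes follow the contract stated in the summary (400 unknown, 405 not in capabilities).

```typescript
// Hypothetical sketch of the capability-gating decision behind
// POST /runtimes/:adapter/actions/:action. Names are illustrative.
const KNOWN_ACTIONS: ReadonlySet<string> = new Set([
  "install", "start", "stop", "restart", "reset",
]);

interface RuntimeLike {
  capabilities: string[];
}

// Returns the HTTP status the route would respond with.
function gateAction(runtime: RuntimeLike, action: string): number {
  if (!KNOWN_ACTIONS.has(action)) return 400; // unknown action
  if (!runtime.capabilities.includes(action)) return 405; // not supported by this adapter
  return 200; // safe to dispatch through runtime.executeAction
}
```

Checking the action name before the capability list means a typo'd action never reaches the adapter at all.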
Generic React Query hooks backed by the typed RPC client (hc<AppType>), keyed by adapter id. useRuntime polls /runtimes/:adapter/status every 5s by default; useRuntimeAction issues a capability-gated POST to /runtimes/:adapter/actions/:action and invalidates the status query on success; useRuntimeLogs is opt-in (disabled by default) for container runtimes.
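The adapter-keyed caching scheme can be sketched without React: a query-key factory plus the invalidation predicate a successful mutation would apply. The `runtime-status` key shape matches the sequence diagram below; `runtimeKeys` and `shouldInvalidate` are illustrative names, not the PR's exports.

```typescript
// Illustrative adapter-keyed query keys for the runtime hooks.
const runtimeKeys = {
  list: ["runtimes"] as const,
  status: (adapter: string) => ["runtime-status", adapter] as const,
  logs: (adapter: string) => ["runtime-logs", adapter] as const,
};

// Predicate a useRuntimeAction onSuccess handler could use to invalidate
// only the status query for the adapter it acted on.
function shouldInvalidate(queryKey: readonly unknown[], adapter: string): boolean {
  return queryKey[0] === "runtime-status" && queryKey[1] === adapter;
}
```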
RuntimeStatusBar — compact one-line bar with adapter name + state pill + optional Restart action. Reads from useRuntime(adapter); the pill covers every container and host-process state. extraPill / extraActions slots let openclaw add its control-plane pill and Open Terminal button without baking gateway specifics into the runtime layer.

RuntimeControlPanel — capability-gated, state-appropriate primary CTA: not_installed → Install, stopped → Start, errored → Restart + Reset, installing/starting → spinner, cli_missing/unhealthy → Reinstall CLI, running → optional Stop. extras slot for adapter-specific affordances (e.g. openclaw provider Setup dialog trigger).
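The state → CTA table above can be expressed as one exhaustive switch; a sketch, with the `RuntimeState` union and `primaryCta` as assumed names (the `installed` case comes from a fix in a later commit of this PR, which treats it like `stopped`).

```typescript
// Illustrative mapping from runtime state to RuntimeControlPanel's primary CTA.
type RuntimeState =
  | "not_installed" | "installed" | "installing" | "starting"
  | "stopped" | "errored" | "cli_missing" | "unhealthy" | "running";

function primaryCta(state: RuntimeState, canStop: boolean): string {
  switch (state) {
    case "not_installed": return "Install";
    case "stopped": return "Start";
    case "installed": return "Start"; // hypothetical: added by the stuck-state fix later in the PR
    case "errored": return "Restart + Reset";
    case "installing":
    case "starting": return "spinner";
    case "cli_missing":
    case "unhealthy": return "Reinstall CLI";
    case "running": return canStop ? "Stop" : "none"; // Stop only when capability-gated in
  }
}
```

An exhaustive switch makes a missing state a compile-time error rather than a blank panel.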
…ge; drop legacy lifecycle UI

AgentsPage now uses the new runtime-control components for OpenClaw lifecycle:
- RuntimeControlPanel replaces GatewayStateCards (state-appropriate CTAs gated on capabilities). The provider config dialog trigger lives in the panel's extras slot.
- RuntimeStatusBar replaces GatewayStatusBar (running pill + Restart). The control-plane pill + Open Terminal live in the bar's extra slots — gateway specifics stay outside the runtime layer.

GatewayStatusBar.tsx is deleted outright. The 'Unavailable' badge in AgentSummaryChips.tsx is deleted — capabilities-driven UI surfaces the same signal more usefully on the new RuntimeControlPanel; the prop stays for upstream callers but is now a no-op.

ControlPlaneAlert / LifecycleAlert / InlineErrorAlert from OpenClawControls remain — they're alerts for control-plane and mid-flight lifecycle states, distinct from the runtime control surface, and cover gateway-specific concerns the runtime layer doesn't model. Cleanup is deferred to a follow-up.
✅ Tests passed — 1212/1216
Greptile Summary: This PR lands the user-visible runtime layer: a uniform /runtimes/* HTTP surface plus the capability-gated UI components that consume it.
Confidence Score: 4/5 — Safe to merge after addressing the minor cleanup items. The core server routes, hooks, and UI components are well-structured and covered by tests. The new routes are capability-gated, validated, and tested. The UI components cleanly replace their predecessors without introducing regressions on the primary openclaw flow. The findings are quality/cleanup items: an unused query key (useRuntime.ts, RUNTIME_QUERY_KEYS.list), a dead adapterHealth prop retained for callers but never read (AgentSummaryChips.tsx), a label-fidelity regression in the control-plane pill (AgentsPage.tsx), and a subtle SSE heartbeat leak on silent TCP drops (runtimes.ts). None affect correctness of the main flow today.
Sequence Diagram

sequenceDiagram
participant UI as AgentsPage (React)
participant Hook as useRuntime / useRuntimeAction
participant RPC as Hono RPC Client
participant Server as /runtimes/* routes
participant Reg as AgentRuntimeRegistry
participant RT as AgentRuntime (openclaw)
UI->>Hook: useRuntime("openclaw") [5s poll]
Hook->>RPC: GET /runtimes/:adapter/status
RPC->>Server: GET /runtimes/openclaw/status
Server->>Reg: registry.get("openclaw")
Reg-->>Server: runtime instance
Server->>RT: runtime.getStatusSnapshot()
RT-->>Server: RuntimeStatusSnapshot
Server-->>RPC: "{ descriptor, status, capabilities }"
RPC-->>Hook: RuntimeView
Hook-->>UI: "{ data, isLoading }"
UI->>Hook: useRuntimeAction("openclaw")
UI->>Hook: "action.mutate({ action: "restart" })"
Hook->>RPC: POST /runtimes/openclaw/actions/restart
RPC->>Server: POST /runtimes/:adapter/actions/:action
Server->>RT: capabilities.includes("restart")?
RT-->>Server: true
Server->>RT: "runtime.executeAction({ type: "restart" })"
RT-->>Server: void
Server-->>RPC: "{ status: "ok", state: "starting" }"
RPC-->>Hook: success
Hook->>Hook: invalidateQueries(["runtime-status","openclaw"])
UI->>RPC: GET /runtimes/openclaw/status/stream (SSE)
RPC->>Server: SSE connect
Server->>RT: runtime.subscribe(writeSnapshot)
RT-->>Server: unsubscribe fn
loop every state change
RT->>Server: listener(snapshot)
Server-->>UI: "event: snapshot data: {...}"
end
loop every 15s
Server-->>UI: "event: heartbeat data: {ts:...}"
end
UI->>Server: abort
Server->>RT: unsubscribe()
Server->>Server: clearInterval(heartbeat)
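The subscribe/heartbeat/teardown contract at the end of the diagram can be sketched with a minimal in-memory runtime. The 15s heartbeat interval comes from the diagram; `FakeRuntime`, `attachStream`, and the event strings are illustrative. Note both the unsubscribe and the interval are cleared together on abort — the review's heartbeat-leak finding is about the case where only one of them is.

```typescript
// Minimal sketch of the SSE route's subscribe + heartbeat + teardown wiring.
type Snapshot = { state: string };
type Listener = (s: Snapshot) => void;

class FakeRuntime {
  private listeners = new Set<Listener>();
  subscribe(l: Listener): () => void {
    this.listeners.add(l);
    return () => { this.listeners.delete(l); };
  }
  emit(s: Snapshot) { this.listeners.forEach((l) => l(s)); }
  get listenerCount() { return this.listeners.size; }
}

// Forward snapshots, heartbeat on a timer, tear BOTH down on abort.
function attachStream(rt: FakeRuntime, write: (ev: string) => void): () => void {
  const unsubscribe = rt.subscribe((s) => write(`snapshot:${s.state}`));
  const heartbeat = setInterval(() => write("heartbeat"), 15_000);
  return () => { unsubscribe(); clearInterval(heartbeat); };
}
```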
…nder Start CTA for installed state

Two stuck-state bugs in the new RuntimeControlPanel:
1. The runtime's state machine started fresh at not_installed on every server boot. tryAutoStart's short-circuit branches (gateway already running, auth pass) never drove the state transitions, so the UI saw not_installed for a gateway that was actually running. Add a syncState() method on OpenClawContainerRuntime that probes the actual container via cli.inspectContainer + /readyz and sets state accordingly. Wire it into tryAutoStart as the first step so it runs regardless of which branch the rest takes.
2. RuntimeControlPanel had no case for state === 'installed', so after a successful Install the panel went blank instead of offering the next step. Treat installed the same as stopped — show the Start CTA with copy that reflects the difference (image is pulled vs container exists but stopped).

Optional-chained the syncState call so existing tests with partial runtime mocks don't crash on the missing method.
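The optional-chaining guard mentioned at the end can be sketched in isolation: a mock without `syncState` simply skips the call instead of throwing. `tryAutoStartFirstStep` and `RuntimeMaybeSync` are illustrative names, not the PR's API.

```typescript
// Illustrative: the syncState call is optional-chained so partial runtime
// mocks (no syncState method) don't crash.
interface RuntimeMaybeSync {
  syncState?: () => Promise<void>;
}

async function tryAutoStartFirstStep(runtime: RuntimeMaybeSync): Promise<boolean> {
  // Runs first, regardless of which branch the rest of tryAutoStart takes;
  // a no-op when the mock doesn't implement it.
  await runtime.syncState?.();
  return true;
}
```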
When a previous server boot wrote runtime-state.json after the gateway container had already been created with a different hostPort (e.g. 18789 held at allocate-time → container started on 18790), the persisted port disagrees with the live mapping. The runtime then probes the persisted port forever and the UI sticks at `starting`.

`syncState` now reads `NetworkSettings.Ports` from inspect-container and adopts the actual host port for the gateway container's published port when it differs. The service then re-syncs `hostPort`/`httpClient` and rewrites runtime-state.json so the next boot starts from a clean slate.
- ContainerInfo gains a flat `ports` array (parsed from `NetworkSettings.Ports`)
- OpenClawContainerRuntime.syncState: reconcile hostPort from live mapping before probing /readyz
- OpenClawService.tryAutoStart: adopt the runtime's reconciled port and persist it via writePersistedGatewayPort
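The flattening and reconciliation steps can be sketched as two pure functions. The input shape follows the Docker/containerd inspect `NetworkSettings.Ports` format; `flattenPorts`, `reconcileHostPort`, and the `PortMapping` fields are illustrative, not the PR's types.

```typescript
// Illustrative: flatten NetworkSettings.Ports into a flat ports array,
// then adopt the live host port when it drifts from the persisted one.
type PortsMap = Record<string, { HostIp?: string; HostPort: string }[] | null>;
interface PortMapping { containerPort: number; hostPort: number; protocol: string }

function flattenPorts(ports: PortsMap): PortMapping[] {
  const out: PortMapping[] = [];
  for (const [key, bindings] of Object.entries(ports)) {
    const [port, protocol = "tcp"] = key.split("/"); // keys look like "18789/tcp"
    for (const b of bindings ?? []) {
      out.push({ containerPort: Number(port), hostPort: Number(b.HostPort), protocol });
    }
  }
  return out;
}

// e.g. persisted 18789 but container published on 18790 → adopt 18790.
function reconcileHostPort(persisted: number, ports: PortMapping[], containerPort: number): number {
  const live = ports.find((p) => p.containerPort === containerPort);
  return live && live.hostPort !== persisted ? live.hostPort : persisted;
}
```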
…ismatch When a previous boot leaves a gateway running with a stale token, the realloc-on-auth-mismatch branch was bumping the persisted port without actually freeing the old container — ManagedContainer.start() no-ops when state==='running', so the next start cycle never recreated the container on the new port. The result: persisted/service/runtime drift back into mismatch, and history requests 500 with "gateway is not ready" even while the (stale) gateway keeps serving chat from the old port. Stop the gateway explicitly when we decide to bump off the port, so the upcoming start cycle goes through the full remove + create + start path on the freshly-allocated port. The token-mismatch test still passes; adds a new test pinning the stop-before-realloc behaviour.
…fresh install
Starting the gateway via the new RuntimeControlPanel "Start" CTA goes
through runtime.executeAction({type:'start'}) directly, bypassing
OpenClawService.tryAutoStart and its ensureStateEnvFile() seeding step.
On a freshly-wiped .browseros-dev this left nerdctl create failing with
"failed to open env file .../.openclaw/.env: no such file or directory".
Seed the file (empty, mode 0600) inside buildContainerSpec so the
runtime is self-sufficient. Service callers continue to work — their
ensureStateEnvFile is now an idempotent no-op once the file exists.
OpenClawService.getStatus was carrying its own view of "is the gateway alive" (running/stopped/uninitialized derived from machineStatus + isReady probe) while the new AgentRuntime maintains the canonical state machine. The two could disagree — most visibly after a wipe + partial restart, where the runtime correctly read not_installed but the service still reported running/connected from in-memory fields.

Map the legacy status surface from runtime.getStatusSnapshot().state so the two pills can't contradict each other. Clear controlPlaneStatus / lastGatewayError / lastRecoveryReason whenever the runtime isn't running — those signals are only meaningful for an alive gateway.

This is the first chunk of the legacy-lifecycle removal. Lifecycle methods on the service (restart/shutdown/tryAutoStart/etc.) and duplicated hostPort state still exist and will be removed in follow-up commits.
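The derivation can be sketched as a pure mapping plus a scrub step. The legacy enum values (running/stopped/uninitialized) come from the commit text; which runtime states map to uninitialized, and the `legacyStatus` / `scrubSignals` names, are assumptions.

```typescript
// Illustrative: derive the legacy status from the canonical runtime state,
// and blank control-plane signals when the gateway isn't alive.
type RuntimeState =
  | "not_installed" | "installing" | "installed" | "starting"
  | "running" | "stopped" | "errored" | "unhealthy" | "cli_missing";
type LegacyStatus = "running" | "stopped" | "uninitialized";

function legacyStatus(state: RuntimeState): LegacyStatus {
  if (state === "running") return "running";
  // Assumption: pre-install states map to the legacy "uninitialized".
  if (state === "not_installed" || state === "installing") return "uninitialized";
  return "stopped";
}

interface LegacySignals {
  controlPlaneStatus?: string;
  lastGatewayError?: string;
  lastRecoveryReason?: string;
}

// Signals only mean something for an alive gateway; otherwise clear them.
function scrubSignals(signals: LegacySignals, state: RuntimeState): LegacySignals {
  return state === "running" ? signals : {};
}
```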
Removes the start/stop/restart/reconnectControlPlane/shutdown surface on
OpenClawService — these duplicated the new AgentRuntime state machine
and were the root cause of the two views disagreeing. UI flows now go
through runtime.executeAction via the RuntimeControlPanel; server
shutdown via getOpenClawRuntime().executeAction({type:'stop'}).
Server:
- delete service.start/stop/restart/reconnectControlPlane/shutdown +
stopGatewayLogTail (now unreferenced)
- delete /claw/start /claw/stop /claw/restart /claw/reconnect routes
- replace internal `await this.restart()` (createAgent, updateProviderKeys)
with `runtime.restartGateway` — provider-config changes only need a
container restart, not a control-plane re-probe
- main.ts shutdown handler uses getOpenClawRuntime().executeAction directly
UI:
- useOpenClawMutations drops startOpenClaw/stopOpenClaw/restartOpenClaw/
reconnectOpenClaw and pendingGatewayAction; setup/create/delete remain
- AgentsPage drops the legacy LifecycleAlert + ControlPlaneAlert blocks;
the RuntimeControlPanel already renders pending state on its own
action buttons
Tests:
- delete tests for the removed methods
- runtime mocks in restart-side tests now expose restartGateway directly
Port persistence + reconciliation now lives entirely on the runtime
side. Service keeps a lazy httpClient getter that always reads the
current port from runtime.getHostPort(), so a port change (via
syncState drift detection) propagates everywhere automatically.
Server:
- OpenClawContainerRuntime seeds hostPort from runtime-state.json at
construction (readPersistedGatewayPortSync) and writes back via
syncState when the live container's mapping drifts
- OpenClawService.hostPort, setPort, adoptRuntimeHostPort,
ensureGatewayPortAllocated, isCurrentGatewayAvailable,
isGatewayAvailable, isGatewayAuthenticated, isGatewayPortReady,
the httpClient field, and the local fetchOk all deleted
- tryAutoStart is now ~10 lines: syncState → executeAction({type:start})
→ control-plane probe; no port juggling, no auth-mismatch realloc
(that path was driving the broken-state bug from earlier)
- internal `this.hostPort` references now go through runtime.getHostPort()
Tests:
- delete the four obsolete tryAutoStart tests (each asserted internals
that are gone) plus the unused mockGatewayAuth helpers
- add two slim tryAutoStart tests pinning the new contract
- existing runtime tests still call setHostPort, so the method survives
as a test-only override
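The lazy httpClient getter described above (always reading the current port from runtime.getHostPort()) is the key to the automatic propagation; a sketch of the pattern, with `GatewayClient` and `PortSource` as illustrative names:

```typescript
// Illustrative: derive the base URL at access time rather than caching it
// at construction, so a port reconciled by syncState is picked up everywhere.
interface PortSource { getHostPort(): number }

class GatewayClient {
  constructor(private readonly runtime: PortSource) {}
  // Recomputed on every access; no stale-port copy to keep in sync.
  get baseUrl(): string {
    return `http://127.0.0.1:${this.runtime.getHostPort()}`;
  }
}
```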
The runtime state machine is now the single source of truth in the UI; the old OpenClawStatus surface (controlPlaneStatus, lastGatewayError, lastRecoveryReason, the status enum) and its consumers are all dead weight after Chunks 1-4. Drop them.

UI:
- OpenClawControls.tsx: delete StatusBadge, ControlPlaneBadge, AgentsPageHeader, LifecycleAlert, ControlPlaneAlert, GatewayStateCards. Keep ProviderSelector + InlineErrorAlert — still used by the setup dialog and AgentsPage's inline error surface.
- agents-page-utils.ts: delete getControlPlaneCopy, getRecoveryDetail, getGatewayUiState, getLifecycleBanner, canManageOpenClawAgents, shouldShowControlPlaneDegraded, getControlPlaneCopyForStatus.
- agents-page-types.ts: delete GatewayUiState, LIFECYCLE_BANNER_COPY, CONTROL_PLANE_COPY, FALLBACK_CONTROL_PLANE_COPY, RECOVERY_REASON_COPY.
- useOpenClaw.ts: delete OpenClawStatus + GatewayLifecycleAction.
The agents page only surfaced OpenClaw's lifecycle controls — Hermes auto-installed silently at boot with no UI visibility or manual handle. Adds a generic section that iterates over container-kind runtimes from /runtimes and renders a control panel + status bar per adapter.
- new useRuntimes() hook hits GET /runtimes
- new RuntimesSection renders one card per container runtime, with an adapter-keyed extras registry for adapter-specific affordances (panel extras + status-bar pill / actions)
- AgentsPage replaces its hand-rolled openclaw panel + bar with the section, plugging Configure-provider + Terminal into the openclaw slot via the registry
- the section becomes adapter-agnostic: new container runtimes show up on the page automatically (filtered by descriptor.kind === 'container')
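The descriptor.kind filter the section applies can be sketched in a few lines; the `RuntimeView` shape and `containerRuntimes` name are assumptions based on the commit text.

```typescript
// Illustrative: RuntimesSection only renders container-kind runtimes,
// so a new container adapter appears on the page with no UI changes.
interface RuntimeView {
  descriptor: { id: string; kind: "container" | "host-process" };
}

function containerRuntimes(all: RuntimeView[]): RuntimeView[] {
  return all.filter((r) => r.descriptor.kind === "container");
}
```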
ManagedContainer.start was firing the subclass `readinessProbe()` exactly once, the moment containerd reported the container as Up. For OpenClaw this raced the Node.js gateway's HTTP listener bind — containerd flips status as soon as the entrypoint process spawns, but the Express server takes a few hundred ms to start serving /readyz. Single-shot probe → unlucky timing → state='errored' with "Readiness probe failed after container reached running state".

Pre-refactor (dev branch) didn't hit this because openclaw used a two-phase flow: `runtime.startGateway` (no probe) then `service.waitForReady` (polled /readyz for 30s). When the new runtime architecture folded openclaw under ManagedContainer, the polling was lost.

Bring it into the base class: `ManagedContainer.start` now polls `readinessProbe()` within `descriptor.readinessProbe.timeoutMs` at `intervalMs` cadence. Deterministic probes (Hermes' `--version` exec) succeed on the first call and exit immediately — no extra latency. HTTP probes get the full budget they need. Also stops misapplying `descriptor.readinessProbe` to the containerd "Up" wait (which only takes ~50ms anyway — defaults are fine).
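The polling contract (retry within timeoutMs at intervalMs cadence, first-call success exits immediately) can be sketched as a standalone helper. `pollReadiness` is an illustrative name; the PR's implementation lives on ManagedContainer.

```typescript
// Illustrative: retry a readiness probe within a deadline.
// A deterministic probe that succeeds on call #1 adds no latency.
async function pollReadiness(
  probe: () => Promise<boolean>,
  timeoutMs: number,
  intervalMs: number,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await probe()) return true; // first-call success exits immediately
    if (Date.now() + intervalMs > deadline) return false; // budget exhausted
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```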
Summary
Stacked on #970 (feat/openclaw-runtime). Lands the user-visible piece of the AgentRuntime architecture: a uniform
/runtimes/<adapter>/* HTTP surface backed by runtime.executeAction(...) through AgentRuntimeRegistry, plus capability-gated UI components that consume it.

Server:
- GET /runtimes — list all registered runtimes with descriptor + status snapshot + capabilities
- GET /runtimes/:adapter/status — single runtime status
- GET /runtimes/:adapter/status/stream — SSE: snapshot on connect + every state transition + 15s heartbeat
- POST /runtimes/:adapter/actions/:action — capability-gated dispatch through executeAction. Body schema picks up agentId for reset-wipe-agent. 405 if action not in capabilities; 400 on unknown action; 500 on action throw.
- GET /runtimes/:adapter/logs — container-runtime logs (405 for host-process)
- zValidator for path/query/body so the typed RPC client (hc<AppType>) picks up the schemas.

UI:
- useRuntime(adapter) / useRuntimeAction(adapter) / useRuntimeLogs(adapter) — generic React Query hooks backed by the typed RPC client. 5s default poll; mutations invalidate the status query on success.
- <RuntimeStatusBar adapter='…'> replaces GatewayStatusBar. Compact one-line bar with state pill + optional Restart. extraPill and extraActions slots let openclaw add its control-plane pill and Open Terminal button without baking gateway specifics into the runtime layer.
- <RuntimeControlPanel adapter='…'> replaces GatewayStateCards from OpenClawControls. Capability-gated, state-appropriate primary CTA: not_installed → Install, stopped → Start, errored → Restart + Reset, installing/starting → spinner, cli_missing/unhealthy → Reinstall CLI, running → optional Stop. extras slot for adapter-specific affordances (e.g. openclaw's provider Setup dialog trigger).
- AgentSummaryChips.tsx deletes (capabilities-driven UI surfaces the signal more usefully on the new RuntimeControlPanel). GatewayStatusBar.tsx deletes outright.
- ControlPlaneAlert / LifecycleAlert / InlineErrorAlert from OpenClawControls remain — they cover gateway-specific concerns the runtime layer doesn't model.

Out of scope (deferred follow-ups):
- /claw/{status,start,stop,restart,logs} lifecycle routes — UI still polls /claw/status for control-plane info that lives outside the runtime registry. Will land once the control-plane surface is moved to the runtime layer (Phase 7+).
- useOpenClaw.ts's lifecycle mutations — they're now a fallback, replaced by the new hooks at the call sites that matter.

Test plan
- bun run typecheck clean across server + UI (pre-existing missing-generated-graphql errors aside)
- biome check clean on touched files
- tests/api/routes/runtimes.test.ts covering list/status/actions (capability gate, unknown action, agentId requirement, throw → 500) / logs (container vs host-process)
- (ContainerCli flake also reproduces on plain origin/dev)