feat(observability): surface bundle start failures at boot #213
Open
Ovaculos wants to merge 5 commits into
Boot-time bundle startup failures previously went only to container stderr, so operators had to grep to discover that a workspace bundle silently never came up. The user-visible symptom was a missing tool with no event trail in the workspace log, no SSE notification, and no entry in `/v1/health`.

Three changes, all on the catch path in `startWorkspaceBundles`:

1. New `bundle.start_failed` engine event. Routed workspace-scoped via `SSE_ROUTES` (drives SSE fan-out) and `WORKSPACE_EVENTS` (persists to the workspace log).
2. `startWorkspaceBundles` returns a `failures: BundleStartFailure[]` array alongside `entries`. Runtime stashes it on `_bundleStartFailures` and exposes it via `bundleStartFailures()`.
3. `HealthMonitor` takes a `startFailures` option at construction; `getStatus()` merges them in as terminal `dead` entries so `/v1/health` reflects the failed bundle instead of omitting it.

This is distinct from `bundle.crashed`, which requires a running source that went away. A start failure means no `McpSource` ever existed, so the record can't be restarted by the periodic health-check loop; the `dead` entry is terminal.

Tests cover four things: the catch path leaves sibling bundles unaffected, the merged status in `getStatus()`, SSE routing, and workspace-log persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
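The three changes in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the branch: the `BundleSpec` type, the `start` and `emit` callbacks, the `live` callback passed to `HealthMonitor`, the `"healthy"` state, and the choice of `serverName` as the health entry name are all assumptions. The function is synchronous here for brevity, and the event uses the name it ends up with after the later rename in this PR.

```typescript
// Field names follow the PR text; BundleSpec, the callbacks, and the
// "healthy" state are assumptions made for this sketch.
interface BundleStartFailure {
  wsId: string;
  serverName: string;
  bundleName: string;
  error: string;
}

interface BundleSpec {
  wsId: string;
  serverName: string;
  bundleName: string;
}

interface BundleHealth {
  name: string;
  state: "healthy" | "dead";
  wsId?: string;
}

type EngineEvent = { type: "bundle.startFailed"; data: BundleStartFailure };

// Synchronous for brevity; the real function is presumably async.
function startWorkspaceBundles<T>(
  specs: BundleSpec[],
  start: (spec: BundleSpec) => T, // hypothetical starter callback
  emit: (event: EngineEvent) => void, // hypothetical event sink
): { entries: T[]; failures: BundleStartFailure[] } {
  const entries: T[] = [];
  const failures: BundleStartFailure[] = [];
  for (const spec of specs) {
    try {
      entries.push(start(spec));
    } catch (err) {
      // Record the failure and keep going so sibling bundles still start.
      const failure: BundleStartFailure = {
        wsId: spec.wsId,
        serverName: spec.serverName,
        bundleName: spec.bundleName,
        error: err instanceof Error ? err.message : String(err),
      };
      failures.push(failure);
      emit({ type: "bundle.startFailed", data: failure });
    }
  }
  return { entries, failures };
}

class HealthMonitor {
  private readonly live: () => BundleHealth[];
  private readonly startFailures: BundleStartFailure[];

  constructor(
    live: () => BundleHealth[],
    opts: { startFailures?: BundleStartFailure[] } = {},
  ) {
    this.live = live;
    this.startFailures = opts.startFailures ?? [];
  }

  // Live entries first, then one terminal "dead" entry per boot-time failure.
  getStatus(): BundleHealth[] {
    const dead = this.startFailures.map(
      (f): BundleHealth => ({ name: f.serverName, state: "dead", wsId: f.wsId }),
    );
    return [...this.live(), ...dead];
  }
}

// Usage: one bundle starts, one throws; both show up in the status.
const events: EngineEvent[] = [];
const specs: BundleSpec[] = [
  { wsId: "ws_a", serverName: "good", bundleName: "good-bundle" },
  { wsId: "ws_a", serverName: "broken", bundleName: "broken-bundle" },
];
const result = startWorkspaceBundles(
  specs,
  (spec) => {
    if (spec.serverName === "broken") throw new Error("start failed");
    return spec.serverName;
  },
  (event) => events.push(event),
);
const monitor = new HealthMonitor(
  () => result.entries.map((name): BundleHealth => ({ name, state: "healthy" })),
  { startFailures: result.failures },
);
const status = monitor.getStatus();
```

Because the failure entries come from a fixed array handed over at construction, nothing in this sketch ever moves them out of `dead`, which mirrors the "terminal" wording above.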
`BundleHealth.name` is the source name (slugged from manifest or URL). Two workspaces installing the same connector produce identical names, so a same-named start failure across workspaces would render as indistinguishable `dead` rows in `/v1/health`.

Add an optional `wsId?: string` to `BundleHealth`. Populated only for boot-time start failures (the data is on `BundleStartFailure`); live entries leave it undefined because `McpSource` doesn't carry a wsId. Consumers can disambiguate without a schema migration on the live path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
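A consumer of `/v1/health` could use the optional field along these lines. This is a sketch under assumptions: the `healthRowKey` helper, the key format, the sample names, and the `"healthy"` state are hypothetical; only `name`, `wsId`, and the `dead` state come from the commit message.

```typescript
// Only `name`, `wsId`, and "dead" come from the PR text; the rest is assumed.
interface BundleHealth {
  name: string;
  state: string;
  wsId?: string;
}

// Hypothetical helper: build a row key that stays unique when two workspaces
// fail to start a bundle with the same source name.
function healthRowKey(h: BundleHealth): string {
  return h.wsId === undefined ? h.name : `${h.wsId}/${h.name}`;
}

const rows: BundleHealth[] = [
  { name: "github", state: "dead", wsId: "ws_a" }, // start failure in workspace A
  { name: "github", state: "dead", wsId: "ws_b" }, // start failure in workspace B
  { name: "github", state: "healthy" }, // live entry, no wsId
];

const keys = rows.map(healthRowKey);
const allDistinct = new Set(keys).size === rows.length;
```

Keeping `wsId` optional means existing consumers that only read `name` and `state` continue to work unchanged.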
Return a shallow copy instead of the internal array reference, matching the pattern used by `bundleNames()` one method up. HealthMonitor stores the reference and doesn't mutate it today, but exposing the live array invites future callers to splice or push into it and silently corrupt the boot-time failure record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
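The defensive copy amounts to something like the following. It is a stripped-down stand-in for the real `Runtime` class: the constructor signature and the sample failure are assumptions, and only `_bundleStartFailures` and `bundleStartFailures()` are names from the commit.

```typescript
interface BundleStartFailure {
  wsId: string;
  serverName: string;
  bundleName: string;
  error: string;
}

// Hypothetical, minimal Runtime; the real class has far more state.
class Runtime {
  private _bundleStartFailures: BundleStartFailure[];

  constructor(failures: BundleStartFailure[]) {
    this._bundleStartFailures = [...failures];
  }

  // Shallow copy: callers get their own array, so push/splice on the result
  // cannot change the boot-time record. The element objects are still shared.
  bundleStartFailures(): BundleStartFailure[] {
    return [...this._bundleStartFailures];
  }
}

const runtime = new Runtime([
  { wsId: "ws_a", serverName: "broken", bundleName: "broken-bundle", error: "start failed" },
]);

const snapshot = runtime.bundleStartFailures();
snapshot.length = 0; // the caller empties its own copy
const stillRecorded = runtime.bundleStartFailures().length;
```

A shallow copy protects the array itself; if callers were ever expected to edit individual failure objects, freezing or cloning the elements would be a separate decision.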
Match the casing convention of sibling bundle.* events (`installed`, `uninstalled`, `crashed`, `recovered`, `dead`) which use a single camel/lowercase token after the dot. Pure rename: no payload or routing change. Internal-only event with no external subscribers yet, so safe to rename without a deprecation window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The startWorkspaceBundles failure tests asserted on a substring of
the error message ("Local bundle not found") emitted by buildLocalSource.
That couples the test to wording inside a different module — rewording
the error there would break tests here for no behavioral reason.
Switch to shape assertions: error and bundleName are non-empty strings,
plus the existing wsId / serverName equality checks. The behavior under
test is "a failure was recorded with the expected fields populated,"
not "the message reads exactly this."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
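The shape-over-wording idea can be illustrated without any test framework. In the sketch below, `assertFailureShape` and `isNonEmptyString` are hypothetical helpers written for this example; the actual tests presumably use the project's test runner and its own assertion calls.

```typescript
interface BundleStartFailure {
  wsId: string;
  serverName: string;
  bundleName: string;
  error: string;
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.length > 0;
}

// Hypothetical helper: checks identity fields exactly, but only requires that
// `error` and `bundleName` are populated, never what the message says.
function assertFailureShape(
  failure: BundleStartFailure,
  expected: { wsId: string; serverName: string },
): void {
  if (failure.wsId !== expected.wsId) throw new Error("wsId mismatch");
  if (failure.serverName !== expected.serverName) throw new Error("serverName mismatch");
  if (!isNonEmptyString(failure.bundleName)) throw new Error("bundleName is empty");
  if (!isNonEmptyString(failure.error)) throw new Error("error is empty");
}

// Passes regardless of how the upstream module words its error message.
const recorded: BundleStartFailure = {
  wsId: "ws_a",
  serverName: "broken",
  bundleName: "broken-bundle",
  error: "any wording the upstream module chooses",
};
assertFailureShape(recorded, { wsId: "ws_a", serverName: "broken" });
```

The trade-off is that a regression which produced a misleading but non-empty message would no longer be caught here; that check, if wanted, belongs with the module that owns the wording.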
Closes #7
Summary
Boot-time bundle startup failures previously went only to container stderr, so operators had to grep to discover that a workspace bundle silently never came up. The user-visible symptom: a missing tool with no event trail in the workspace log, no SSE notification, and no entry in `/v1/health`.

This branch adds three observability channels for boot failures and four small follow-ups from review.
What changed
Base feature (`1cf008d`): `feat(observability): surface bundle start failures at boot`

Three changes on the catch path in `startWorkspaceBundles`:

1. `bundle.startFailed` engine event. Routed workspace-scoped via `SSE_ROUTES` (drives SSE fan-out) and `WORKSPACE_EVENTS` (persists to the workspace log).
2. `startWorkspaceBundles` returns a `failures: BundleStartFailure[]` array alongside `entries`. `Runtime` stashes it on `_bundleStartFailures` and exposes it via `bundleStartFailures()`.
3. `HealthMonitor` takes a `startFailures` option at construction; `getStatus()` merges them in as terminal `dead` entries so `/v1/health` reflects the failed bundle instead of omitting it.

Distinct from `bundle.crashed`, which requires a running source that went away. A start failure means no `McpSource` ever existed, so the record can't be restarted by the periodic health-check loop; the `dead` entry is terminal.

Follow-ups (review fixes)
- `f7372c1`: `feat(health-monitor): propagate wsId on dead BundleHealth entries`. Two workspaces installing the same connector produce identical source names, so same-named start failures would render as indistinguishable `dead` rows. Optional `wsId?: string` on `BundleHealth`, populated only for failures (live sources don't carry a wsId on `McpSource`).
- `4170a2d`: `refactor(runtime): defensive-copy bundleStartFailures() return value`. Matches the `bundleNames()` pattern one method up.
- `af8c59f`: `refactor(events): rename bundle.start_failed to bundle.startFailed`. Matches the casing of sibling `bundle.*` events (`installed`, `uninstalled`, `crashed`, `recovered`, `dead`).
- `e55eefb`: `test(workspace-runtime): assert shape over error message string`. Decouples the tests from `buildLocalSource` error wording.

Test plan
- `bun run test:unit`: 48 affected tests pass
- `bunx tsc --noEmit`: typecheck clean
- `bun run verify`: full CI parity locally
- (a) … `bundle.startFailed` line, (b) `/v1/health` lists the bundle as `state: "dead"` with `wsId` populated, (c) SSE client receives the event.

🤖 Generated with Claude Code