feat(bundles): bundle upgrade flow + durable cache-staleness self-heal by mgoldsborough · Pull Request #291 · NimbleBrainInc/nimblebrain

mgoldsborough · 2026-05-27T04:11:10Z

Summary

Revives #168 (bundle upgrade flow, original work by @Ovaculos) on current main, cures the cache-staleness incident it left open, and homes app version management at the org level where the shared-cache architecture puts it.

The incident this cures

The mpak bundle cache dir is keyed by name only (<mpakHome>/cache/<scope>-<name>, no version) and is shared across every workspace on the platform. Once a pod caches version X, later boots reuse X and never pull a newer published version. When X's manifest violates the host-manifest schema, the install gate throws "Refusing to install" forever — even after a fixed version ships. The only workaround was deleting the cache dir on the pod and restarting.
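To make the failure mode concrete, here is a minimal sketch of a name-only cache key. The helper `cacheDirFor` and the rule that flattens `@scope/name` to `scope-name` are assumptions for illustration; the description above only states the `<mpakHome>/cache/<scope>-<name>` shape.

```typescript
// Hypothetical helper: derive the cache dir for a bundle.
// The version argument is accepted but unused, mirroring the name-only key.
function cacheDirFor(mpakHome: string, bundleName: string, _version: string): string {
  // Assumed flattening rule: "@scope/name" becomes "scope-name".
  const flat = bundleName.replace(/^@/, "").replace("/", "-");
  return `${mpakHome}/cache/${flat}`;
}

const dirForOld = cacheDirFor("/var/mpak", "@acme/notes", "1.0.0");
const dirForNew = cacheDirFor("/var/mpak", "@acme/notes", "1.0.1");

// Both versions resolve to the same directory, so a cached 1.0.0 shadows 1.0.1.
console.log(dirForOld === dirForNew);
```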

What changed

Durable cure — boot / re-spawn self-heal
A typed HostManifestGateError lets startBundleSource force one re-pull from the registry when a cached manifest fails the host-manifest gate, before giving up — a published fix self-heals on the next spawn. This is the single chokepoint for every named re-spawn (boot reload, JIT install, configure-restart), so it cures the incident on the install path too. Offline-safe: a failed re-pull keeps the cached copy and surfaces the original gate error.
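The retry-once behaviour can be sketched as follows. This is not the PR's code: the error class shape, the `SpawnDeps` interface, and `startWithSelfHeal` are hypothetical stand-ins for the real `HostManifestGateError` and `startBundleSource`, with the registry and gate injected so the sketch is self-contained.

```typescript
// Typed error so a gate rejection can be told apart from any other failure.
class HostManifestGateError extends Error {
  reason: string;
  constructor(reason: string) {
    super(`Refusing to install: ${reason}`);
    this.name = "HostManifestGateError";
    this.reason = reason;
  }
}

// Hypothetical dependencies (assumed names, not the project's API).
interface SpawnDeps {
  // Validates the cached manifest and returns the version, or throws.
  loadAndGate(): Promise<string>;
  // Re-downloads the bundle from the registry into the shared cache.
  forcePull(): Promise<void>;
}

async function startWithSelfHeal(deps: SpawnDeps): Promise<string> {
  try {
    return await deps.loadAndGate();
  } catch (err) {
    // Only a gate failure triggers the self-heal; other errors propagate.
    if (!(err instanceof HostManifestGateError)) throw err;
    try {
      await deps.forcePull();
    } catch {
      // Offline-safe: the cached copy is untouched and the original
      // gate error is what the caller sees.
      throw err;
    }
    // Exactly one retry. A second gate failure propagates as-is.
    return await deps.loadAndGate();
  }
}
```

Typing the error is what keeps the retry narrow: a spawn crash or a network failure never triggers a re-pull, only a manifest that the gate rejected.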

App version is an org concern (not per-workspace)
Because the cache is shared platform-wide, a bundle's version is org-global — changing it changes the version every workspace gets on its next respawn. So version management lives at the org level:

- New org-scoped manage_apps tool (org_admin gated): list (registry apps across the org, deduped by bundle name), check_updates, and upgrade.
- BundleLifecycleManager.upgradeApp force-pulls the shared cache once, then re-spawns every workspace's instance from it, keeping the version consistent platform-wide. (Looping the per-workspace path would no-op after the first workspace.) A spawn failure on any workspace transitions that instance to dead (no torn "running-without-source" state) and is reported per-workspace without aborting the rest.
- /org/about renders a purpose-built OrgAboutTab: platform info (role-exempt) + an org-admin Installed Apps section with per-app update detection and an Update button. The dead workspace-scoped AboutTab is removed.
- The per-workspace Connectors page keeps install / connect / auth / configure (ws_admin); stdio bundles show a muted version under the title.
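The "pull once, then re-spawn everywhere" shape described for upgradeApp might look roughly like this. All names here (`WorkspaceInstance`, `UpgradeResult`, `upgradeAppSketch`) are hypothetical; the real method lives on BundleLifecycleManager and takes different arguments.

```typescript
type InstanceState = "running" | "dead";

interface WorkspaceInstance {
  wsId: string;
  state: InstanceState;
}

interface UpgradeResult {
  wsId: string;
  ok: boolean;
  error?: string;
}

async function upgradeAppSketch(
  instances: WorkspaceInstance[],
  forcePull: () => Promise<void>,
  respawn: (inst: WorkspaceInstance) => Promise<void>,
): Promise<UpgradeResult[]> {
  // The cache is shared, so one pull updates the bytes for every workspace.
  await forcePull();

  const results: UpgradeResult[] = [];
  for (const inst of instances) {
    try {
      await respawn(inst);
      inst.state = "running";
      results.push({ wsId: inst.wsId, ok: true });
    } catch (err) {
      // A failed spawn marks only this instance dead; the loop continues.
      inst.state = "dead";
      const message = err instanceof Error ? err.message : String(err);
      results.push({ wsId: inst.wsId, ok: false, error: message });
    }
  }
  return results;
}
```

Collecting per-workspace results instead of throwing on the first failure is what lets one broken workspace be reported without blocking the rollout to the others.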

Install adopts the org's current version (no force-pull on install). Installing an app in a workspace uses whatever version the org already has cached; a first-ever install cold-downloads the current release via prepareServer. We deliberately do not force-pull on install: the cache is org-shared, so doing so would let a ws_admin silently bump every workspace's version — bypassing the org_admin upgrade gate. manage_apps.upgrade (org_admin) is the sole intentional version-bump lever; it's a coordinated, immediate, all-workspace rollout, not exclusive ownership of the cached bytes (a gate-failing cached version still self-heals on spawn for anyone). Absent an explicit upgrade, workspaces converge to the org's cached version lazily on respawn.
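As a rough illustration of "adopt, don't bump", the install decision reduces to a cache lookup with a cold-download fallback. The function `resolveInstallVersion`, the `Map`-based cache, and the `downloadLatest` callback are assumptions made for this sketch; in the PR the cold download goes through prepareServer.

```typescript
// Hypothetical model: the shared cache maps bundle name to cached version.
async function resolveInstallVersion(
  bundleName: string,
  sharedCache: Map<string, string>,
  downloadLatest: (name: string) => Promise<string>,
): Promise<string> {
  const cached = sharedCache.get(bundleName);
  if (cached !== undefined) {
    // Adopt whatever the org already has. No force-pull here, so a
    // workspace-level install cannot change the version for everyone.
    return cached;
  }
  // First-ever install anywhere in the org: cold-download the current release.
  const latest = await downloadLatest(bundleName);
  sharedCache.set(bundleName, latest);
  return latest;
}
```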

Rebase notes (#168 was ~3 weeks stale)

- manage_app was removed → install/uninstall moved to workspace-scoped manage_connectors; app version management is the new org-scoped manage_apps.
- The About page (#168's UI target) was a deprecated fallback → replaced by OrgAboutTab at /org/about.
- The upgrade mechanism is corrected for canonical reverse-DNS serverNames and threads WorkspaceContext + bundleMcp into startBundleSource (which #168 omitted, breaking host-resources + credential resolution on a hot-swap).

Tests

- Unit: HostManifestGateError typing + reason; installSource derivation for {name}/{path}/{url} refs; upgradeApp guards (not-installed, non-registry excluded) + no-op when no update is resolvable.
- Integration: manage_apps org_admin gating (non-admin → permission_denied), org-wide aggregation deduped by bundle name (excludes local/remote), check_updates, and upgrade no-op.
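The aggregation these tests exercise can be sketched as below. The type names, the `listRegistryApps` function, and the mapping of `{name}`/`{path}`/`{url}` refs to registry/local/remote are inferred from the test descriptions above, not taken from the code.

```typescript
type BundleRef = { name: string } | { path: string } | { url: string };
type InstallSource = "registry" | "local" | "remote";

// Assumed derivation: the ref's shape decides where the bundle came from.
function installSourceOf(ref: BundleRef): InstallSource {
  if ("name" in ref) return "registry";
  if ("path" in ref) return "local";
  return "remote";
}

interface InstalledBundle {
  wsId: string;
  ref: BundleRef;
}

interface OrgApp {
  bundleName: string;
  workspaces: string[];
}

// One row per registry bundle name, however many workspaces installed it.
function listRegistryApps(installed: InstalledBundle[]): OrgApp[] {
  const byName = new Map<string, string[]>();
  for (const inst of installed) {
    const ref = inst.ref;
    if (!("name" in ref)) continue; // local and remote bundles are excluded
    const workspaces = byName.get(ref.name) ?? [];
    workspaces.push(inst.wsId);
    byName.set(ref.name, workspaces);
  }
  return Array.from(byName, ([bundleName, workspaces]) => ({ bundleName, workspaces }));
}
```

Deduplicating by bundle name follows from the shared cache: two workspaces running the same registry bundle share one cached version, so they form one upgradable unit.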

bun run verify passes locally (format, lint, tsc strict, cycles, codegen, unit + web + integration + smoke). Manually verified end-to-end against dev:worktree: /org/about detects an available update, Upgrade force-pulls once and re-spawns across workspaces, and the per-workspace manage_connectors upgrade action is gone.

Follow-up

The bundle hot-swap leaves a sub-second window where the source's tools are absent; the upgrade's data.changed can transiently error a concurrent engine run (e.g. the home briefing), which self-heals on the next run. Tracked in #293. (The misleading [engine] error: undefined log it surfaced through is fixed in this PR.)

Revives #168 on current main and closes the two force-refresh gaps it left open. The mpak cache dir is keyed by name only (no version), so a pod that cached a bad version of a bundle re-spawns it on every boot. If that manifest fails the host-manifest gate, the bundle is rejected forever, even after a fixed version is published (the manual workaround was deleting the cache dir on the pod and restarting).

- Boot/re-spawn self-heal: a typed HostManifestGateError lets startBundleSource force one re-pull from the registry when a cached manifest fails the gate, before giving up, so a published fix now self-heals on restart. Covers every named re-spawn (boot reload, JIT install, configure-restart). Offline-safe: a failed re-pull leaves the cached copy intact and surfaces the original gate error.
- User install force-refresh: handleInstallMpak force-pulls the latest version when a (stale) copy is already cached, so a reinstall gets the current release instead of silently reusing a months-old cache.
- upgrade(): hot-swap a registry bundle to the latest version (force pull, re-spawn preserving workspace data + credentials, re-register placements and re-sync automations, emit bundle.upgraded). Surfaced as manage_connectors actions `upgrade` and `check_updates`.
- Connectors detail page: role-gated Update button + available-update detection.

Ported from #168's manage_app / About-page surface (both since removed or deprecated on main) to manage_connectors + the workspace-scoped Connectors page, and corrected for canonical serverNames, WorkspaceContext, and bundleMcp deps.

App version is an org-global concern: the mpak cache is keyed by bundle name only (no version) and shared across every workspace on the platform, so one force-pull changes the version every workspace gets on its next respawn. Presenting "upgrade" as a per-workspace ws_admin action (as the initial revival did) was misleading. Move version management to the org.

- New org-scoped `manage_apps` tool (org_admin gated, like manage_users): `list` (registry apps across the org, deduped by bundle name since the version is shared), `check_updates`, and `upgrade` (org-wide).
- `BundleLifecycleManager.upgradeApp(bundleName, getRegistry)`: force-pull the shared cache ONCE, then re-spawn every workspace's instance from it, keeping the running version consistent platform-wide. Looping the per-workspace path would no-op after the first workspace (cache already latest). Extracted the re-spawn body into `respawnInstanceToCachedVersion`; removed the per-workspace `upgrade()`.
- Removed the `upgrade` / `check_updates` actions from the per-workspace `manage_connectors` tool (moved to `manage_apps`).
- Web: `/org/about` now renders a purpose-built `OrgAboutTab`: platform info (role-exempt) plus an org-admin Installed Apps section with per-app update detection and an Update button. Deleted the dead workspace-scoped `AboutTab`. The Connectors detail page drops the Update button and shows a read-only "version managed in Org → About" pointer for registry apps. Surfaced `installSource` on the connector payload so the pointer targets registry apps only.

The durable-cure mechanism from the initial revival (boot self-heal via HostManifestGateError, install force-refresh) is unchanged; only the upgrade surface and authorization move from workspace/ws_admin to org/org_admin.

…line The "app version is managed in Org → About" line on the connector detail page read oddly as a standalone banner. Replace it with a muted version number under the connector title in the hero (stdio bundles only — remote OAuth connectors have no meaningful bundle version). Drop the now-unused `installSource` field from the connector list/get payload (it was added solely for that pointer).

…ries `run.error` is overloaded — McpSource lifecycle events (source.crashed / source.restarted / source.restart_failed) reuse the type with an `event` discriminator. `source.restarted` is a successful crash-recovery and carries no `error` field, so console-events printed a misleading "[engine] error: undefined". Render it as "[engine] source restarted: <name>" instead, and fall back to the event discriminator (then "unknown error") so a bare `undefined` never reaches the log. Surfaced while QAing the bundle hot-swap respawn; the underlying swap-window race is tracked in #293.
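A minimal sketch of the rendering rule described in this commit. The event shape (`name` and `error` as optional string fields) and the function name `formatRunError` are assumptions; only the event names and the output strings come from the commit message.

```typescript
interface RunErrorEvent {
  type: "run.error";
  // Discriminator used by McpSource lifecycle events that reuse this type.
  event?: "source.crashed" | "source.restarted" | "source.restart_failed";
  name?: string;  // assumed field carrying the source name
  error?: string; // absent on a successful restart
}

function formatRunError(e: RunErrorEvent): string {
  if (e.event === "source.restarted") {
    // A successful crash recovery is not an error; say what happened.
    return `[engine] source restarted: ${e.name ?? "unknown"}`;
  }
  // Fall back to the discriminator, then to a fixed string, so a bare
  // "undefined" never reaches the log.
  return `[engine] error: ${e.error ?? e.event ?? "unknown error"}`;
}
```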

Critical:

- Remove the install-time force-pull from handleInstallMpak. The shared, name-keyed mpak cache means a force-pull there let a ws_admin silently bump every workspace's version on next respawn, bypassing the org_admin upgrade gate this PR is built around (the asymmetric-defense shape QA flagged). Under the org-version model, an install now ADOPTS the org's current cached version (first-ever install cold-downloads via prepareServer); org_admin manage_apps.upgrade is the sole intentional version-bump lever. The original "stuck on a bad version" incident stays cured without it: a gate-failing cached manifest still self-heals on spawn via startBundleSource's force-repull (on the install path too).
- upgradeApp/respawnInstanceToCachedVersion: a startBundleSource THROW (bad spawn, prepareServer error, terminal gate) after removeSource left the instance "running" with no live source (torn state; only the null-manifest branch transitioned to dead). Wrap the spawn so any failure transitions the instance to "dead" before rethrowing.

Polish:

- Fix upgradesInFlight doc comment (keyed by bundleName, not serverName|wsId).
- Fix stale "About tab" reference in handleListInstalled comment.
- Rename the no-op upgrade tests to reflect they exercise the no-update-resolvable (registry-unreachable) path, not a confirmed already-at-latest path.
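The torn-state fix amounts to wrapping the spawn in a try/catch that settles the instance state before rethrowing. The sketch below uses hypothetical names (`TrackedInstance`, `respawnSketch`) and a boolean in place of the real source registry.

```typescript
type LifecycleState = "running" | "dead";

interface TrackedInstance {
  state: LifecycleState;
  hasLiveSource: boolean; // stand-in for "a source is registered for this instance"
}

async function respawnSketch(
  inst: TrackedInstance,
  spawn: () => Promise<void>,
): Promise<void> {
  // The old source is removed first (removeSource in the PR).
  inst.hasLiveSource = false;
  try {
    await spawn();
    inst.hasLiveSource = true;
  } catch (err) {
    // Any spawn failure settles the state before the error propagates,
    // so the instance is never "running" without a live source.
    inst.state = "dead";
    throw err;
  }
}
```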

…y mpakHome

- respawnInstanceToCachedVersion: a failed re-spawn left the old source's automations scheduled against a removed source (they error when they fire, until the next boot reload re-syncs). Extract a `failRespawn` helper that transitions the instance to "dead" AND drops its automations (best-effort), called on both failure branches, matching uninstall's unconditional cleanup.
- Unify the mpak cache path. `getWorkDir()` is not pre-resolved, so callers hand-building `join(getWorkDir(), "apps")` keyed getMpak()'s singleton on a relative string under a relative dev workDir, diverging from the lifecycle's resolved mpakHome and thrashing the singleton. Add `Runtime.getMpakHome()` (resolved, absolute) and use it from app-tools + connector-tools; add it to the stub Runtimes in the affected tests.

QA suggestion #2 (a "mixed versions" hint during lazy convergence) is the documented eventual-consistency model working as designed; deferred, tracked separately.
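Why a relative path thrashes a path-keyed singleton can be shown with a toy model. The single-slot behaviour of `countBuilds` is an assumption about how getMpak() caches its client, and the workDir value is made up; only `join(getWorkDir(), "apps")` comes from the commit message.

```typescript
import { isAbsolute, join, resolve } from "node:path";

// Toy single-slot singleton: rebuilt whenever the requested home differs
// from the one it was last built with. Returns how many builds occurred.
function countBuilds(requestedHomes: string[]): number {
  let currentHome: string | undefined;
  let builds = 0;
  for (const home of requestedHomes) {
    if (home !== currentHome) {
      currentHome = home;
      builds++;
    }
  }
  return builds;
}

const workDir = ".dev-workdir"; // hypothetical relative dev workDir
const handBuilt = join(workDir, "apps");    // stays relative
const resolvedHome = resolve(workDir, "apps"); // absolute, based on the cwd

// Mixed callers: the tool code uses the relative string while the
// lifecycle uses the resolved one, so every alternation rebuilds.
const mixedBuilds = countBuilds([handBuilt, resolvedHome, handBuilt, resolvedHome]);

// Unified callers: everyone asks for the same resolved path.
const unifiedBuilds = countBuilds([resolvedHome, resolvedHome, resolvedHome, resolvedHome]);

console.log({ mixedBuilds, unifiedBuilds, absolute: isAbsolute(resolvedHome) });
```

Routing every caller through one accessor that returns the resolved path removes the mismatch at its source, which is the role `Runtime.getMpakHome()` plays in the commit.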

The upgrade flow and update UI in this PR originate from #168. Co-authored-by: Mason <31372737+Ovaculos@users.noreply.github.com>

mgoldsborough added 3 commits May 26, 2026 18:06

mgoldsborough mentioned this pull request May 27, 2026

Bundle hot-swap can transiently error a concurrent engine run (briefing) during the source-swap window #293

Open

mgoldsborough added 2 commits May 26, 2026 21:57

mgoldsborough added the qa-reviewed label (QA review completed with no critical issues) May 27, 2026

mgoldsborough mentioned this pull request May 27, 2026

Org Apps view: surface mixed running versions during lazy convergence #295

Open

chore: credit original author of the bundle upgrade flow

9f71141


This was referenced May 27, 2026

feat: bundle upgrade flow with update UI #168

Closed

test(web): fix flaky CI from api/client mock.module pollution #297

Merged

mgoldsborough merged commit 69a788e into main May 27, 2026
7 of 8 checks passed

mgoldsborough deleted the feat/bundle-upgrade-flow-v2 branch May 27, 2026 09:15

mgoldsborough merged 7 commits into main from feat/bundle-upgrade-flow-v2
