feat(bundles): bundle upgrade flow + durable cache-staleness self-heal #291
Merged
Revives #168 on current main and closes the two force-refresh gaps it left open. The mpak cache dir is keyed by name only (no version), so a pod that cached a bad version of a bundle re-spawns it on every boot. If that manifest fails the host-manifest gate, the bundle is rejected forever, even after a fixed version is published (the manual workaround was deleting the cache dir on the pod and restarting).

- Boot/re-spawn self-heal: a typed HostManifestGateError lets startBundleSource force one re-pull from the registry when a cached manifest fails the gate, before giving up, so a published fix now self-heals on restart. Covers every named re-spawn (boot reload, JIT install, configure-restart). Offline-safe: a failed re-pull leaves the cached copy intact and surfaces the original gate error.
- User install force-refresh: handleInstallMpak force-pulls the latest version when a (stale) copy is already cached, so a reinstall gets the current release instead of silently reusing a months-old cache.
- upgrade(): hot-swap a registry bundle to the latest version (force pull, re-spawn preserving workspace data + credentials, re-register placements and re-sync automations, emit bundle.upgraded). Surfaced as manage_connectors actions `upgrade` and `check_updates`.
- Connectors detail page: role-gated Update button + available-update detection.

Ported from #168's manage_app / About-page surface (both since removed or deprecated on main) to manage_connectors + the workspace-scoped Connectors page, and corrected for canonical serverNames, WorkspaceContext, and bundleMcp deps.
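The boot/re-spawn self-heal described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only `HostManifestGateError` and `startBundleSource` are names from the commit message, while `BundleDeps`, `loadCached`, `forcePull`, `gate`, and the `Manifest` shape are hypothetical, and the real functions are presumably async.

```typescript
// Hypothetical sketch of "force one re-pull when the cached manifest fails the gate".
class HostManifestGateError extends Error {
  readonly reason: string;
  constructor(reason: string) {
    super(`Refusing to install: ${reason}`);
    this.name = "HostManifestGateError";
    this.reason = reason;
    // Keeps instanceof working even when compiled to older JS targets.
    Object.setPrototypeOf(this, HostManifestGateError.prototype);
  }
}

// Assumed manifest shape, for illustration only.
interface Manifest {
  name: string;
  version: string;
  valid: boolean;
}

// Assumed dependencies, injected so the flow is easy to follow and test.
interface BundleDeps {
  loadCached(name: string): Manifest; // reads the name-keyed cache
  forcePull(name: string): Manifest; // re-downloads the latest release; may throw offline
  gate(manifest: Manifest): void; // throws HostManifestGateError on a violation
}

function startBundleSource(name: string, deps: BundleDeps): Manifest {
  const cached = deps.loadCached(name);
  try {
    deps.gate(cached);
    return cached;
  } catch (err) {
    // Only a typed gate failure triggers the self-heal path.
    if (!(err instanceof HostManifestGateError)) throw err;
    let fresh: Manifest;
    try {
      fresh = deps.forcePull(name); // exactly one re-pull
    } catch {
      throw err; // offline-safe: surface the original gate error
    }
    deps.gate(fresh); // a second failure is terminal
    return fresh;
  }
}
```

The typed error is what makes the retry safe in this sketch: any other failure propagates untouched, so only a gate rejection costs a registry round-trip.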
App version is an org-global concern: the mpak cache is keyed by bundle name only (no version) and shared across every workspace on the platform, so one force-pull changes the version every workspace gets on its next respawn. Presenting "upgrade" as a per-workspace ws_admin action (as the initial revival did) was misleading. Move version management to the org.

- New org-scoped `manage_apps` tool (org_admin gated, like manage_users): `list` (registry apps across the org, deduped by bundle name since the version is shared), `check_updates`, and `upgrade` (org-wide).
- `BundleLifecycleManager.upgradeApp(bundleName, getRegistry)`: force-pull the shared cache ONCE, then re-spawn every workspace's instance from it, keeping the running version consistent platform-wide. Looping the per-workspace path would no-op after the first workspace (cache already latest). Extracted the re-spawn body into `respawnInstanceToCachedVersion`; removed the per-workspace `upgrade()`.
- Removed the `upgrade` / `check_updates` actions from the per-workspace `manage_connectors` tool (moved to `manage_apps`).
- Web: `/org/about` now renders a purpose-built `OrgAboutTab`: platform info (role-exempt) plus an org-admin Installed Apps section with per-app update detection and an Update button. Deleted the dead workspace-scoped `AboutTab`. The Connectors detail page drops the Update button and shows a read-only "version managed in Org → About" pointer for registry apps. Surfaced `installSource` on the connector payload so the pointer targets registry apps only.

The durable-cure mechanism from the initial revival (boot self-heal via HostManifestGateError, install force-refresh) is unchanged; only the upgrade surface and authorization move from workspace/ws_admin to org/org_admin.
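A rough sketch of the pull-once, respawn-everywhere shape of `upgradeApp` follows. Everything except the name `upgradeApp` is an assumption made for illustration: the real method takes `(bundleName, getRegistry)` and lives on `BundleLifecycleManager`, whereas this version takes injected callbacks and a plain instance array, and is synchronous for brevity.

```typescript
// Hypothetical types; the PR's real instance and result shapes are not shown in the source.
interface WorkspaceInstance {
  workspaceId: string;
  bundleName: string;
}

interface UpgradeResult {
  workspaceId: string;
  ok: boolean;
  error?: string;
}

function upgradeApp(
  bundleName: string,
  instances: WorkspaceInstance[],
  forcePullSharedCache: (bundleName: string) => string, // returns the version now cached
  respawn: (instance: WorkspaceInstance, version: string) => void,
): UpgradeResult[] {
  // Pull exactly once: the cache is shared, so repeating this per workspace
  // would find the cache already at latest and do nothing.
  const version = forcePullSharedCache(bundleName);

  const results: UpgradeResult[] = [];
  for (const instance of instances) {
    if (instance.bundleName !== bundleName) continue;
    try {
      respawn(instance, version);
      results.push({ workspaceId: instance.workspaceId, ok: true });
    } catch (err) {
      // Report the failure for this workspace and keep going with the rest.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ workspaceId: instance.workspaceId, ok: false, error: message });
    }
  }
  return results;
}
```

Separating the single pull from the per-workspace loop is the point of the design: the version decision happens once, and each workspace only has to re-spawn from what is already on disk.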
…line The "app version is managed in Org → About" line on the connector detail page read oddly as a standalone banner. Replace it with a muted version number under the connector title in the hero (stdio bundles only — remote OAuth connectors have no meaningful bundle version). Drop the now-unused `installSource` field from the connector list/get payload (it was added solely for that pointer).
…ries `run.error` is overloaded — McpSource lifecycle events (source.crashed / source.restarted / source.restart_failed) reuse the type with an `event` discriminator. `source.restarted` is a successful crash-recovery and carries no `error` field, so console-events printed a misleading "[engine] error: undefined". Render it as "[engine] source restarted: <name>" instead, and fall back to the event discriminator (then "unknown error") so a bare `undefined` never reaches the log. Surfaced while QAing the bundle hot-swap respawn; the underlying swap-window race is tracked in #293.
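The rendering rule described above might look like the sketch below. The output strings follow the commit message; the event payload shape (`event`, `name`, `error` as optional strings) and the function name `formatRunError` are assumptions, since the actual console-events code is not shown here.

```typescript
// Assumed payload shape for the overloaded run.error event.
interface RunErrorEvent {
  type: "run.error";
  event?: string; // lifecycle discriminator, e.g. "source.restarted"
  name?: string; // assumed field carrying the source name
  error?: string; // absent on a successful crash-recovery
}

function formatRunError(e: RunErrorEvent): string {
  // A successful restart is not an error, so it gets its own line.
  if (e.event === "source.restarted") {
    return `[engine] source restarted: ${e.name ?? "unknown"}`;
  }
  // Fall back to the discriminator, then to a fixed string, so that
  // a bare `undefined` never reaches the log.
  return `[engine] error: ${e.error ?? e.event ?? "unknown error"}`;
}
```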
Critical:

- Remove the install-time force-pull from handleInstallMpak. The shared, name-keyed mpak cache means a force-pull there let a ws_admin silently bump every workspace's version on next respawn, bypassing the org_admin upgrade gate this PR is built around (the asymmetric-defense shape QA flagged). Under the org-version model, an install now ADOPTS the org's current cached version (first-ever install cold-downloads via prepareServer); org_admin manage_apps.upgrade is the sole intentional version-bump lever. The original "stuck on a bad version" incident stays cured without it: a gate-failing cached manifest still self-heals on spawn via startBundleSource's force-repull (on the install path too).
- upgradeApp/respawnInstanceToCachedVersion: a startBundleSource THROW (bad spawn, prepareServer error, terminal gate) after removeSource left the instance "running" with no live source (torn state; only the null-manifest branch transitioned to dead). Wrap the spawn so any failure transitions the instance to "dead" before rethrowing.

Polish:

- Fix upgradesInFlight doc comment (keyed by bundleName, not serverName|wsId).
- Fix stale "About tab" reference in handleListInstalled comment.
- Rename the no-op upgrade tests to reflect they exercise the no-update-resolvable (registry-unreachable) path, not a confirmed already-at-latest path.
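The torn-state fix can be illustrated with a small sketch. The function name comes from the commit message, but its signature, the `BundleInstance` shape, and the injected `removeSource` / `startBundleSource` callbacks are hypothetical stand-ins for the real lifecycle code.

```typescript
// Hypothetical instance shape; only the "running" and "dead" states are named in the source.
type InstanceState = "running" | "dead";

interface BundleInstance {
  serverName: string;
  state: InstanceState;
}

function respawnInstanceToCachedVersion(
  instance: BundleInstance,
  removeSource: (instance: BundleInstance) => void,
  startBundleSource: (instance: BundleInstance) => void,
): void {
  // After this call the old source is gone, so a failed spawn must not
  // leave the instance claiming to be "running".
  removeSource(instance);
  try {
    startBundleSource(instance);
  } catch (err) {
    instance.state = "dead"; // record the truth before propagating
    throw err;
  }
}
```

Wrapping the whole spawn, rather than handling one known failure branch, is what closes the gap: bad spawns, `prepareServer` errors, and terminal gate failures all take the same exit.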
…y mpakHome

- respawnInstanceToCachedVersion: a failed re-spawn left the old source's automations scheduled against a removed source (they error when they fire, until the next boot reload re-syncs). Extract a `failRespawn` helper that transitions the instance to "dead" AND drops its automations (best-effort), called on both failure branches, matching uninstall's unconditional cleanup.
- Unify the mpak cache path. `getWorkDir()` is not pre-resolved, so callers hand-building `join(getWorkDir(), "apps")` keyed getMpak()'s singleton on a relative string under a relative dev workDir, diverging from the lifecycle's resolved mpakHome and thrashing the singleton. Add `Runtime.getMpakHome()` (resolved, absolute) and use it from app-tools + connector-tools; add it to the stub Runtimes in the affected tests.

QA suggestion #2 (a "mixed versions" hint during lazy convergence) is the documented eventual-consistency model working as designed; it is deferred and tracked separately.
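To see why a relative key thrashes a path-keyed singleton, consider the sketch below. `MpakSingleton` is a hypothetical stand-in for whatever `getMpak()` does internally, and the assumption that the resolved home is `resolve(workDir, "apps")` is inferred from the `join(getWorkDir(), "apps")` call sites rather than confirmed by the source.

```typescript
import { join, resolve } from "node:path";

// Hypothetical stand-in for a client cached under the path it was created with.
class MpakSingleton {
  private key: string | undefined;
  constructions = 0;

  get(home: string): string {
    if (this.key !== home) {
      // A different key string rebuilds the client, even if both
      // strings point at the same directory on disk.
      this.key = home;
      this.constructions++;
    }
    return this.key;
  }
}

// Assumed equivalent of Runtime.getMpakHome(): always absolute, so every
// caller produces the same key regardless of how workDir was spelled.
function getMpakHome(workDir: string): string {
  return resolve(workDir, "apps");
}
```

With hand-built keys, a caller passing `join("dev-work", "apps")` and a caller passing the resolved path alternate between two keys and rebuild the client on every switch; routing both through one resolving accessor leaves a single key.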
The upgrade flow and update UI in this PR originate from #168.

Co-authored-by: Mason <31372737+Ovaculos@users.noreply.github.com>
This was referenced May 27, 2026
Summary
Revives #168 (bundle upgrade flow, original work by @Ovaculos) on current `main`, cures the cache-staleness incident it left open, and homes app version management at the org level where the shared-cache architecture puts it.

The incident this cures
The mpak bundle cache dir is keyed by name only (`<mpakHome>/cache/<scope>-<name>`, no version) and is shared across every workspace on the platform. Once a pod caches version X, later boots reuse X and never pull a newer published version. When X's manifest violates the host-manifest schema, the install gate throws "Refusing to install" forever, even after a fixed version ships. The only workaround was deleting the cache dir on the pod and restarting.

What changed
Durable cure — boot / re-spawn self-heal
A typed `HostManifestGateError` lets `startBundleSource` force one re-pull from the registry when a cached manifest fails the host-manifest gate, before giving up, so a published fix self-heals on the next spawn. This is the single chokepoint for every named re-spawn (boot reload, JIT install, configure-restart), so it cures the incident on the install path too. Offline-safe: a failed re-pull keeps the cached copy and surfaces the original gate error.

App version is an org concern (not per-workspace)
Because the cache is shared platform-wide, a bundle's version is org-global: changing it changes the version every workspace gets on its next respawn. Version management therefore lives at the org level:
- `manage_apps` tool (org_admin gated): `list` (registry apps across the org, deduped by bundle name), `check_updates`, and `upgrade`.
- `BundleLifecycleManager.upgradeApp` force-pulls the shared cache once, then re-spawns every workspace's instance from it, giving a consistent version platform-wide. (Looping the per-workspace path would no-op after the first workspace.) A spawn failure on any workspace transitions that instance to `dead` (no torn "running-without-source" state) and is reported per-workspace without aborting the rest.
- `/org/about` renders a purpose-built `OrgAboutTab`: platform info (role-exempt) + an org-admin Installed Apps section with per-app update detection and an Update button. The dead workspace-scoped `AboutTab` is removed.

Install adopts the org's current version (no force-pull on install). Installing an app in a workspace uses whatever version the org already has cached; a first-ever install cold-downloads the current release via `prepareServer`. We deliberately do not force-pull on install: the cache is org-shared, so doing so would let a ws_admin silently bump every workspace's version, bypassing the org_admin upgrade gate. `manage_apps.upgrade` (org_admin) is the sole intentional version-bump lever; it's a coordinated, immediate, all-workspace rollout, not exclusive ownership of the cached bytes (a gate-failing cached version still self-heals on spawn for anyone). Absent an explicit upgrade, workspaces converge to the org's cached version lazily on respawn.

Rebase notes (#168 was ~3 weeks stale)
- `manage_app` was removed → install/uninstall moved to workspace-scoped `manage_connectors`; app version management is the new org-scoped `manage_apps`.
- `OrgAboutTab` at `/org/about`.
- `WorkspaceContext` + `bundleMcp` into `startBundleSource` (which #168 omitted, breaking host-resources + credential resolution on a hot-swap).

Tests
- `HostManifestGateError` typing + `reason`; `installSource` derivation for `{name}` / `{path}` / `{url}` refs; `upgradeApp` guards (not-installed, non-registry excluded) + no-op when no update is resolvable.
- `manage_apps` org_admin gating (non-admin → permission_denied), org-wide aggregation deduped by bundle name (excludes local/remote), check_updates, and upgrade no-op.
- `bun run verify` passes locally (format, lint, tsc strict, cycles, codegen, unit + web + integration + smoke). Manually verified end-to-end against `dev:worktree`: `/org/about` detects an available update, Upgrade force-pulls once and re-spawns across workspaces, and the per-workspace `manage_connectors upgrade` action is gone.

Follow-up
The bundle hot-swap leaves a sub-second window where the source's tools are absent; the upgrade's `data.changed` can transiently error a concurrent engine run (e.g. the home briefing), which self-heals on the next run. Tracked in #293. (The misleading `[engine] error: undefined` log it surfaced through is fixed in this PR.)