
feat(bundles): bundle upgrade flow + durable cache-staleness self-heal #291

Merged
mgoldsborough merged 7 commits into main from feat/bundle-upgrade-flow-v2
May 27, 2026

Conversation

@mgoldsborough
Contributor

@mgoldsborough mgoldsborough commented May 27, 2026

Summary

Revives #168 (bundle upgrade flow, original work by @Ovaculos) on current main, cures the cache-staleness incident it left open, and homes app version management at the org level where the shared-cache architecture puts it.

The incident this cures

The mpak bundle cache dir is keyed by name only (<mpakHome>/cache/<scope>-<name>, no version) and is shared across every workspace on the platform. Once a pod caches version X, later boots reuse X and never pull a newer published version. When X's manifest violates the host-manifest schema, the install gate throws "Refusing to install" forever — even after a fixed version ships. The only workaround was deleting the cache dir on the pod and restarting.
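To make the failure mode concrete, here is a minimal sketch of a name-only cache key. The `cacheDirFor` helper, the `@scope/name` naming style, and the `/var/mpak` home are assumptions for illustration, not the actual mpak path logic.

```typescript
// Hypothetical helper: derives the cache directory from the bundle name only,
// mirroring the <mpakHome>/cache/<scope>-<name> layout described above.
function cacheDirFor(mpakHome: string, bundleName: string): string {
  // Assumption: bundle names look like "@scope/name".
  const flat = bundleName.replace(/^@/, "").replace("/", "-");
  return `${mpakHome}/cache/${flat}`;
}

// The version never enters the key, so two versions share one directory.
const requested = [
  { name: "@acme/notes", version: "1.0.0" },
  { name: "@acme/notes", version: "1.0.1" },
];
const dirs = requested.map((r) => cacheDirFor("/var/mpak", r.name));
console.log(dirs[0] === dirs[1]); // true
```

Because both requests resolve to the same directory, whichever version was cached first keeps being served until something forces a re-pull.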

What changed

Durable cure — boot / re-spawn self-heal
A typed HostManifestGateError lets startBundleSource force one re-pull from the registry when a cached manifest fails the host-manifest gate, before giving up — a published fix self-heals on the next spawn. This is the single chokepoint for every named re-spawn (boot reload, JIT install, configure-restart), so it cures the incident on the install path too. Offline-safe: a failed re-pull keeps the cached copy and surfaces the original gate error.
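The retry-once behaviour can be sketched roughly as follows. Only the `HostManifestGateError` name comes from this PR; the `Manifest` shape, the injected `loadCached` / `forcePull` dependencies, and the `startBundleSourceSketch` function are hypothetical, and the sketch is synchronous for brevity where the real code is presumably async.

```typescript
// Typed error so callers can tell a gate rejection from any other failure.
class HostManifestGateError extends Error {
  readonly reason: string;
  constructor(reason: string) {
    super(`Refusing to install: ${reason}`);
    this.name = "HostManifestGateError";
    this.reason = reason;
  }
}

// Hypothetical manifest shape; `valid` stands in for the real schema check.
interface Manifest { version: string; valid: boolean }

// Hypothetical dependencies, injected so the sketch is self-contained.
interface Deps {
  loadCached(): Manifest;
  forcePull(): Manifest; // throws when the registry is unreachable
}

function checkGate(m: Manifest): void {
  if (!m.valid) {
    throw new HostManifestGateError(`manifest ${m.version} violates the schema`);
  }
}

function startBundleSourceSketch(deps: Deps): Manifest {
  const cached = deps.loadCached();
  try {
    checkGate(cached);
    return cached;
  } catch (err) {
    // Only a gate failure triggers the self-heal; anything else propagates.
    if (!(err instanceof HostManifestGateError)) throw err;
    let fresh: Manifest;
    try {
      fresh = deps.forcePull(); // exactly one re-pull
    } catch {
      throw err; // offline: keep the cached copy, surface the original gate error
    }
    checkGate(fresh); // a still-failing manifest is terminal
    return fresh;
  }
}
```

Catching only the typed error keeps unrelated failures (bad spawn, I/O errors) from triggering a registry round-trip.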

App version is an org concern (not per-workspace)
Because the cache is shared platform-wide, a bundle's version is org-global — changing it changes the version every workspace gets on its next respawn. So version management lives at the org level:

  • New org-scoped manage_apps tool (org_admin gated): list (registry apps across the org, deduped by bundle name), check_updates, and upgrade.
  • BundleLifecycleManager.upgradeApp force-pulls the shared cache once, then re-spawns every workspace's instance from it — consistent version platform-wide. (Looping the per-workspace path would no-op after the first workspace.) A spawn failure on any workspace transitions that instance to dead (no torn "running-without-source" state) and is reported per-workspace without aborting the rest.
  • /org/about renders a purpose-built OrgAboutTab: platform info (role-exempt) + an org-admin Installed Apps section with per-app update detection and an Update button. The dead workspace-scoped AboutTab is removed.
  • The per-workspace Connectors page keeps install / connect / auth / configure (ws_admin); stdio bundles show a muted version under the title.
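The org-wide upgrade described in the list above (pull once, re-spawn everywhere, report failures per workspace) might look like this in outline. The `Instance` and `UpgradeResult` shapes and the injected `forcePull` / `spawn` callbacks are assumptions for illustration; the real `upgradeApp` signature is not reproduced here, and the sketch is synchronous for brevity.

```typescript
type InstanceState = "running" | "dead";

// Hypothetical per-workspace instance record.
interface Instance { wsId: string; state: InstanceState; version: string }

interface UpgradeResult { wsId: string; ok: boolean; error?: string }

function upgradeAppSketch(
  instances: Instance[],
  forcePull: () => string, // returns the version now in the shared cache
  spawn: (wsId: string, version: string) => void, // throws on spawn failure
): UpgradeResult[] {
  const version = forcePull(); // the shared cache is refreshed exactly once
  const results: UpgradeResult[] = [];
  for (const inst of instances) {
    try {
      spawn(inst.wsId, version);
      inst.version = version;
      inst.state = "running";
      results.push({ wsId: inst.wsId, ok: true });
    } catch (err) {
      // No torn "running without a source" state: mark dead, keep going.
      inst.state = "dead";
      const message = err instanceof Error ? err.message : String(err);
      results.push({ wsId: inst.wsId, ok: false, error: message });
    }
  }
  return results;
}
```

Pulling outside the loop is the point: a per-workspace pull would find the cache already current after the first workspace and skip the re-spawn for the rest.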

Install adopts the org's current version (no force-pull on install)
Installing an app in a workspace uses whatever version the org already has cached; a first-ever install cold-downloads the current release via prepareServer. We deliberately do not force-pull on install: the cache is org-shared, so doing so would let a ws_admin silently bump every workspace's version, bypassing the org_admin upgrade gate. manage_apps.upgrade (org_admin) is the sole intentional version-bump lever. It is a coordinated, immediate, all-workspace rollout, not exclusive ownership of the cached bytes: a gate-failing cached version still self-heals on spawn for anyone. Absent an explicit upgrade, workspaces converge to the org's cached version lazily on respawn.

Rebase notes (#168 was ~3 weeks stale)

  • manage_app was removed → install/uninstall moved to workspace-scoped manage_connectors; app version management is the new org-scoped manage_apps.
  • The About page (#168's UI target) was a deprecated fallback → replaced by OrgAboutTab at /org/about.
  • The upgrade mechanism is corrected for canonical reverse-DNS serverNames and threads WorkspaceContext + bundleMcp into startBundleSource (which #168 omitted, breaking host-resources + credential resolution on a hot-swap).

Tests

  • Unit: HostManifestGateError typing + reason; installSource derivation for {name}/{path}/{url} refs; upgradeApp guards (not-installed, non-registry excluded) + no-op when no update is resolvable.
  • Integration: manage_apps org_admin gating (non-admin → permission_denied), org-wide aggregation deduped by bundle name (excludes local/remote), check_updates, and upgrade no-op.
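As a rough illustration of the aggregation these tests exercise, the sketch below derives an install source from a bundle ref and groups registry apps by bundle name. The mapping of `name` / `path` / `url` to registry / local / remote is an assumption inferred from the test descriptions, and `BundleRef`, `installSourceOf`, and `listRegistryApps` are hypothetical names.

```typescript
type InstallSource = "registry" | "local" | "remote";

// Assumed ref shape: exactly one of name / path / url is set.
interface BundleRef { name?: string; path?: string; url?: string }

// Assumed derivation: name -> registry, path -> local, url -> remote.
function installSourceOf(ref: BundleRef): InstallSource {
  if (ref.name !== undefined) return "registry";
  if (ref.path !== undefined) return "local";
  return "remote";
}

interface Installed { wsId: string; ref: BundleRef }

// Groups registry installs by bundle name; local and remote refs are skipped
// because they have no shared registry version to manage.
function listRegistryApps(installed: Installed[]): Map<string, string[]> {
  const byName = new Map<string, string[]>();
  for (const inst of installed) {
    const name = inst.ref.name;
    if (name === undefined || installSourceOf(inst.ref) !== "registry") continue;
    const workspaces = byName.get(name) ?? [];
    workspaces.push(inst.wsId);
    byName.set(name, workspaces);
  }
  return byName;
}
```

Deduping by name matches the shared-cache model: one cached version per bundle, however many workspaces run it.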

bun run verify passes locally (format, lint, tsc strict, cycles, codegen, unit + web + integration + smoke). Manually verified end-to-end against dev:worktree: /org/about detects an available update, Upgrade force-pulls once and re-spawns across workspaces, and the per-workspace manage_connectors upgrade action is gone.

Follow-up

The bundle hot-swap leaves a sub-second window where the source's tools are absent; the upgrade's data.changed can transiently error a concurrent engine run (e.g. the home briefing), which self-heals on the next run. Tracked in #293. (The misleading [engine] error: undefined log it surfaced through is fixed in this PR.)

Revives #168 on current main and closes the two force-refresh gaps it left open.

The mpak cache dir is keyed by name only (no version), so a pod that cached a
bad version of a bundle re-spawns it on every boot — and if that manifest fails
the host-manifest gate, the bundle is rejected forever even after a fixed
version is published (the manual workaround was deleting the cache dir on the
pod and restarting).

- Boot/re-spawn self-heal: a typed HostManifestGateError lets startBundleSource
  force one re-pull from the registry when a cached manifest fails the gate,
  before giving up — a published fix now self-heals on restart. Covers every
  named re-spawn (boot reload, JIT install, configure-restart). Offline-safe:
  a failed re-pull leaves the cached copy intact and surfaces the original gate
  error.
- User install force-refresh: handleInstallMpak force-pulls the latest version
  when a (stale) copy is already cached, so a reinstall gets the current release
  instead of silently reusing a months-old cache.
- upgrade(): hot-swap a registry bundle to the latest version (force pull,
  re-spawn preserving workspace data + credentials, re-register placements and
  re-sync automations, emit bundle.upgraded). Surfaced as manage_connectors
  actions `upgrade` and `check_updates`.
- Connectors detail page: role-gated Update button + available-update detection.

Ported from #168's manage_app / About-page surface (both since removed or
deprecated on main) to manage_connectors + the workspace-scoped Connectors page,
and corrected for canonical serverNames, WorkspaceContext, and bundleMcp deps.
App version is an org-global concern: the mpak cache is keyed by bundle name
only (no version) and shared across every workspace on the platform, so one
force-pull changes the version every workspace gets on its next respawn.
Presenting "upgrade" as a per-workspace ws_admin action (as the initial
revival did) was misleading. Move version management to the org.

- New org-scoped `manage_apps` tool (org_admin gated, like manage_users):
  `list` (registry apps across the org, deduped by bundle name since the
  version is shared), `check_updates`, and `upgrade` (org-wide).
- `BundleLifecycleManager.upgradeApp(bundleName, getRegistry)`: force-pull the
  shared cache ONCE, then re-spawn every workspace's instance from it — keeping
  the running version consistent platform-wide. Looping the per-workspace path
  would no-op after the first workspace (cache already latest). Extracted the
  re-spawn body into `respawnInstanceToCachedVersion`; removed the per-workspace
  `upgrade()`.
- Removed the `upgrade` / `check_updates` actions from the per-workspace
  `manage_connectors` tool (moved to `manage_apps`).
- Web: `/org/about` now renders a purpose-built `OrgAboutTab` — platform info
  (role-exempt) plus an org-admin Installed Apps section with per-app update
  detection and an Update button. Deleted the dead workspace-scoped `AboutTab`.
  The Connectors detail page drops the Update button and shows a read-only
  "version managed in Org → About" pointer for registry apps. Surfaced
  `installSource` on the connector payload so the pointer targets registry
  apps only.

The durable-cure mechanism from the initial revival (boot self-heal via
HostManifestGateError, install force-refresh) is unchanged; only the upgrade
surface and authorization move from workspace/ws_admin to org/org_admin.
…line

The "app version is managed in Org → About" line on the connector detail page
read oddly as a standalone banner. Replace it with a muted version number under
the connector title in the hero (stdio bundles only — remote OAuth connectors
have no meaningful bundle version). Drop the now-unused `installSource` field
from the connector list/get payload (it was added solely for that pointer).
…ries

`run.error` is overloaded — McpSource lifecycle events (source.crashed /
source.restarted / source.restart_failed) reuse the type with an `event`
discriminator. `source.restarted` is a successful crash-recovery and carries
no `error` field, so console-events printed a misleading "[engine] error:
undefined". Render it as "[engine] source restarted: <name>" instead, and fall
back to the event discriminator (then "unknown error") so a bare `undefined`
never reaches the log. Surfaced while QAing the bundle hot-swap respawn; the
underlying swap-window race is tracked in #293.
Critical:
- Remove the install-time force-pull from handleInstallMpak. The shared,
  name-keyed mpak cache means a force-pull there let a ws_admin silently bump
  every workspace's version on next respawn — bypassing the org_admin upgrade
  gate this PR is built around (the asymmetric-defense shape QA flagged). Under
  the org-version model, an install now ADOPTS the org's current cached version
  (first-ever install cold-downloads via prepareServer); org_admin
  manage_apps.upgrade is the sole intentional version-bump lever. The original
  "stuck on a bad version" incident stays cured without it — a gate-failing
  cached manifest still self-heals on spawn via startBundleSource's force-repull
  (on the install path too).
- upgradeApp/respawnInstanceToCachedVersion: a startBundleSource THROW (bad
  spawn, prepareServer error, terminal gate) after removeSource left the
  instance "running" with no live source (torn state — only the null-manifest
  branch transitioned to dead). Wrap the spawn so any failure transitions the
  instance to "dead" before rethrowing.

Polish:
- Fix upgradesInFlight doc comment (keyed by bundleName, not serverName|wsId).
- Fix stale "About tab" reference in handleListInstalled comment.
- Rename the no-op upgrade tests to reflect they exercise the
  no-update-resolvable (registry-unreachable) path, not a confirmed
  already-at-latest path.
@mgoldsborough mgoldsborough added the qa-reviewed label (QA review completed with no critical issues) May 27, 2026
…y mpakHome

- respawnInstanceToCachedVersion: a failed re-spawn left the old source's
  automations scheduled against a removed source (they error when they fire,
  until the next boot reload re-syncs). Extract a `failRespawn` helper that
  transitions the instance to "dead" AND drops its automations (best-effort),
  called on both failure branches — matching uninstall's unconditional cleanup.
- Unify the mpak cache path. `getWorkDir()` is not pre-resolved, so callers
  hand-building `join(getWorkDir(), "apps")` keyed getMpak()'s singleton on a
  relative string under a relative dev workDir — diverging from the lifecycle's
  resolved mpakHome and thrashing the singleton. Add `Runtime.getMpakHome()`
  (resolved, absolute) and use it from app-tools + connector-tools; add it to
  the stub Runtimes in the affected tests.
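
The `failRespawn` cleanup from the first item can be sketched as below. The `Instance` shape and the injected `dropAutomations` callback are hypothetical; the point is only that the state transition always happens and the automation cleanup is best-effort.

```typescript
// Hypothetical instance record.
interface Instance { serverName: string; state: "running" | "dead" }

function failRespawnSketch(
  inst: Instance,
  dropAutomations: (serverName: string) => void, // may throw
): void {
  // Transition first, so the instance is never left "running" without a source.
  inst.state = "dead";
  try {
    dropAutomations(inst.serverName);
  } catch {
    // Best-effort: a cleanup failure must not mask the original spawn error.
  }
}
```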

QA suggestion #2 (a "mixed versions" hint during lazy convergence) is the
documented eventual-consistency model working as designed — deferred, tracked
separately.
The upgrade flow and update UI in this PR originate from #168.

Co-authored-by: Mason <31372737+Ovaculos@users.noreply.github.com>
@mgoldsborough mgoldsborough merged commit 69a788e into main May 27, 2026
7 of 8 checks passed
@mgoldsborough mgoldsborough deleted the feat/bundle-upgrade-flow-v2 branch May 27, 2026 09:15
