Skip to content

perf(runtime): unblock event loop + bypass server for switch action#41

Open
panosAthDBX wants to merge 13 commits into
Ataraxy-Labs:mainfrom
panosAthDBX:perf/async-tmux-bypass-switch
Open

perf(runtime): unblock event loop + bypass server for switch action#41
panosAthDBX wants to merge 13 commits into
Ataraxy-Labs:mainfrom
panosAthDBX:perf/async-tmux-bypass-switch

Conversation

@panosAthDBX
Copy link
Copy Markdown

Summary

Brings rapid M-N session switching from ~1.5s/switch under load to ~60ms/switch on an interactive setup with 4 sessions, 3 attached sidebars, and a chatty remote hermes watcher pushing 3-5 events/sec.

Three layers, ordered by where they unblock:

1. Keybinding & script overhead (~30%)

  • run-shell -b on all bindings so tmux's input loop doesn't block while shells run. Without -b, rapid presses serialise on tmux's command queue and reentrant tmux calls inside the script compound the wait.
  • Pass #{client_tty}|#{session_name}|#{window_id} as a positional arg from the bind-key, skipping a tmux display-message fork (~16ms).
  • server-common.sh: alive-check tmpfile (5s TTL) skips the ~80ms healthcheck curl on the hot path; plugin-dir / bun lookup deferred to the cold-start branch only.
  • Bump curl -m 0.2-m 1.5, --connect-timeout 0.10.3, end with || true; exit 0 so tmux's run-shell never surfaces curl exit 28 (timeout) in the status line during transient spikes.

2. Server-side event loop (~40%)

  • Drop invalidateCurrentSessionCache() from broadcastStateImmediate. The 500ms TTL cache was being wiped on every broadcast (per agent-emit, dozens/sec under chatty watchers), defeating its purpose. handleFocus() still invalidates on real focus changes.
  • Swap raw getCurrentSession()getCachedCurrentSession() at the hot computeState / ensure-sidebar / toggle sites.
  • Debounce broadcastState (30ms) and refreshPaneAgents (250ms). Sub-perception; coalesce bursts that previously each triggered a fresh tmux subprocess (or tmux list-panes -a + ps + lsof in the case of refreshPaneAgents).
  • New AsyncReadCapable interface + isAsyncReadCapable guard. TmuxClient gains runAsync (uses Bun.spawn(...).exited) and async siblings: listSessionsAsync, listPanesAsync, listClientsAsync, getCurrentSessionAsync, getActiveSessionDirsAsync, getAllPaneCountsAsync. TmuxProvider implements them; its async listSessions runs the two underlying tmux subprocesses (list-sessions + active-session-dirs) in parallel via Promise.all.
  • computeState() and broadcastStateImmediate() are now async. Provider list calls in computeState run through Promise.all on the async API where available; legacy providers fall back to sync. broadcastStateImmediate uses an in-flight + trailing flag so concurrent triggers coalesce instead of stacking.

3. Bypass server for the switch action (the rest)

  • Server writes ORDERING_FILE (/tmp/opensessions.<KEY>.ordering) on every state broadcast — one visible session name per line.
  • switch-index.sh fast-path: read line N from the file, call tmux switch-client -t <name> directly. The server-side POST still goes out (backgrounded) so the server can update unseen flags / sidebar focus / custom ordering, but user-perceived latency is just one tmux fork. The existing tmux hook (client-session-changed/focus) keeps the server in sync.
  • Cold-path fallback preserved: if ordering file is missing (boot, or ordering not yet broadcast), behave as before.

Numbers

Metric Before After
GET / healthcheck 218 ms 20 ms
POST /switch-index isolated 1033 ms (timed out at 1 s) 54 ms
5 parallel POSTs total timed out 32 ms
switch-index.sh end-to-end (warm) 270 ms 60 ms
5 rapid-fire switches ~1.5 s ~280 ms

Test plan

  • bun build apps/server/src/main.ts — clean
  • Server boots, sidebars spawn in all 3 sessions, ordering file written
  • M-N rapid-fires don't surface "returned 28" in tmux status
  • Cold-path fallback still works (delete ordering file, switch still goes through server)
  • Maintainer: run existing test suites in packages/runtime/test/ to confirm broadcastStateImmediate async migration doesn't regress sidebar coordinator state machine
  • Maintainer: verify zellij provider still works (it doesn't implement AsyncReadCapable so the runtime falls back to sync paths via the type guard)

Notes

This pull request and its description were written by Isaac.

panosAthDBX added 11 commits May 4, 2026 11:14
Adds an opt-in background poller that reads the macOS Appearance setting
(`defaults read -g AppleInterfaceStyle`) every few seconds and flips
the active theme between a configured dark and light pair when the user
toggles System Settings → Appearance.

Config:
  autoThemeFollowsSystem: true
  darkTheme:  "catppuccin-mocha"    (default)
  lightTheme: "catppuccin-latte"    (default)

macOS-gated — no-op on Linux/Windows. Uses the same broadcast path as
the manual theme picker so every connected sidebar re-renders in sync.
Poller is a 3s interval cleaned up on shutdown.

Extracted pure mapping and side-effectful read into `system-theme.ts`
for testability. Covered by 5 new tests.
When `autoThemeFollowsSystem` is on and the user manually picks a theme via
`set-theme`, route the persistence to `darkTheme` or `lightTheme` based on the
current macOS appearance instead of `theme`. Previously the choice was written
to `theme`, which the poll loop ignored, so the manual override was silently
overwritten on the next 3s tick.

Also re-load `darkTheme` / `lightTheme` from disk inside the poll cycle so an
override made via `set-theme` is picked up immediately on the next poll.

Drops the `saveConfig({ theme: desired })` write inside the poll loop — that
was clobbering the user's static-mode `theme` field whenever auto-follow ran.

Co-authored-by: Isaac
Two server hot-path optimizations targeting steady-state CPU and WS fan-out:

1. Replace the 3s `defaults read` polling loop with a kqueue-based file
   watch on `~/Library/Preferences/.GlobalPreferences.plist`. macOS rewrites
   that plist on any global-preference change, so we let the kernel push us
   the event and re-read appearance only when the value differs. A 60s
   safety poll covers the atomic-rename case where kqueue loses the inode.
   Removes ~28,800 subprocess spawns per server per day.

2. Hash the serialized state in `broadcastStateImmediate()` and skip the
   `server.publish("sidebar", msg)` when the payload is byte-identical to
   the previous send. Many call sites trigger broadcastState() but most do
   not actually change observable state (theme polls, focus moves to the
   same session, agent updates with identical metadata). Wire protocol is
   unchanged; new clients still receive `lastState` directly via the WS
   `open` handler.

Tests: 400/400 passing (was 397; +3 covering the watcher's no-op fallback,
idempotent stop, and darwin initial-fire).
`watcherBroadcastTimer` was referenced in `cleanup()` but never declared,
so every cleanup invocation (SIGINT/SIGTERM/30s idle timeout) crashed
with `ReferenceError: watcherBroadcastTimer is not defined` instead of
shutting down gracefully. Process exited anyway, so the bug was latent.
Just drop the dead line.
The ensure-sidebar HTTP handler currently triggers two `tmux list-panes -a`
calls per request:

1. ensureSidebarInWindow() lists panes to check for an existing sidebar
   (line 1206). Result cached for 300ms.
2. enforceSidebarWidth() invalidates the cache and lists again (line 1485)
   to walk panes for width enforcement.

The second list is redundant — the cache is at most ~30ms old when enforce
runs synchronously after the spawn check, and `tmux list-panes -a` is the
single most expensive call on a busy tmux (50-200ms with 30+ panes).

Add `reuseCache` to enforceSidebarWidth() and pass it from
ensureSidebarInWindow(). Halves the list-panes work on every session switch.

The standalone enforceSidebarWidth() callers (terminal resize, sidebar
toggle, width sync) keep the existing default behaviour.
eventTimestamps is a server-internal diagnostic — the tracker uses it for
stale/active heuristics — but it was being shipped to every WS client on
every state broadcast. The TUI never reads the field.

Each agent-emit appends a fresh timestamp, so back-to-back identical
opencode/claude-code/etc. status pings ("running" → "running" with the same
threadId) produced different state hashes and bypassed the broadcast dedup
introduced in a733f28. With chatty agents this fanned out one full state
JSON to every TUI client at ~1Hz per active session, visibly stalling the
foreground client when switching to an agent-heavy session.

Make the field optional in the SessionData shape and stop populating it in
computeState(). Tracker still maintains the timestamps server-side; they
just no longer escape onto the wire. Hash-dedup now correctly suppresses
no-op agent emits.
When the idle-timeout cleanup runs and a manual `toggle.sh ensure_server`
fires within the kernel's TIME_WAIT window (~30s on macOS), the new
process hits EADDRINUSE on port 7391 even though `lsof -iTCP:7391`
returns nothing — the socket is still reserved.

Setting `reusePort: true` on Bun.serve lets the new process bind
immediately. Single-server invariant is still enforced by PID file.
30s was too aggressive on restart paths. After any manual respawn or
plugin update, the TUI clients live inside sidebar panes that don't
exist yet — the user has to press the toggle key to spawn one, and that
exceeded the previous 30s window in normal flow. The server would
self-terminate before it could be useful, leading to a frustrating
loop where every M-s hit a stale server and re-triggered ensure_server.

5 min gives a comfortable usable window without leaving zombie servers
running indefinitely. Active servers with WS clients clear the timer
the instant the first TUI connects, so this change only affects the
no-clients case.
reusePort: true was the wrong fix for the EADDRINUSE-on-respawn issue.
On macOS Bun's reusePort lets multiple processes bind the same port and
share connections at the kernel level, but the opensessions runtime keeps
all session/agent/sidebar state in process-local memory. Allowing duplicate
servers meant clients round-robined across disjoint state and session
switches behaved erratically.

Revert reusePort and replace with a proper singleton guard: read PID_FILE,
probe the recorded pid with `process.kill(pid, 0)`, and exit cleanly if
that process is still alive. Stale PID files (process gone) are ignored
and the new server takes over normally.

This makes `toggle.sh ensure_server` idempotent — repeated invocations
during the kernel's TIME_WAIT window collapse to no-ops instead of racing
to spawn a second server.
Captures before/after numbers for the seven commits landed this session:
push-based theme watch, broadcast hash-dedup, enforce cache reuse,
eventTimestamps removal from the wire, idle-timeout bump, and the
singleton-guard / reusePort revert.

Headline: ~3.7× faster enforce dance on session switch, ~76% of agent-emit
broadcasts suppressed under chatty agents, theme polling subprocess work
eliminated, no more EADDRINUSE on respawn.
Three layered changes to bring rapid M-N session switching from
~1.5s/switch under load to ~60ms/switch:

1. Keybinding & script overhead

   - Bind keys via `run-shell -b` so tmux's input loop doesn't block
     while the shell script runs. Without -b, rapid presses serialise
     on tmux's command queue and reentrant tmux calls inside the
     script compound the wait.
   - Pass `#{client_tty}|#{session_name}|#{window_id}` as a positional
     arg from the bind-key, skipping a `tmux display-message` fork
     inside the script (~16ms saved).
   - server-common.sh: alive-check tmpfile (5s TTL) skips the
     ~80ms healthcheck curl on the hot path; plugin-dir / bun lookup
     deferred to the cold-start branch only.
   - Bump curl `-m 0.2` -> `-m 1.5`, `--connect-timeout 0.1` -> `0.3`,
     and `|| true; exit 0` so tmux's run-shell never surfaces curl
     exit 28 (timeout) in the status line during transient spikes.

2. Server-side event loop

   - Drop `invalidateCurrentSessionCache()` from broadcastStateImmediate.
     The 500ms TTL cache was being wiped on every broadcast (per
     agent-emit, dozens/sec under chatty watchers like a remote
     hermes plugin), defeating its purpose. handleFocus() still
     invalidates on real focus changes.
   - Swap raw getCurrentSession() -> getCachedCurrentSession() at
     the hot computeState/ensure-sidebar/toggle sites.
   - Debounce broadcastState (30ms) and refreshPaneAgents (250ms).
     Sub-perception windows; coalesce bursts that previously each
     triggered a fresh tmux subprocess (or a `tmux list-panes -a` +
     ps + lsof in the case of refreshPaneAgents).
   - Add AsyncReadCapable interface + isAsyncReadCapable guard.
     TmuxClient gains `runAsync` (Bun.spawn(...).exited) and async
     siblings: listSessionsAsync, listPanesAsync, listClientsAsync,
     getCurrentSessionAsync, getActiveSessionDirsAsync,
     getAllPaneCountsAsync. TmuxProvider implements them; its async
     listSessions runs the two underlying tmux subprocesses
     (list-sessions + active-session-dirs) in parallel via Promise.all.
   - computeState() and broadcastStateImmediate() are now async.
     Provider list calls in computeState run through Promise.all on
     the async API where available; legacy providers fall back to
     sync. broadcastStateImmediate uses an in-flight + trailing flag
     so concurrent triggers coalesce instead of stacking.

3. Bypass server for the switch action

   - Server writes ORDERING_FILE (/tmp/opensessions.<KEY>.ordering)
     on every state broadcast, one visible session name per line.
   - switch-index.sh fast-path: read line N from the file, call
     `tmux switch-client -t <name>` directly. The server-side POST
     still goes out (backgrounded) so the server can update unseen
     flags / sidebar focus / custom ordering, but the user-perceived
     latency is just one tmux fork. The existing tmux hook
     (client-session-changed -> POST /focus) keeps the server in sync.
   - Cold-path fallback: if the ordering file is missing (boot, or
     ordering not yet broadcast), behave as before.

Measured on an interactive setup with 4 sessions, 3 attached sidebars,
and a chatty remote hermes watcher pushing events 3-5/sec:

                                   before     after
  GET /  healthcheck                218ms     20ms
  POST /switch-index isolated      1033ms     54ms
  5 parallel POSTs total            timeout   32ms
  switch-index.sh end-to-end (warm) 270ms     60ms
  5 rapid-fire switches             ~1.5s     ~280ms

Stacked on PR Ataraxy-Labs#32 (perf+themes); merge order does not strictly matter
but the runtime/server index.ts changes overlap.

Co-authored-by: Isaac
Copy link
Copy Markdown

@inspect-review inspect-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inspect review

Triage: 75 entities analyzed | 0 critical, 0 high, 25 medium, 50 low
Verdict: standard_review

Findings (1)

  1. [low] cleanup() does not clear the new broadcast debounce timer, so broadcastStateImmediate() can still run after shutdown. Evidence: broadcastState() sets broadcastDebounceTimer via setTimeout(..., BROADCAST_DEBOUNCE_MS) in packages/runtime/src/server/index.ts, but cleanup() clears debounceTimer/sidebarEnforceTimer/etc. and never clears broadcastDebounceTimer.

Reviewed by inspect | Entity-level triage found 0 high-risk changes

@Palanikannan1437 Palanikannan1437 self-requested a review May 18, 2026 20:06
Preserve uncommitted work before porting these concepts onto the Rust/ratatui rewrite:
- tracker: hard-prune unseen+terminal agents after 15min (was: never pruned unseen)
- opencode watcher: evict locally-cached sessions after 15min with no DB hit
- themes: add tokyo-night-storm + tango-adapted
- tui: header bg crust -> base

Co-authored-by: Isaac
Copy link
Copy Markdown

@inspect-review inspect-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inspect review

Triage: 87 entities analyzed | 0 critical, 0 high, 31 medium, 56 low
Verdict: standard_review

Findings (7)

  1. [low] Race condition in broadcastStateImmediate: broadcastTrailing flag can be set after broadcastInFlight is cleared but before the queueMicrotask callback executes, causing the trailing broadcast to be lost
  2. [low] Incorrect background color in TUI: changed from P().crust to P().base, but this breaks the visual hierarchy since base is typically lighter/brighter than crust in most themes
  3. [low] Potential null pointer exception in switch-index.sh: TARGET variable can be empty if awk finds no match, but tmux switch-client is called unconditionally without checking if TARGET is non-empty
  4. [low] Stale cache data bug: getCachedCurrentSession() is used in computeState() but the cache is never invalidated on broadcastStateImmediate, meaning the currentSession in the broadcast can be stale for up to 500ms even after a real session switch
  5. [low] Asymmetric cache trust in enforceSidebarWidth: reuseCache option skips invalidateSidebarPaneCache() but the caller (ensureSidebarInWindow) assumes the cache is fresh based on a 300ms TTL that may have already expired by the time enforceSidebarWidth is called
  6. [low] Missing null check in switch-index.sh fast path: if ORDERING_FILE exists but is empty or malformed, TARGET will be empty and the fire-and-forget curl will POST an invalid context string with empty session name
  7. [low] Unhandled promise rejection: watchMacSystemAppearance callback is async (onChange: (mode) => void | Promise) but the watcher invokes it with void check() which swallows any promise rejection from the callback

Reviewed by inspect | Entity-level triage found 0 high-risk changes

…atch rejections

cleanup() cleared the existing timers but not the two debounce timers this
branch added (broadcastDebounceTimer, refreshPaneAgentsTimer), so a pending
broadcastStateImmediate() or refreshPaneAgents() could still fire after
shutdown. Clear both alongside the others.

watchMacSystemAppearance's check() is invoked as `void check()` at all three
call sites (file-watch event, safety poll, initial read), so a rejection from
the consumer's onChange callback became an unhandled promise rejection. Wrap
the check body in try/catch — the watch is best-effort and the next event or
safety poll retries.

Co-authored-by: Isaac
Copy link
Copy Markdown

@inspect-review inspect-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inspect review

Triage: 87 entities analyzed | 0 critical, 0 high, 31 medium, 56 low
Verdict: standard_review

Findings (7)

  1. [low] Race condition in broadcastStateImmediate: broadcastTrailing flag can be set after broadcastInFlight is cleared but before the function returns, causing the trailing broadcast to be lost. The queueMicrotask at the end checks broadcastTrailing, but another concurrent call could set it to true after the check but before the function exits.
  2. [low] Incorrect cache invalidation logic in broadcastStateImmediate: The comment says 'do NOT invalidate currentSession cache here' but the cache is never invalidated on focus changes either. handleFocus calls invalidateCurrentSessionCache() but then immediately calls getCachedCurrentSession() in computeState, which will return stale data if the cache TTL hasn't expired yet.
  3. [low] Potential null pointer exception in switch-index.sh: The script reads ORDERING_FILE and extracts TARGET with awk, but if awk returns empty string (index out of bounds), the script proceeds to call 'tmux switch-client -t ""' which will fail. The check '[ -n "$TARGET" ]' guards the fast path, but if it's empty, the script falls through to the cold path without any error handling.
  4. [low] Missing await in system theme watcher callback: The onChange callback in watchMacSystemAppearance is declared as '(mode) => void | Promise' but the internal check() function that calls it uses 'await onChange(mode)' inside a try-catch. However, the actual usage in server/index.ts passes 'syncSystemTheme' which is async, but the callback is invoked as 'void check()' from the file watcher and safety timer, meaning promise rejections could be unhandled if the try-catch doesn't catch them.
  5. [low] Incorrect pruning logic in AgentTracker.pruneTerminal: The code checks 'if (age <= TERMINAL_PRUNE_MS) continue;' which skips pruning for events YOUNGER than the threshold, but then checks 'if (isUnseen && age <= TERMINAL_HARD_PRUNE_MS) continue;' which also skips for events younger than the hard threshold. However, this means events between TERMINAL_PRUNE_MS and TERMINAL_HARD_PRUNE_MS that are NOT unseen will be pruned, but the comment says 'Seen + non-alive: prune after TERMINAL_PRUNE_MS (5 min)'. The logic should be: if seen and age > TERMINAL_PRUNE_MS, prune; if unseen and age > TERMINAL_HARD_PRUNE_MS, prune.
  6. [low] Singleton guard race condition: The PID file check uses process.kill(existingPid, 0) to probe if a process is alive, but between the probe and writing the new PID, another process could start and write its own PID. This creates a TOCTOU (time-of-check-time-of-use) race where multiple servers could all pass the check and start simultaneously.
  7. [low] Missing null check in computeState: The code maps over sessionLists[i] with 'for (const s of sessionLists[i]!)' using non-null assertion, but if a provider's listSessionsAsync() rejects, the Promise.all will reject and the entire function will throw. However, the code should handle partial failures gracefully since one provider failing shouldn't crash the entire state computation.

Reviewed by inspect | Entity-level triage found 0 high-risk changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant