feat(gateway): multi-replica support (session affinity + cron coordination)#132

Open
jacoblee-io wants to merge 1 commit into main from feat/gateway-ha-v2
@jacoblee-io
Collaborator

Summary

Enable Gateway to run multiple replicas in K8s without introducing Redis, using three complementary mechanisms:

  • Session affinity (Helm): ClientIP-based sticky sessions when replicas > 1, pod anti-affinity for spread, downward API env vars (SICLAW_POD_NAME, SICLAW_POD_IP)
  • Config cache TTL: 30s polling interval for CSP/metrics/SSO config so all replicas converge within 30s of a settings change
  • Cron coordination: Restore distributed scheduling from the old CronCoordinator (commit 7c41fa6): instance registration, heartbeat (15s), dead-instance detection (30s threshold), atomic least-loaded job claiming, stale job cancellation, and assigned job sync. Single-instance mode (instanceId=undefined) preserves existing behavior exactly.
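
For reviewers, the dead-instance detection and least-loaded claiming described above can be sketched roughly as follows. This is an in-memory simulation under stated assumptions, not the code in cron-service.ts: the `Instance` and `Job` shapes and the `findDeadInstances` and `reconcile` names are hypothetical, and the real claim is presumably an atomic DB update issued by each replica rather than a central loop. Only the 30s threshold comes from this PR.

```typescript
// Hypothetical in-memory model; the PR itself works against DB tables.
interface Instance {
  id: string;
  lastHeartbeatMs: number; // time of the last heartbeat (sent every 15s per the summary)
}

interface Job {
  id: string;
  assignedTo: string | null; // instance id, or null when unclaimed
}

const DEAD_THRESHOLD_MS = 30_000; // dead-instance threshold from the summary

/** Ids of instances whose last heartbeat is older than the threshold. */
function findDeadInstances(instances: Instance[], nowMs: number): string[] {
  return instances
    .filter((i) => nowMs - i.lastHeartbeatMs > DEAD_THRESHOLD_MS)
    .map((i) => i.id);
}

/**
 * Release jobs owned by dead instances, then give every unassigned job
 * to the live instance that currently holds the fewest jobs.
 * Returns new job objects; the inputs are not mutated.
 */
function reconcile(instances: Instance[], jobs: Job[], nowMs: number): Job[] {
  const dead = new Set(findDeadInstances(instances, nowMs));
  const live = instances
    .filter((i) => !dead.has(i.id))
    .map((i) => i.id)
    .sort(); // deterministic tie-breaking by id

  const result: Job[] = jobs.map((j) =>
    j.assignedTo !== null && dead.has(j.assignedTo)
      ? { id: j.id, assignedTo: null }
      : { id: j.id, assignedTo: j.assignedTo }
  );
  if (live.length === 0) return result; // nobody left to claim work

  // Count the jobs each live instance already owns.
  const load = new Map<string, number>();
  for (const id of live) load.set(id, 0);
  for (const j of result) {
    if (j.assignedTo !== null && load.has(j.assignedTo)) {
      load.set(j.assignedTo, load.get(j.assignedTo)! + 1);
    }
  }

  for (const j of result) {
    if (j.assignedTo !== null) continue;
    let target = live[0];
    for (const id of live) {
      if (load.get(id)! < load.get(target)!) target = id;
    }
    j.assignedTo = target;
    load.set(target, load.get(target)! + 1);
  }
  return result;
}
```

In this sketch an instance whose heartbeat is exactly 30s old still counts as alive; whether the real threshold check is strict or inclusive is not stated in the PR.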

Files changed

| Area | Files |
| --- | --- |
| Helm | values.yaml, gateway-service.yaml, gateway-deployment.yaml |
| Config cache | server.ts (30s TTL interval + cleanup in close()) |
| Cron coordination | cron-service.ts (reconcile loop, claim/cancel/sync) |
| Startup | gateway-main.ts (instanceId resolution), server.ts (pass-through) |
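
As a rough illustration of the instanceId resolution step, the sketch below assumes the ID is taken from the `SICLAW_POD_NAME` downward API variable and falls back to `undefined` (single-instance mode) when it is absent. The PR description does not spell out this mapping, so both the `resolveInstanceId` helper name and the fallback rule are assumptions.

```typescript
/**
 * Hypothetical helper: derive the cron instance ID from the environment.
 * Assumes SICLAW_POD_NAME is the source; returns undefined when it is
 * missing or blank, which corresponds to single-instance mode.
 */
function resolveInstanceId(
  env: Record<string, string | undefined>
): string | undefined {
  const podName = env.SICLAW_POD_NAME?.trim();
  return podName ? podName : undefined;
}
```

In gateway-main.ts this would presumably be called with `process.env` and the result passed through server.ts to the cron service.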

Known limitations (pre-existing, out of scope)

These issues exist independently of this PR and are tracked for follow-up:

| Issue | Severity | Mitigation |
| --- | --- | --- |
| Channel manager boots all channels on all replicas | HIGH | Needs distributed lock (follow-up PR) |
| UserStore in-memory cache has no cross-replica invalidation | MEDIUM | Needs TTL refresh or DB fallback (follow-up PR) |
| OAuth2 pendingStates in-memory | LOW | Session affinity covers most cases |
| Cron WS notifications go to owning replica only | LOW | Notification persisted in DB; UI polls |

Test plan

  • npx tsc --noEmit passes
  • npm test — all tests pass
  • Single-replica regression: deploy with replicas: 1, verify cron jobs fire normally, config changes apply immediately, notifications delivered
  • Dual-replica: deploy with replicas: 2, verify cron_instances table shows both instances, cron_jobs.assigned_to distributes jobs, kill one pod and verify failover within 30s
  • Scale up/down: 1 → 2 → 3 → 1, verify no job loss or duplicate execution at each step
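
One way to make the "no job loss or duplicate execution" checks mechanical is to compare a log of observed runs against the expected schedule. The sketch below assumes a hypothetical `JobRun` record (job ID, scheduled time, executing instance) collected during the test; nothing in this PR defines such a log, so treat the shape and the helper names as placeholders.

```typescript
// Hypothetical record of one observed cron execution.
interface JobRun {
  jobId: string;
  scheduledAt: string; // e.g. an ISO timestamp of the scheduled tick
  instanceId: string;
}

interface ExpectedRun {
  jobId: string;
  scheduledAt: string;
}

function runKey(jobId: string, scheduledAt: string): string {
  return `${jobId}@${scheduledAt}`;
}

/** Keys of (job, tick) pairs that executed more than once, on any replica. */
function findDuplicateRuns(runs: JobRun[]): string[] {
  const counts = new Map<string, number>();
  for (const r of runs) {
    const key = runKey(r.jobId, r.scheduledAt);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, n]) => n > 1)
    .map(([key]) => key)
    .sort();
}

/** Keys of expected (job, tick) pairs that never executed. */
function findMissedRuns(expected: ExpectedRun[], runs: JobRun[]): string[] {
  const seen = new Set<string>(runs.map((r) => runKey(r.jobId, r.scheduledAt)));
  return expected
    .map((e) => runKey(e.jobId, e.scheduledAt))
    .filter((key) => !seen.has(key))
    .sort();
}
```

Both functions returning an empty array after each scale step would correspond to the pass condition in the last item of the test plan.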

…n coordination

Enable Gateway to run multiple replicas without Redis by combining three
mechanisms:

1. Helm: session affinity (ClientIP) when replicas > 1, pod anti-affinity,
   and downward API env vars (SICLAW_POD_NAME, SICLAW_POD_IP)

2. Config cache TTL: 30s setInterval polls DB for CSP, metrics, and SSO
   config changes so non-local replicas converge within 30s

3. Cron coordination: restore distributed scheduling logic from the old
   CronCoordinator (commit 7c41fa6) into CronService — instance
   registration, heartbeat, dead-instance detection, atomic job claiming
   (least-loaded first), cancelStaleJobs, syncAssignedJobs. Single-instance
   mode (instanceId=undefined) preserves existing behavior exactly.
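
The config cache mechanism (item 2) can be pictured as a small polling wrapper. This is a minimal sketch, assuming a generic async loader in place of the actual DB queries in server.ts; the `PollingConfigCache` class and its method names are hypothetical, and keeping the last good value on a failed load is a design assumption rather than something the PR states. The 30s default and the cleanup in `close()` mirror the description above.

```typescript
/** Hypothetical polling cache: re-reads config on a fixed interval. */
class PollingConfigCache<T> {
  private value: T;
  private readonly load: () => Promise<T>;
  private readonly ttlMs: number;
  private timer: ReturnType<typeof setInterval> | undefined = undefined;

  constructor(initial: T, load: () => Promise<T>, ttlMs: number = 30_000) {
    this.value = initial;
    this.load = load;
    this.ttlMs = ttlMs;
  }

  /** Current cached value; never blocks on the DB. */
  get(): T {
    return this.value;
  }

  /** Reload once. On failure the previous value is kept (assumption). */
  async refresh(): Promise<void> {
    try {
      this.value = await this.load();
    } catch {
      // Keep serving the last good value until the next poll succeeds.
    }
  }

  /** Begin polling; calling start() twice does not create a second timer. */
  start(): void {
    if (this.timer !== undefined) return;
    this.timer = setInterval(() => {
      void this.refresh();
    }, this.ttlMs);
  }

  /** Stop polling so the process can shut down cleanly. */
  close(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}
```

With this shape, the replica that handles a settings change could also call `refresh()` directly, while the other replicas pick up the change on their next poll, which bounds convergence at roughly one interval.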