feat(gateway): multi-replica support (session affinity + cron coordination) by jacoblee-io · Pull Request #132 · scitix/siclaw

jacoblee-io · 2026-03-17T13:35:41Z

Summary

Enable Gateway to run multiple replicas in K8s without introducing Redis, using three complementary mechanisms:

- Session affinity (Helm): ClientIP-based sticky sessions when replicas > 1, pod anti-affinity for spread, downward API env vars (`SICLAW_POD_NAME`, `SICLAW_POD_IP`)
- Config cache TTL: 30s polling interval for CSP/metrics/SSO config so all replicas converge within 30s of a settings change
- Cron coordination: Restore distributed scheduling from old `CronCoordinator` (commit 7c41fa6): instance registration, heartbeat (15s), dead-instance detection (30s threshold), atomic least-loaded job claiming, stale job cancellation, and assigned job sync. Single-instance mode (`instanceId=undefined`) preserves existing behavior exactly.
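
To make the cron coordination rules above concrete, here is a minimal in-memory sketch of two of them: dead-instance detection against the 30s heartbeat threshold, and least-loaded selection when claiming a job. The names `InstanceRow`, `findDeadInstances`, and `pickLeastLoaded` are hypothetical and do not come from `cron-service.ts`. The real claim is described as atomic, which presumably happens at the database level; this sketch only illustrates the selection logic and is not atomic.

```typescript
// Hypothetical row shape; the real cron_instances columns are not shown in this PR description.
interface InstanceRow {
  id: string;
  lastHeartbeatMs: number; // epoch milliseconds of the most recent heartbeat
}

// 30s threshold from the summary; heartbeats are sent every 15s.
const DEAD_AFTER_MS = 30 * 1000;

// Return the ids of instances whose last heartbeat is older than the threshold.
function findDeadInstances(
  instances: InstanceRow[],
  nowMs: number,
  thresholdMs: number = DEAD_AFTER_MS,
): string[] {
  return instances
    .filter((inst) => nowMs - inst.lastHeartbeatMs > thresholdMs)
    .map((inst) => inst.id);
}

// Pick the live instance with the fewest assigned jobs.
// Ties go to the first id in the list; returns undefined when no instance is live.
function pickLeastLoaded(
  liveIds: string[],
  jobCounts: Map<string, number>,
): string | undefined {
  let best: string | undefined;
  let bestCount = Infinity;
  for (const id of liveIds) {
    const count = jobCounts.get(id) ?? 0;
    if (count < bestCount) {
      best = id;
      bestCount = count;
    }
  }
  return best;
}
```

With a 15s heartbeat and a 30s threshold, an instance has to miss two consecutive heartbeats before its jobs become claimable, which is consistent with the failover window stated in the test plan.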

Files changed

| Area | Files |
| --- | --- |
| Helm | `values.yaml`, `gateway-service.yaml`, `gateway-deployment.yaml` |
| Config cache | `server.ts` (30s TTL interval + cleanup in `close()`) |
| Cron coordination | `cron-service.ts` (reconcile loop, claim/cancel/sync) |
| Startup | `gateway-main.ts` (instanceId resolution), `server.ts` (pass-through) |
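
The config cache change in `server.ts` pairs a 30s refresh interval with cleanup in `close()`. A rough sketch of that pattern follows, assuming a hypothetical `PolledConfigCache` class and an injected `load` function standing in for the DB read. Neither name is taken from the PR, and the actual code may be structured differently.

```typescript
// Hypothetical loader type; stands in for the DB query that reads CSP/metrics/SSO config.
type ConfigLoader<T> = () => Promise<T>;

class PolledConfigCache<T> {
  private value: T;
  private readonly load: ConfigLoader<T>;
  private readonly ttlMs: number;
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(initial: T, load: ConfigLoader<T>, ttlMs: number = 30 * 1000) {
    this.value = initial;
    this.load = load;
    this.ttlMs = ttlMs;
  }

  // Begin polling; calling start() twice does not create a second timer.
  start(): void {
    if (this.timer !== undefined) return;
    this.timer = setInterval(() => {
      void this.refresh();
    }, this.ttlMs);
  }

  // Reload once. On failure the last known value is kept.
  async refresh(): Promise<void> {
    try {
      this.value = await this.load();
    } catch {
      // Keep serving the previous value until the next successful poll.
    }
  }

  get(): T {
    return this.value;
  }

  get running(): boolean {
    return this.timer !== undefined;
  }

  // Stop polling so the timer does not keep the process alive after shutdown.
  close(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}
```

Because every replica polls on its own schedule, a settings change written by one replica becomes visible on the others at their next poll, at most one interval later.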

Known limitations (pre-existing, out of scope)

These issues exist independently of this PR and are tracked for follow-up:

| Issue | Severity | Mitigation |
| --- | --- | --- |
| Channel manager boots all channels on all replicas | HIGH | Needs distributed lock (follow-up PR) |
| UserStore in-memory cache has no cross-replica invalidation | MEDIUM | Needs TTL refresh or DB fallback (follow-up PR) |
| OAuth2 `pendingStates` in-memory | LOW | Session affinity covers most cases |
| Cron WS notifications go to owning replica only | LOW | Notification persisted in DB; UI polls |

Test plan

- `npx tsc --noEmit` passes
- `npm test`: all tests pass
- Single-replica regression: deploy with `replicas: 1`, verify cron jobs fire normally, config changes apply immediately, notifications delivered
- Dual-replica: deploy with `replicas: 2`, verify the `cron_instances` table shows both instances, `cron_jobs.assigned_to` distributes jobs, kill one pod and verify failover within 30s
- Scale up/down: 1 → 2 → 3 → 1, verify no job loss or duplicate execution at each step
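
For the multi-replica checks, one way to verify assignments after each scaling step is a small invariant check over a snapshot of `cron_jobs`. The sketch below is only an illustration: `JobRow` and `findAssignmentViolations` are hypothetical names, and it assumes the snapshot has already been read from the database along with the list of live instance ids. It does not detect duplicate execution on its own, only jobs that are unowned or owned by an instance that is no longer live.

```typescript
// Hypothetical snapshot row; assignedTo mirrors the cron_jobs.assigned_to column.
interface JobRow {
  id: string;
  assignedTo: string | null;
}

// Report jobs that are unassigned or assigned to an instance that is not live.
// An empty result means every job has exactly one live owner in this snapshot.
function findAssignmentViolations(
  jobs: JobRow[],
  liveInstanceIds: string[],
): string[] {
  const live = new Set(liveInstanceIds);
  const problems: string[] = [];
  for (const job of jobs) {
    if (job.assignedTo === null) {
      problems.push(`${job.id}: unassigned`);
    } else if (!live.has(job.assignedTo)) {
      problems.push(`${job.id}: assigned to non-live instance ${job.assignedTo}`);
    }
  }
  return problems;
}
```

Running this after the failover window has elapsed at each step (1 → 2 → 3 → 1) gives a quick signal for job loss. It applies to multi-replica deployments only, since how `assigned_to` is populated in single-instance mode is not described here.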

…n coordination

Enable Gateway to run multiple replicas without Redis by combining three mechanisms:

1. Helm: session affinity (ClientIP) when replicas > 1, pod anti-affinity, and downward API env vars (`SICLAW_POD_NAME`, `SICLAW_POD_IP`)
2. Config cache TTL: 30s `setInterval` polls DB for CSP, metrics, and SSO config changes so non-local replicas converge within 30s
3. Cron coordination: restore distributed scheduling logic from the old `CronCoordinator` (commit 7c41fa6) into `CronService`: instance registration, heartbeat, dead-instance detection, atomic job claiming (least-loaded first), `cancelStaleJobs`, `syncAssignedJobs`. Single-instance mode (`instanceId=undefined`) preserves existing behavior exactly.

jacoblee-io wants to merge 1 commit into `main` from `feat/gateway-ha-v2`
