
fix(CubeProxy): implement singleflight pattern for routing and health checks #135

Open
novahe wants to merge 1 commit into TencentCloud:master from novahe:fix/faulty-backend-singleflight

Conversation

Contributor

@novahe novahe commented May 4, 2026

Problem

In high-concurrency environments, CubeProxy suffered from multiple race conditions and logic flaws that degraded performance and introduced "dog-pile" effects (cache stampedes):

  1. Dead Faulty Backend Logic: The is_faulty_backend check was effectively disabled as the cache was never populated. This resulted in either redundant $O(N)$ Redis SMEMBERS scans or missed detection.
  2. Cache Stampedes (Thundering Herd): When local caches for backend routing metadata or faulty backend states expired, massive concurrent requests would simultaneously miss the cache and overwhelm Redis with redundant queries (HGETALL or SISMEMBER).
  3. Synchronized Cache Expiry: Worker processes lacked unique random seeds, causing cache TTL jitters to synchronize across workers, leading to simultaneous cache expirations.
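The synchronized-expiry issue in point 3 can be illustrated with a minimal sketch (Python for illustration only — the project itself is Lua/OpenResty, and `jittered_ttl` is a hypothetical helper): when every worker shares the same seed, they all draw the same jitter, so their cache entries still expire in lockstep.

```python
import random

def jittered_ttl(rng, base_ttl, jitter_frac=0.2):
    """Stretch base_ttl by a random factor so entries written at the
    same moment do not all expire at the same instant."""
    return base_ttl * (1 + jitter_frac * rng.random())

# Two "workers" seeded identically (the bug): their jitter sequences
# match, so their cached entries still expire together.
worker_a, worker_b = random.Random(1), random.Random(1)
synced = jittered_ttl(worker_a, 30) == jittered_ttl(worker_b, 30)

# Distinct per-worker seeds (the fix) desynchronize the expirations.
worker_c = random.Random(2)
desynced = jittered_ttl(random.Random(1), 30) != jittered_ttl(worker_c, 30)
```

With identical seeds `synced` is true and the jitter is useless; distinct seeds make `desynced` true, spreading the expirations out.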

Proposed Changes

  • Generic Singleflight Abstraction: Introduced utils:singleflight_do, a cross-worker synchronization mechanism inspired by Go's sync/singleflight. It uses resty.lock with a Double-Checked Locking pattern and xpcall with tracebacks for robust error handling.
  • Harden Metadata Resolution: Applied singleflight protection to get_backend_address. Redis HGETALL queries are now limited to at most one per key globally during cache misses.
  • Optimize Faulty Backend Detection: Switched to $O(1)$ SISMEMBER checks and fixed the missing cache population.
  • Dedicated Lock Dictionaries: Added local_cache_locks and faulty_backend_locks in nginx.conf to isolate lock states from data entries, preventing LRU eviction conflicts.
  • Seed Worker Randomness: Fixed random generator seeding in init_worker_phase.lua to ensure effective TTL jittering across all Nginx workers.
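The singleflight idea behind `utils:singleflight_do` can be sketched roughly as follows (threaded Python for illustration; the PR's actual version is Lua using `resty.lock` across Nginx workers, and the class and names here are hypothetical): the first caller for a key does the work, and concurrent callers for the same key wait and share its result.

```python
import threading
import time

class SingleFlight:
    """Deduplicate concurrent calls: for each key, one caller runs fn,
    the rest block until it finishes and receive the shared result."""

    def __init__(self):
        self._mu = threading.Lock()
        self._calls = {}  # key -> {"done": Event, "result": ...}

    def do(self, key, fn):
        with self._mu:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "result": None}
                self._calls[key] = call
        if leader:
            try:
                call["result"] = fn()
            finally:
                call["done"].set()
                with self._mu:
                    self._calls.pop(key, None)
        else:
            call["done"].wait()
        return call["result"]

# Demo: five concurrent cache misses for the same key trigger a single
# (simulated) Redis round trip.
counter = {"loads": 0}

def fetch_backend():
    counter["loads"] += 1
    time.sleep(0.2)  # simulate a slow HGETALL
    return "10.0.0.7:8080"

sf = SingleFlight()
results = []
threads = [threading.Thread(target=lambda: results.append(sf.do("sandbox:1", fetch_backend)))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the Lua version, re-checking the cache after acquiring the `resty.lock` is the double-checked-locking step; here the dictionary lookup under `_mu` plays the same role.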

…end checks

- Abstracted singleflight logic into a reusable `utils:singleflight_do` method (Go-like interface).
- Optimized faulty backend checks by switching to O(1) SISMEMBER and preventing cache stampedes.
- Hardened sandbox metadata fetching in `get_backend_address` to prevent Redis thundering herd.
- Introduced dedicated shared memory lock dictionaries for better cache isolation.

Signed-off-by: novahe <heqianfly@gmail.com>
@novahe novahe force-pushed the fix/faulty-backend-singleflight branch from d5ecb25 to f52eda9 on May 4, 2026 02:38
@staryxchen
Collaborator

Thanks for the PR.

Singleflight for routing — I agree this is valuable, but have concerns about introducing locks in the data path. CubeProxy sits on every request's critical path. resty.lock with a 5s timeout means that if Redis has a slow moment, all coroutines hitting the same key will queue up and block. This converts a brief Redis pressure spike into a deterministic latency cliff. For a data-plane component I'd prefer lock-free approaches.

In practice, the existing refresh-on-hit already prevents stampedes for active sandboxes (the vast majority). The real gap is: 1) Cold start — all workers start with empty cache simultaneously; 2) Synchronized TTL — workers share the same random seed.

math.randomseed — As analyzed above, this is a good catch 👍

is_faulty_backend rework — Not caching the faulty state is intentional: we want to avoid a stale "faulty" flag in cache/Redis persistently rejecting requests after a backend recovers. The SMEMBERS→SISMEMBER optimization is nice, but bundled with the caching change it alters the failure-recovery contract.

Removing refresh-on-hit — This overlaps with #133. Without live migration support I'd prefer to keep it for now.

Suggestion: extract the math.randomseed fix and the SMEMBERS→SISMEMBER change as two standalone PRs. For the stampede protection, let's discuss lock-free alternatives (e.g. stale-while-revalidate).
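The stale-while-revalidate alternative mentioned here can be sketched like this (illustrative Python; `SWRCache` and its methods are hypothetical names, and an OpenResty version would instead keep entries in a shared dict and refresh via a timer): expired entries are served immediately while at most one background refresh per key fetches the new value, so readers never queue behind a slow backend.

```python
import threading
import time

class SWRCache:
    """Stale-while-revalidate: serve expired entries instantly while a
    single background refresh per key updates the value."""

    def __init__(self, ttl, loader):
        self.ttl = ttl
        self.loader = loader
        self._mu = threading.Lock()
        self._data = {}           # key -> (value, expires_at)
        self._refreshing = set()  # keys with an in-flight refresh

    def get(self, key):
        with self._mu:
            entry = self._data.get(key)
            fresh = entry is not None and time.monotonic() < entry[1]
            start_refresh = not fresh and key not in self._refreshing
            if start_refresh:
                self._refreshing.add(key)
        if fresh:
            return entry[0]
        if entry is None:
            # Cold miss: nothing to serve stale, so load synchronously.
            return self._store(key, self.loader(key))
        if start_refresh:
            threading.Thread(target=lambda: self._store(key, self.loader(key))).start()
        return entry[0]  # stale value served without blocking

    def _store(self, key, value):
        with self._mu:
            self._data[key] = (value, time.monotonic() + self.ttl)
            self._refreshing.discard(key)
        return value

# Demo: after expiry the old value is still returned while the refresh runs.
loads = []

def loader(key):
    loads.append(key)
    return len(loads)

cache = SWRCache(ttl=0.05, loader=loader)
first = cache.get("k")      # cold miss: synchronous load -> 1
time.sleep(0.1)             # entry expires
stale = cache.get("k")      # stale hit: still 1; refresh starts in background
time.sleep(0.1)             # background refresh has completed
refreshed = cache.get("k")  # refreshed value -> 2
```

The trade-off versus singleflight is bounded staleness instead of blocking: no request ever waits on a lock, at the cost of serving a value up to one refresh interval old.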

@novahe
Contributor Author

novahe commented May 4, 2026

@staryxchen Thanks for your thorough review! I've taken your advice and started splitting this PR into separate, more manageable ones.

I'll close this PR once all parts are moved. Thanks!

