
fix(CubeProxy): implement singleflight pattern for routing and health checks #135

Open
novahe wants to merge 1 commit into TencentCloud:master from novahe:fix/faulty-backend-singleflight

Conversation

Contributor

@novahe novahe commented May 4, 2026

Problem

In high-concurrency environments, CubeProxy suffered from multiple race conditions and logic flaws that degraded performance and introduced "dog-pile" effects (cache stampedes):

  1. Dead Faulty Backend Logic: The is_faulty_backend check was effectively disabled as the cache was never populated. This resulted in either redundant $O(N)$ Redis SMEMBERS scans or missed detection.
  2. Cache Stampedes (Thundering Herd): When local caches for backend routing metadata or faulty backend states expired, massive concurrent requests would simultaneously miss the cache and overwhelm Redis with redundant queries (HGETALL or SISMEMBER).
  3. Synchronized Cache Expiry: Worker processes lacked unique random seeds, causing cache TTL jitters to synchronize across workers, leading to simultaneous cache expirations.
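The synchronized-expiry issue in point 3 can be illustrated with a minimal sketch (Python for illustration only — the project itself is Lua/OpenResty, and `jittered_ttl` is a hypothetical helper): when every worker shares the same seed, they all draw the same jitter, so their cache entries still expire in lockstep.

```python
import random

def jittered_ttl(rng, base_ttl, jitter_frac=0.2):
    """Stretch base_ttl by a random factor so entries written at the
    same moment do not all expire at the same instant."""
    return base_ttl * (1 + jitter_frac * rng.random())

# Two "workers" seeded identically (the bug): their jitter sequences
# match, so their cached entries still expire together.
worker_a, worker_b = random.Random(1), random.Random(1)
synced = jittered_ttl(worker_a, 30) == jittered_ttl(worker_b, 30)

# Distinct per-worker seeds (the fix) desynchronize the expirations.
worker_c = random.Random(2)
desynced = jittered_ttl(random.Random(1), 30) != jittered_ttl(worker_c, 30)
```

With identical seeds `synced` is true and the jitter is useless; distinct seeds make `desynced` true, spreading the expirations out.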

Proposed Changes

  • Generic Singleflight Abstraction: Introduced utils:singleflight_do, a cross-worker synchronization mechanism inspired by Go's sync/singleflight. It uses resty.lock with a Double-Checked Locking pattern and xpcall with tracebacks for robust error handling.
  • Harden Metadata Resolution: Applied singleflight protection to get_backend_address. Redis HGETALL queries are now limited to at most one per key globally during cache misses.
  • Optimize Faulty Backend Detection: Switched to $O(1)$ SISMEMBER checks and fixed the missing cache population.
  • Dedicated Lock Dictionaries: Added local_cache_locks and faulty_backend_locks in nginx.conf to isolate lock states from data entries, preventing LRU eviction conflicts.
  • Seed Worker Randomness: Fixed random generator seeding in init_worker_phase.lua to ensure effective TTL jittering across all Nginx workers.
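The singleflight idea behind `utils:singleflight_do` can be sketched roughly as follows (threaded Python for illustration; the PR's actual version is Lua using `resty.lock` across Nginx workers, and the class and names here are hypothetical): the first caller for a key does the work, and concurrent callers for the same key wait and share its result.

```python
import threading
import time

class SingleFlight:
    """Deduplicate concurrent calls: for each key, one caller runs fn,
    the rest block until it finishes and receive the shared result."""

    def __init__(self):
        self._mu = threading.Lock()
        self._calls = {}  # key -> {"done": Event, "result": ...}

    def do(self, key, fn):
        with self._mu:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "result": None}
                self._calls[key] = call
        if leader:
            try:
                call["result"] = fn()
            finally:
                call["done"].set()
                with self._mu:
                    self._calls.pop(key, None)
        else:
            call["done"].wait()
        return call["result"]

# Demo: five concurrent cache misses for the same key trigger a single
# (simulated) Redis round trip.
counter = {"loads": 0}

def fetch_backend():
    counter["loads"] += 1
    time.sleep(0.2)  # simulate a slow HGETALL
    return "10.0.0.7:8080"

sf = SingleFlight()
results = []
threads = [threading.Thread(target=lambda: results.append(sf.do("sandbox:1", fetch_backend)))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the Lua version, re-checking the cache after acquiring the `resty.lock` is the double-checked-locking step; here the dictionary lookup under `_mu` plays the same role.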

…end checks

- Abstracted singleflight logic into a reusable `utils:singleflight_do` method (Go-like interface).
- Optimized faulty backend checks by switching to O(1) SISMEMBER and preventing cache stampedes.
- Hardened sandbox metadata fetching in `get_backend_address` to prevent Redis thundering herd.
- Introduced dedicated shared memory lock dictionaries for better cache isolation.

Signed-off-by: novahe <heqianfly@gmail.com>
@novahe novahe force-pushed the fix/faulty-backend-singleflight branch from d5ecb25 to f52eda9 on May 4, 2026 02:38
@staryxchen
Collaborator

Thanks for the PR.

Singleflight for routing — I agree this is valuable, but have concerns about introducing locks in the data path. CubeProxy sits on every request's critical path. resty.lock with a 5s timeout means that if Redis has a slow moment, all coroutines hitting the same key will queue up and block. This converts a brief Redis pressure spike into a deterministic latency cliff. For a data-plane component I'd prefer lock-free approaches.

In practice, the existing refresh-on-hit already prevents stampedes for active sandboxes (the vast majority). The real gap is: 1) Cold start — all workers start with empty cache simultaneously; 2) Synchronized TTL — workers share the same random seed.

math.randomseed — As analyzed above, this is a good catch 👍

is_faulty_backend rework — Not caching the faulty state is intentional: we want to avoid a stale "faulty" flag in cache/Redis persistently rejecting requests after a backend recovers. The SMEMBERS→SISMEMBER optimization is nice, but bundled with the caching change it alters the failure-recovery contract.

Removing refresh-on-hit — This overlaps with #133. Without live migration support I'd prefer to keep it for now.

Suggestion: extract the math.randomseed fix and the SMEMBERS→SISMEMBER change as two standalone PRs. For the stampede protection, let's discuss lock-free alternatives (e.g. stale-while-revalidate).
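The stale-while-revalidate alternative mentioned here can be sketched like this (illustrative Python; `SWRCache` and its methods are hypothetical names, and an OpenResty version would instead keep entries in a shared dict and refresh via a timer): expired entries are served immediately while at most one background refresh per key fetches the new value, so readers never queue behind a slow backend.

```python
import threading
import time

class SWRCache:
    """Stale-while-revalidate: serve expired entries instantly while a
    single background refresh per key updates the value."""

    def __init__(self, ttl, loader):
        self.ttl = ttl
        self.loader = loader
        self._mu = threading.Lock()
        self._data = {}           # key -> (value, expires_at)
        self._refreshing = set()  # keys with an in-flight refresh

    def get(self, key):
        with self._mu:
            entry = self._data.get(key)
            fresh = entry is not None and time.monotonic() < entry[1]
            start_refresh = not fresh and key not in self._refreshing
            if start_refresh:
                self._refreshing.add(key)
        if fresh:
            return entry[0]
        if entry is None:
            # Cold miss: nothing to serve stale, so load synchronously.
            return self._store(key, self.loader(key))
        if start_refresh:
            threading.Thread(target=lambda: self._store(key, self.loader(key))).start()
        return entry[0]  # stale value served without blocking

    def _store(self, key, value):
        with self._mu:
            self._data[key] = (value, time.monotonic() + self.ttl)
            self._refreshing.discard(key)
        return value

# Demo: after expiry the old value is still returned while the refresh runs.
loads = []

def loader(key):
    loads.append(key)
    return len(loads)

cache = SWRCache(ttl=0.05, loader=loader)
first = cache.get("k")      # cold miss: synchronous load -> 1
time.sleep(0.1)             # entry expires
stale = cache.get("k")      # stale hit: still 1; refresh starts in background
time.sleep(0.1)             # background refresh has completed
refreshed = cache.get("k")  # refreshed value -> 2
```

The trade-off versus singleflight is bounded staleness instead of blocking: no request ever waits on a lock, at the cost of serving a value up to one refresh interval old.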

@novahe
Contributor Author

novahe commented May 4, 2026

@staryxchen Thanks for your thorough review! I've taken your advice and started splitting this PR into separate, more manageable ones.

I'll close this PR once all parts are moved. Thanks!

