[NOTIX] Close connection instead of noop-draining stale write ACKs #48

Draft

mattwd7 wants to merge 1 commit into braze-main
…t is large

When a server has @pending_multi_response set (i.e. quiet-mode writes were pipelined on this connection), the previous code called noop() to drain all pending server ACKs before the next operation. noop() calls flush_response in a loop, one read per pipelined write ACK, and runs inside ring.lock, blocking every thread waiting for the lock. In pathological conditions (many quiet writes accumulated plus a slow Memcached node), this drain can take O(N × socket_timeout) seconds. With N=hundreds and socket_timeout=450 ms, that easily exceeds a 60-second Rack timeout, producing long-running cache spans in Datadog and ultimately Rack::Timeout::RequestTimeoutException.

Add @pending_write_count tracking to all quiet write operations (set, add, replace, delete). In request() and multi_response_start(), compare the count against MAX_SAFE_DRAIN_COUNT (10). When at or below the threshold, drain with noop as before; this is the common case and costs only one Memcached round-trip. When above the threshold, close the connection instead. Closing is safe because quiet-mode write ACKs are fire-and-forget by design; the writes themselves have already been flushed to Memcached's TCP receive buffer. The connection is re-established immediately via alive?() before the pending operation proceeds.

The threshold keeps TCP churn at baseline levels for normal traffic while capping worst-case hold time at MAX_SAFE_DRAIN_COUNT × socket_timeout (10 × 450 ms = 4.5 s worst case) before the close path kicks in.

Made-with: Cursor
Summary
Introduce `@pending_write_count` tracking on Dalli server connections and replace the unconditional `noop` drain with a threshold-based strategy: drain normally (fast, no reconnect) when few ACKs are pending; close and reconnect when the count is large enough to risk blocking `ring.lock` for seconds or minutes.

Context
We observed infrequent but severe `Rack::Timeout::RequestTimeoutException` errors across several production clusters. Datadog traces showed `rails.cache GET` spans taking 2 minutes 41 seconds, far exceeding Dalli's 450 ms `socket_timeout`. The hanging span always pointed to a `cache.read` call whose call stack passed through `Security::Permissions::Service#fetch_multi` for developer permissions.

Root Cause
The mechanism requires three conditions occurring together:
1. Quiet-mode pipelining accumulates unread ACKs
`Appboy::Cache#with_quiet_mutations` sets `Thread.current[:dalli_multi] = true` for all write operations. In Dalli, this puts writes into "quiet" mode: each write command is sent to Memcached without reading the response. After the write, the server's `@pending_multi_response` flag is set to `true` on that connection, indicating unread ACKs are sitting in the socket's receive buffer.

For a `PermissionService#fetch_multi` call on a cold cache (e.g., after a deploy), dozens to hundreds of developer permission entries are written back to Memcached in quiet mode on the same connection. Each write leaves one ACK pending on that connection.
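For illustration, the quiet-write path looks roughly like this (a self-contained sketch with invented names such as `SketchServer` and `packet`, not Dalli's verbatim source):

```ruby
# Sketch of a quiet-mode write. In quiet mode the command is written
# but no response is read, so one ACK per write accumulates in the
# socket's receive buffer.
class SketchServer
  def initialize(sock)
    @sock = sock
    @pending_multi_response = false
    @pending_write_count = 0               # counter added by this PR
  end

  def set(key, value)
    if Thread.current[:dalli_multi]        # quiet mode enabled
      @sock.write(packet(:setq, key, value)) # send command, skip read
      @pending_multi_response = true       # unread ACK now pending
      @pending_write_count += 1
    else
      @sock.write(packet(:set, key, value))
      read_response                        # normal mode: consume ACK now
    end
  end

  private

  # Stand-ins for Dalli's real binary-protocol serialization and
  # response parsing, elided here.
  def packet(op, key, value) = "#{op} #{key} #{value}\r\n"
  def read_response          = @sock.read(24)
end
```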
2. noop-drain blocks under ring.lock

When the next non-pipelined operation (e.g., `cache.read('account_status')`) arrives on the same connection that has `@pending_multi_response = true`, Dalli's `Server#request` detects it and calls `noop()` to drain all pending ACKs. `noop()` calls `flush_response` in a loop, reading and discarding one ACK per iteration until it receives the NOOP sentinel.

This entire drain loop executes inside `ring.lock`, Dalli's global mutex over the consistent-hash ring. While the drain is running, no other thread can make any Dalli request.
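Schematically, that drain is (an illustrative sketch; `noop_packet` and the `:noop_seen` sentinel are invented names standing in for Dalli's binary-protocol details):

```ruby
# Sketch of the previous unconditional drain. Each flush_response call
# is a blocking socket read that can take up to socket_timeout, and
# the caller holds ring.lock for the whole loop, so N pending ACKs can
# stall all other Dalli traffic for up to N * socket_timeout seconds.
def noop
  write(noop_packet)              # ask the server to echo a NOOP
  loop do
    resp = flush_response         # read and discard one pending ACK
    break if resp == :noop_seen   # sentinel: receive buffer drained
  end
end
```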
3. Slow Memcached responses amplify the hold time

Under load (or during periods of increased GC pressure from aggressive `RUBY_GC_MALLOC_LIMIT` settings), Memcached response times can creep toward the 450 ms socket timeout. With N = hundreds of pending ACKs at ~440 ms per ACK, the drain takes N × 440 ms, i.e. minutes (for example, 350 pending ACKs ≈ 154 seconds, the same order as the observed 2 min 41 s span). Every thread waiting for `ring.lock` during this time is blocked, and eventually Rack's 60-second request timeout fires.

Approach
Track `@pending_write_count` on each server connection and make the drain decision based on the count. When `@pending_multi_response` is true and a non-quiet operation arrives (see the sketch after this list):

- At or below `MAX_SAFE_DRAIN_COUNT` (10): drain with `noop` as before. This is the common case; the drain is fast (one round-trip per ACK, typically <5 ms total) and avoids any TCP reconnect overhead, preserving baseline performance for the vast majority of requests.
- Above `MAX_SAFE_DRAIN_COUNT`: close the connection and reconnect via `alive?()`. This is the pathological case: draining here would hold `ring.lock` for an unbounded O(N × socket_timeout), whereas the noop path above is capped at `MAX_SAFE_DRAIN_COUNT × socket_timeout` (10 × 450 ms = 4.5 s worst-case). Closing is semantically safe because quiet-mode write ACKs are fire-and-forget by design; the writes themselves have already been flushed to Memcached's TCP receive buffer. Reconnect cost is ~1 ms (TCP handshake + VERSION command).
The threshold of 10 means:

- normal traffic, with a handful of quiet writes per request, keeps the fast `noop` drain and baseline TCP churn, and
- the pathological case is bounded by a 4.5-second worst-case `ring.lock` hold.

Why not always close instead of drain?
The first version of this fix replaced `noop` with an unconditional close + reconnect. This was identified as problematic: since every write in `Appboy::Cache` goes through `with_quiet_mutations`, `@pending_multi_response` is set after virtually every write. With LIFO pool checkout behavior, the same connection is immediately reused for the next read, meaning a reconnect would occur on nearly every write→read transition throughout the application lifetime. This would substantially increase TCP churn and TIME_WAIT socket accumulation, potentially exhausting local ephemeral ports on high-traffic pods.

Testing
Three new integration tests (all running against a live Memcached process):
- `pending_write_count` increments on quiet sets, resets on drain: uses `Thread.current[:dalli_multi]` directly to issue 3 quiet writes, asserts `@pending_write_count` is 3, then issues a read and asserts the count resets to 0 with `@pending_multi_response` false (a sketch of this test follows the list).
- `multi_response_start` drains with noop when write count is at threshold: sets `@pending_write_count = MAX_SAFE_DRAIN_COUNT` directly, asserts `noop` is called once and the socket object is unchanged (no reconnect).
- `multi_response_start` closes and reconnects when write count exceeds threshold: sets `@pending_write_count = MAX_SAFE_DRAIN_COUNT + 1`, asserts the socket is a new object after the call.

The same threshold behavior is exercised on the `request()` path as well. All 267 existing tests continue to pass.
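For reference, the first test might look roughly like this (a sketch only, assuming a live Memcached and a hypothetical `connect_server` helper; the real tests and Dalli's exact `request` argument shape may differ):

```ruby
require 'minitest/autorun'

class PendingWriteCountTest < Minitest::Test
  def test_quiet_sets_increment_count_and_read_resets_it
    server = connect_server            # hypothetical helper returning a
                                       # connected Dalli server object
    Thread.current[:dalli_multi] = true
    3.times { |i| server.request(:set, "key#{i}", 'value', 300, {}) }
    assert_equal 3, server.instance_variable_get(:@pending_write_count)

    Thread.current[:dalli_multi] = nil
    server.request(:get, 'key0')       # non-quiet op triggers the drain
    assert_equal 0, server.instance_variable_get(:@pending_write_count)
    refute server.instance_variable_get(:@pending_multi_response)
  end
end
```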
Risks & Rollback
- Reconnect frequency: reconnects now only occur when `@pending_write_count > 10`, which in practice means requests that issued a very large number of quiet writes on the same connection before a read. Normal requests with a handful of cache writes per request are unaffected and continue to use the `noop` drain path with no connection churn.
- Worst-case drain time is now bounded: even on the `noop` path, the maximum drain time is capped at `MAX_SAFE_DRAIN_COUNT × socket_timeout` = 4.5 seconds before the close path kicks in. This doesn't fully eliminate the problem but reduces its severity substantially for moderate write counts.
- Blast radius: this change only activates when `@pending_multi_response` is true, which only happens after quiet-mode writes. Applications not using Dalli's quiet mode are completely unaffected.
- Rollback: revert this commit and deploy the previous tag. No data-format or protocol changes are involved.