
[NOTIX] Close connection instead of noop-draining stale write ACKs #48

Draft
mattwd7 wants to merge 1 commit into braze-main from NOTIX-avoid-hanging-noops

Conversation

@mattwd7 mattwd7 commented Apr 22, 2026

Summary

Introduce @pending_write_count tracking on Dalli server connections and replace the unconditional noop drain with a threshold-based strategy: drain normally (fast, no reconnect) when few ACKs are pending; close and reconnect when the count is large enough to risk blocking ring.lock for seconds or minutes.

Context

We observed infrequent but severe Rack::Timeout::RequestTimeoutException errors across several production clusters. Datadog traces showed rails.cache GET spans taking 2 minutes 41 seconds, far exceeding Dalli's 450ms socket_timeout. The hanging span always pointed to a cache.read call whose call stack passed through Security::Permissions::Service#fetch_multi for developer permissions.

Root Cause

The mechanism requires three conditions occurring together:

1. Quiet-mode pipelining accumulates unread ACKs

Appboy::Cache#with_quiet_mutations sets Thread.current[:dalli_multi] = true for all write operations. In Dalli, this puts writes into "quiet" mode: each write command is sent to Memcached without reading the response. After the write, @pending_multi_response is set to true on that server connection — a flag indicating unread ACKs are sitting in the socket's receive buffer.

For a PermissionService#fetch_multi call on a cold cache (e.g., after a deploy), dozens to hundreds of developer permission entries are written back to Memcached in quiet mode on the same connection. Each write leaves one ACK pending on that connection.
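
A minimal sketch of that quiet-write path, using the names from this description (the quiet_set_request/set_request/read_response helpers are assumptions, and real Dalli internals differ in detail):

```ruby
# Simplified model of a quiet-mode write on a Dalli server connection.
def set(key, value)
  if Thread.current[:dalli_multi]                  # quiet mode (with_quiet_mutations)
    @socket.write(quiet_set_request(key, value))   # send the write, don't read the ACK
    @pending_multi_response = true                 # unread ACK now sits in the recv buffer
    @pending_write_count += 1                      # counter introduced by this PR
  else
    @socket.write(set_request(key, value))
    read_response                                  # normal mode: consume the ACK now
  end
end
```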

2. noop-drain blocks under ring.lock

When the next non-pipelined operation (e.g., cache.read('account_status')) arrives on the same connection that has @pending_multi_response = true, Dalli's Server#request detects it and calls noop() to drain all pending ACKs. noop() calls flush_response in a loop, reading and discarding one ACK per iteration until it receives the NOOP sentinel.

This entire drain loop executes inside ring.lock — Dalli's global mutex over the consistent-hash ring. While the drain is running, no other thread can make any Dalli request.
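
A sketch of the pre-fix drain as described (names follow this PR; NOOP_REQUEST, NOOP_SENTINEL, and flush_response are stand-ins for the real internals). The caller holds ring.lock for the entire loop:

```ruby
# Pre-fix drain: one blocking read per pending ACK, each read bounded
# only by the 450ms socket_timeout.
def noop
  @socket.write(NOOP_REQUEST)           # sentinel: server answers it only after
  loop do                               # every queued write ACK has been sent
    response = flush_response           # blocking read of one pending ACK
    break if response == NOOP_SENTINEL  # sentinel reached: buffer is drained
  end
  @pending_multi_response = false
end
```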

3. Slow Memcached responses amplify the hold time

Under load (or during periods of increased GC pressure from aggressive RUBY_GC_MALLOC_LIMIT settings), Memcached response times can creep toward the 450ms socket timeout. With hundreds of pending ACKs at ~440ms each, the drain takes N × 440ms — minutes in total (for example, ~365 pending ACKs × 440ms ≈ 161s, matching the 2m41s spans observed in Datadog). Every thread waiting for ring.lock during this time is blocked, and eventually Rack's 60-second request timeout fires.

Approach

Track @pending_write_count on each server connection and make the drain decision based on count.

When @pending_multi_response is true and a non-quiet operation arrives:

  • At or below MAX_SAFE_DRAIN_COUNT (10): drain with noop as before. This is the common case — the drain is fast (one round-trip per ACK, typically <5ms total) and avoids any TCP reconnect overhead. This preserves baseline performance for the vast majority of requests.
  • Above MAX_SAFE_DRAIN_COUNT: close the connection and reconnect via alive?(). This is the pathological case — without the threshold, draining would hold ring.lock for O(N × socket_timeout), unbounded as N grows; with it, the drain path's worst case is capped at MAX_SAFE_DRAIN_COUNT × socket_timeout (10 × 450ms = 4.5s). Closing is semantically safe because quiet-mode write ACKs are fire-and-forget by design; the writes themselves have already been flushed to Memcached's TCP receive buffer. Reconnect cost is ~1ms (TCP handshake + VERSION command). A sketch of the decision appears below.

The threshold of 10 means:

  • Common case (1–5 writes per request, fast Memcached): no reconnect, latency unchanged.
  • Pathological case (100+ cold-cache permission writes, slow Memcached): one reconnect instead of a 2-minute ring.lock hold.
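
Putting the two branches together, a minimal sketch of the decision point (MAX_SAFE_DRAIN_COUNT and the instance variables come from this PR; the method name resolve_pending_responses and the surrounding structure are assumptions):

```ruby
MAX_SAFE_DRAIN_COUNT = 10

def resolve_pending_responses
  return unless @pending_multi_response

  if @pending_write_count <= MAX_SAFE_DRAIN_COUNT
    noop       # common case: drain — one round-trip per ACK, no reconnect
  else
    close      # pathological case: discard the ACK backlog wholesale
    alive?     # re-establishes the connection (~1ms: TCP handshake + VERSION)
  end
  @pending_write_count = 0
  @pending_multi_response = false
end
```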

Why not always close instead of drain?

The first version of this fix replaced noop with unconditional close + reconnect. This was identified as problematic: since every write in Appboy::Cache goes through with_quiet_mutations, @pending_multi_response is set after virtually every write. With LIFO pool checkout behavior, the same connection is immediately reused for the next read, meaning a reconnect would occur on nearly every write→read transition throughout the application lifetime. This would substantially increase TCP churn and TIME_WAIT socket accumulation, potentially exhausting local ephemeral ports on high-traffic pods.
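
To illustrate the rejected approach's cost, consider this hypothetical everyday sequence (Appboy::Cache API as described above):

```ruby
cache.with_quiet_mutations do
  cache.write("recent_event", payload)  # quiet write: @pending_multi_response set
end
# LIFO checkout hands the same connection to the very next operation:
cache.read("account_status")            # unconditional close → reconnect on every
                                        # write→read transition, app-wide TCP churn
```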

Testing

New integration tests, all running against a live Memcached process (a sketch of the first appears after the list):

  • pending_write_count increments on quiet sets, resets on drain — uses Thread.current[:dalli_multi] directly to issue 3 quiet writes, asserts @pending_write_count = 3, then issues a read and asserts count resets to 0 with @pending_multi_response = false.
  • multi_response_start drains with noop when write count is at threshold — sets @pending_write_count = MAX_SAFE_DRAIN_COUNT directly, asserts noop is called once and the socket object is unchanged (no reconnect).
  • multi_response_start closes and reconnects when write count exceeds threshold — sets @pending_write_count = MAX_SAFE_DRAIN_COUNT + 1, asserts socket is a new object after the call.
  • Matching pair of tests for the request() path.
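
A hedged sketch of the first test (Minitest style; the memcached_host helper and the instance_variable_get introspection are assumptions, not the actual diff):

```ruby
def test_pending_write_count_increments_on_quiet_sets_and_resets_on_drain
  client = Dalli::Client.new(memcached_host)
  server = client.send(:ring).servers.first

  Thread.current[:dalli_multi] = true
  3.times { |i| client.set("key#{i}", "value#{i}") }  # quiet writes, ACKs unread
  assert_equal 3, server.instance_variable_get(:@pending_write_count)

  Thread.current[:dalli_multi] = nil
  client.get("key0")                                  # non-quiet op triggers the drain
  assert_equal 0, server.instance_variable_get(:@pending_write_count)
  refute server.instance_variable_get(:@pending_multi_response)
ensure
  Thread.current[:dalli_multi] = nil
end
```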

All 267 existing tests continue to pass.

Risks & Rollback

Reconnect frequency: Reconnects now only occur when @pending_write_count > 10, which in practice means requests that had a very large number of quiet writes on the same connection before a read. Normal requests with a handful of cache writes per request are unaffected and continue to use the noop drain path with no connection churn.

Worst-case drain time is now bounded: even on the noop path, the maximum drain time is MAX_SAFE_DRAIN_COUNT × socket_timeout = 4.5 seconds, since any larger backlog takes the close path instead. This doesn't fully eliminate lock contention for moderate write counts, but it reduces its severity substantially.

Blast radius: This change only activates when @pending_multi_response is true, which only happens after quiet-mode writes. Applications not using Dalli's quiet mode are completely unaffected.

Rollback: Revert this commit and deploy the previous tag. No data-format or protocol changes are involved.

…t is large

When a server has @pending_multi_response set (i.e. quiet-mode writes
were pipelined on this connection), the previous code called noop() to
drain all pending server ACKs before the next operation. noop() calls
flush_response in a loop — one read per pipelined write ACK — and runs
inside ring.lock, blocking every thread waiting for the lock.

In pathological conditions (many quiet writes accumulated + slow
Memcached node), this drain can take O(N × socket_timeout) seconds.
With N=hundreds and socket_timeout=450 ms, that easily exceeds a
60-second Rack timeout, producing long-running cache spans in Datadog
and ultimately Rack::Timeout::RequestTimeoutException.

Add @pending_write_count tracking to all quiet write operations (set,
add, replace, delete). In request() and multi_response_start(), compare
the count against MAX_SAFE_DRAIN_COUNT (10). When at or below the
threshold, drain with noop as before — this is the common case and costs
only one Memcached round-trip. When above the threshold, close the
connection instead.

Closing is safe because quiet-mode write ACKs are fire-and-forget by
design; the writes themselves have already been flushed to Memcached's
TCP receive buffer. The connection is re-established immediately via
alive?() before the pending operation proceeds. The threshold keeps
TCP churn at baseline levels for normal traffic while capping the
worst-case lock hold time at MAX_SAFE_DRAIN_COUNT × socket_timeout
(10 × 450ms = 4.5s); any larger backlog takes the close path instead.

Made-with: Cursor
mattwd7 force-pushed the NOTIX-avoid-hanging-noops branch from 08a643d to c8455e7 on April 22, 2026 at 21:36
