Skip to content

Conversation

@armstrong0704
Copy link

Problem

Previously, flushing the EphemeralFile to disk held a write lock on the struct. During slow disk IO (10ms - 50ms+), this blocked all concurrent readers (GetPage), causing latency spikes and throughput collapse.

See #12172 for the full context.

Summary of changes

  1. Decouple the 'Frozen' flush buffer from the 'Mutable' tail.
  2. Use arc_swap::ArcSwap to publish the frozen buffer atomically. Readers access it wait-free.
  3. Add disk_committed_offset to distinguish between data in the buffer vs. disk, preventing race conditions.
  4. Add writer_lock to serialize concurrent writers (WAL append-only integrity) without blocking readers.

Verification & Benchmarks (Local NVMe + Synthetic 50ms Latency):

@armstrong0704 armstrong0704 marked this pull request as ready for review November 30, 2025 05:47
@armstrong0704 armstrong0704 requested a review from a team as a code owner November 30, 2025 05:47
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey. Thanks for the PR. Appreciate you taking an interest in this and taking the time to work on this.

In the current form the changes here are very broad. This part of the code base is rather delicate (there's a number of non-obvious invariants) and we've tried to incrementally evolve it over time with particular attention to correctness. As it stands, convincing myself this is all correct would take a substantial amount of time and it carries a lot of risk.

If you're interested I suggest a rework that preserves the overall code structure and focuses on the issue of holding the writer lock while reading from the flush buffer.
I think changing BufferedWriter::maybe_flushed to ArcSwap<Option<FullSlice<B::IoBuf>>> and making BufferedWriter::inspect_maybe_flushed to return the current value in the arc swap would be sufficient. Then you could drop the write lock after reading from the mutable buffer.

Refactors BufferedWriter to use ArcSwap for the flushing buffer, enabling wait-free reads in EphemeralFile.

Problem:
When EphemeralFile flushes to disk, it holds a write lock on the BufferedWriter. During slow disk I/O (10ms+), this blocks all concurrent GetPage requests (readers), causing latency spikes.

Solution:
1. Modified BufferedWriter to store maybe_flushed in an ArcSwap.
2. Updated flush() to atomically publish the buffer to readers before blocking on the underlying write operation.
3. Updated EphemeralFile to check this atomic pointer lock-free before falling back to the mutex.

Outcome:
- Eliminates read stalls: Readers continue serving hits from the flushing buffer while disk I/O blocks the writer.
- Preserves architecture: Keeps the existing flow-control and pipelining logic of BufferedWriter (addressing review feedback).
- Verified: ~28x improvement in read throughput under simulated latency.

Fixes neondatabase#12172
@armstrong0704 armstrong0704 force-pushed the fix/ephemeral-file-arcswap branch from 7ec3202 to 643c929 Compare December 6, 2025 06:23
@armstrong0704
Copy link
Author

@VladLazar Thanks for the review and recommendations. I have updated the PR based on your feedback.

  • Refactor: I moved the ArcSwap logic inside BufferedWriter instead of rewriting EphemeralFile. This preserves the existing write pipelining and flow control while exposing the flushing buffer wait-free.

  • Performance: The results are even better with this cleaner abstraction. Benchmarks under simulated load (50ms disk latency) show a ~28x improvement in read throughput compared to main (readers remain completely non-blocking during flushes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pageserver: improve IO concurrency for in-memory layers

2 participants