[VQueues] Introducing inbox caching and read in batches#4674
AhmedSoliman merged 1 commit into main.
Conversation
Test Results: 8 files (+1), 8 suites (+1), 4m 47s ⏱️ (+2m 0s). Results for commit 186893a; comparison against base commit b128276. This pull request removes 4 tests and adds 7. Note that renamed tests count towards both.
Note to reviewers: this is currently being extended to run async refills. I will merge the two PRs into one when ready.

Thanks for the heads up, Ahmed. I'll wait for that to happen before reviewing.
Force-pushed from b6ae6ca to b81bf5d.
@AhmedSoliman did this change already land in this PR?
Yes. This has been updated. |
Force-pushed from 5e001fb to fbc586e.
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6fb18d9b1
Force-pushed from 667d42e to 00b95db.
tillrohrmann left a comment:
Awesome work @AhmedSoliman 🚀 The new asynchronous queue iterator looks really great. As far as I can tell, the logic looks really solid. So +1 for merging :-)
Out of curiosity: did you measure the impact of this change compared to the previous attempt that used tailing iterators? I know that tailing iterators have correctness problems, which is why we needed to get rid of them.
// We re-create this reader on every refill, so a fresh snapshot is what
// we want. A tailing iterator would see new writes but is unsafe across
// memtable flushes.
Maybe add readopts.set_tailing(false) to give context to the comment.
fn new_inbox_reader(&self, qid: &VQueueId, opts: Options) -> Self::InboxReader;
}

- /// Iterator over vqueue entries
+ /// Iterator over "waiting inbox" vqueue entries
By "inbox", the inbox stage of the queue is meant, right?
Yes. I will rename the waiting reader to be the inbox reader, but in a separate commit so as not to disturb the PR stack.
RefillState::Standby { refill_anchor } => {
    *refill_anchor = new_anchor;
}
Can the new anchor be larger than the existing one or can it only shrink?
Both. `update_anchor` can grow the anchor (a refill task completed and discovered new keys), shrink it (cache eviction in enqueue reset it to `items.back()`), or leave it equal. The method is intentionally unconstrained.
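For illustration, here is a minimal, self-contained sketch of those unconstrained anchor semantics. The names (`RefillState`, `update_anchor`) mirror the snippet quoted in this thread, but the field types and surrounding details are assumptions, not the PR's actual code.

```rust
// Hypothetical sketch: the anchor is overwritten unconditionally, so it
// may grow, shrink, or stay equal. Only the Standby state carries one.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum RefillState {
    Standby { refill_anchor: Option<u64> },
    Running,
}

impl RefillState {
    fn update_anchor(&mut self, new_anchor: Option<u64>) {
        if let RefillState::Standby { refill_anchor } = self {
            *refill_anchor = new_anchor;
        }
    }
}

fn main() {
    let mut state = RefillState::Standby { refill_anchor: Some(100) };

    // Grow: a completed refill task discovered new keys past the old anchor.
    state.update_anchor(Some(250));
    assert_eq!(state, RefillState::Standby { refill_anchor: Some(250) });

    // Shrink: cache eviction on enqueue reset the anchor to items.back().
    state.update_anchor(Some(120));
    assert_eq!(state, RefillState::Standby { refill_anchor: Some(120) });

    // Equal: a no-op update is also allowed.
    state.update_anchor(Some(120));
    assert_eq!(state, RefillState::Standby { refill_anchor: Some(120) });
}
```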
// This branch handles when we don't have the item in cache.
//
// The removed item can be:
// - Cached
Didn't we just say that it's not in the cache in the line above?
self.items.pop_back();
self.items.insert(pos, (key, value));
refill_anchor = self.items.back().map(|(k, _)| *k);
break;
Why is it ok to break here if we insert the sorted item at `pos < self.items.len()`? When the algorithm starts, `self.items` contains items that are strictly smaller than what we read from RocksDB and what we have in the overlay, right? When merge-sorting into `self.items`, we should only ever push to the end, since both the storage items and the overlay are sorted. So could we enforce the invariant `assert_eq!(pos, self.items.len())`?
// Insert sorted in cache and ignore it if we already have it.
// If this item pushes us over the cache capacity, then we ignore it and reset
// the refill anchor to it.
let pos = match self.items.binary_search_by_key(&key, |&(k, _)| k) {
Why is a binary search needed here? Wouldn't we always push to the end of `self.items` whatever comes out of `items.into_iter().merge_join_by(overlay)`? Asked differently: what's the scenario where we wouldn't push to the end?
You're right, the `binary_search` is redundant in `poll_refill_task`. Will rewrite.
let head_key = match queue.head() {
    Some(QueueItem::Inbox { key, .. }) => *key,
    _ => panic!("expected inbox head"),
// It's very important is that we must reset the task to standby
- // It's very important is that we must reset the task to standby
+ // It's very important that we must reset the task to standby
/// **Bug demonstration.** When the overlay is at capacity and the back
/// entry is a tombstone, `push_added_item`'s `pop_back` silently drops
/// that tombstone. The merge then lets the (already-deleted) row into
/// the cache.
///
/// Layout at the moment of overflow:
///
/// ```text
/// overlay[0]             = Tombstone(seq=50)  // pre-tombstone
/// overlay[1]             = Tombstone(seq=100) // pre-tombstone
/// overlay[2..CAPACITY-1] = Add(seq=150..)     // CAPACITY-3 adds
/// overlay[CAPACITY-1]    = Tombstone(seq=500) // *r_target*
/// ```
///
/// The trigger add (seq=400) sorts at `pos = CAPACITY-1`, which is
/// `< overlay.len() == CAPACITY`, so `push_added_item` evicts the back
/// (`Tombstone(500)`) and inserts the trigger. After this, the overlay
/// holds only `[T(50), T(100), Add(150..), Add(400)]`; the tombstone
/// for `r_target` is gone.
///
/// Storage is `[r_target=seq500]` (a single row that's been deleted but
/// is still in the in-flight task's snapshot, faithful to the race the
/// invariant doc on `RefillTask` describes). The merge produces:
///
/// ```text
/// [T(50), T(100), Add(150..), Add(400), Left(seq500)]
/// ```
///
/// With the tombstone evicted there's nothing to suppress `Left(seq500)`,
/// so the merge inserts it into the cache. The drain then sees `seq500`,
/// which is the bug.
#[restate_core::test]
async fn tombstone_evicted_on_overlay_overflow_leaks_deleted_row() {
The description and the test method read as if the bug still exists, but I guess the horizon filters out the problematic entry. Maybe add a clarification that this problem is gone.
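For readers following along, here is a minimal sketch of the horizon mechanism under assumed shapes (`merge_with_horizon` is hypothetical; the PR's real merge works on key/value pairs and overlay events). The horizon is an exclusive upper bound set on overlay overflow: storage rows at or above it are dropped from the merge rather than trusted against an overlay that may have lost a tombstone, and are rediscovered on the next refill.

```rust
// Drop storage rows at or above the horizon, then merge in the overlay's
// surviving adds. Keys stand in for full queue entries.
fn merge_with_horizon(
    storage: Vec<u64>,
    overlay_adds: Vec<u64>,
    horizon: Option<u64>,
) -> Vec<u64> {
    let mut out: Vec<u64> = storage
        .into_iter()
        // Rows at or above the horizon will be rediscovered, with fresh
        // tombstone information, on the next refill.
        .filter(|&k| horizon.map_or(true, |h| k < h))
        .chain(overlay_adds)
        .collect();
    out.sort_unstable();
    out
}

fn main() {
    // r_target (seq 500) was deleted, but its tombstone was evicted from
    // the overlay and the in-flight snapshot still contains the row.
    let storage = vec![500];
    let overlay_adds = vec![150, 400];

    // Without a horizon, the deleted row leaks into the cache:
    assert_eq!(
        merge_with_horizon(storage.clone(), overlay_adds.clone(), None),
        vec![150, 400, 500]
    );

    // With the horizon at the evicted tombstone's key, it does not:
    assert_eq!(
        merge_with_horizon(storage, overlay_adds, Some(500)),
        vec![150, 400]
    );
}
```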
@tillrohrmann To answer your main question: no, I haven't run a comparative analysis against tailing iterators, but we're steering away from them for correctness reasons, so we have no real alternative.
Replaces the single-head cache with a sorted 24-entry per-queue cache.
Refills run via tokio::task::spawn_blocking when data isn't in the
block cache; in-flight notify_enqueued / notify_removed events are
buffered as an overlay (Add / Tombstone) and merged on completion.
On overlay overflow we set a horizon (exclusive upper bound) instead
of back-eviction: a popped tombstone could otherwise re-admit a
deleted row. Storage rows at or above the horizon are dropped from
the merge and rediscovered on the next refill.
Drops the tailing iterator and its workarounds. Splits VQueueCursor
into inbox (returns CursorError; WouldBlock under non-blocking opts)
and VQueueRunningCursor (sync).
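As a rough illustration of the overlay described above (a sketch with assumed names; `OverlayOp` and `apply_overlay` are hypothetical stand-ins for the PR's internals): while a refill's blocking read is in flight, enqueue/remove notifications are buffered as `Add`/`Tombstone` events and merged once it completes.

```rust
// Events buffered while the spawn_blocking read runs.
#[derive(Clone)]
enum OverlayOp {
    Add(&'static str), // notify_enqueued
    Tombstone,         // notify_removed
}

// Apply the buffered overlay to the snapshot the blocking read produced.
fn apply_overlay(
    snapshot: Vec<(u64, &'static str)>,
    overlay: Vec<(u64, OverlayOp)>,
) -> Vec<(u64, &'static str)> {
    let mut merged = snapshot;
    for (key, op) in overlay {
        match op {
            // Insert sorted, ignoring keys the snapshot already saw.
            OverlayOp::Add(v) => {
                if let Err(pos) = merged.binary_search_by_key(&key, |&(k, _)| k) {
                    merged.insert(pos, (key, v));
                }
            }
            // Suppress the row even if the snapshot still contains it.
            OverlayOp::Tombstone => {
                if let Ok(pos) = merged.binary_search_by_key(&key, |&(k, _)| k) {
                    merged.remove(pos);
                }
            }
        }
    }
    merged
}

fn main() {
    let snapshot = vec![(10, "a"), (20, "b")]; // what the blocking read saw
    let overlay = vec![
        (20, OverlayOp::Tombstone),  // removed mid-refill
        (30, OverlayOp::Add("c")),   // enqueued mid-refill
    ];
    assert_eq!(apply_overlay(snapshot, overlay), vec![(10, "a"), (30, "c")]);
}
```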
Stack created with Sapling. Best reviewed with ReviewStack.