[VQueues] Introducing inbox caching and read in batches#4674
AhmedSoliman merged 1 commit into main.
Conversation
Test Results: 8 files (+1), 8 suites (+1), 4m 47s ⏱️ (+2m 0s). Results for commit 186893a; comparison against base commit b128276. This pull request removes 4 tests and adds 7. Note that renamed tests count towards both.
Note to reviewers: this is currently being extended to run async refills. I will merge the two PRs into one when ready.

Thanks for the heads up, Ahmed. I'll wait for that to happen before reviewing.
Force-pushed from b6ae6ca to b81bf5d.
@AhmedSoliman did this change already land in this PR?
Yes. This has been updated. |
Force-pushed from 5e001fb to fbc586e.
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6fb18d9b1
Force-pushed from 667d42e to 00b95db.
tillrohrmann left a comment:
Awesome work @AhmedSoliman 🚀 The new asynchronous queue iterator looks really great. As far as I can tell, the logic looks really solid. So +1 for merging :-)
Out of curiosity: did you measure the impact of this change compared to the previous attempt that used tailing iterators? I know that tailing iterators have correctness problems, which is why we needed to get rid of them.
// We re-create this reader on every refill, so a fresh snapshot is what
// we want. A tailing iterator would see new writes but is unsafe across
// memtable flushes.
Maybe add readopts.set_tailing(false) to give context to the comment.
fn new_inbox_reader(&self, qid: &VQueueId, opts: Options) -> Self::InboxReader;
}

- /// Iterator over vqueue entries
+ /// Iterator over "waiting inbox" vqueue entries
By "inbox", the inbox stage of the queue is meant, right?
Yes. I will rename the waiting reader to be the inbox reader, but in a separate commit so as not to disturb the PR stack.
RefillState::Standby { refill_anchor } => {
    *refill_anchor = new_anchor;
}
Can the new anchor be larger than the existing one or can it only shrink?
Both. `update_anchor` can grow the anchor (a refill task completed and discovered new keys), shrink it (cache eviction in enqueue reset it to `items.back()`), or leave it equal. The method is intentionally unconstrained.
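For illustration, here is a minimal, self-contained sketch of those unconstrained anchor semantics. The names (`RefillState`, `update_anchor`) mirror the snippet quoted in this thread, but the field types and surrounding details are assumptions, not the PR's actual code.

```rust
// Hypothetical sketch: the anchor is overwritten unconditionally, so it
// may grow, shrink, or stay equal. Only the Standby state carries one.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum RefillState {
    Standby { refill_anchor: Option<u64> },
    Running,
}

impl RefillState {
    fn update_anchor(&mut self, new_anchor: Option<u64>) {
        if let RefillState::Standby { refill_anchor } = self {
            *refill_anchor = new_anchor;
        }
    }
}

fn main() {
    let mut state = RefillState::Standby { refill_anchor: Some(100) };

    // Grow: a completed refill task discovered new keys past the old anchor.
    state.update_anchor(Some(250));
    assert_eq!(state, RefillState::Standby { refill_anchor: Some(250) });

    // Shrink: cache eviction on enqueue reset the anchor to items.back().
    state.update_anchor(Some(120));
    assert_eq!(state, RefillState::Standby { refill_anchor: Some(120) });

    // Equal: a no-op update is also allowed.
    state.update_anchor(Some(120));
    assert_eq!(state, RefillState::Standby { refill_anchor: Some(120) });
}
```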
// This branch handles when we don't have the item in cache.
//
// The removed item can be:
// - Cached
Didn't we just say that it's not in the cache in the line above?
self.items.pop_back();
self.items.insert(pos, (key, value));
refill_anchor = self.items.back().map(|(k, _)| *k);
break;
Why is it ok to break here if we insert the sorted item at `pos < self.items.len()`? When the algorithm starts, `self.items` contains items that are strictly smaller than what we read from RocksDB and what we have in the overlay, right? When merge-sorting into `self.items`, we should only ever push to the end, since both the storage items and the overlay are sorted. So could we enforce the invariant `assert_eq!(pos, self.items.len())`?
// Insert sorted in cache and ignore it if we already have it.
// If this item pushes us over the cache capacity, then we ignore it and reset
// the refill anchor to it.
let pos = match self.items.binary_search_by_key(&key, |&(k, _)| k) {
Why is a binary search needed here? Wouldn't we always push to the end of `self.items` whatever comes out of `items.into_iter().merge_join_by(overlay)`? Asked differently: what's the scenario where we wouldn't push to the end?
You're right, the `binary_search` is redundant in `poll_refill_task`. Will rewrite.
let head_key = match queue.head() {
    Some(QueueItem::Inbox { key, .. }) => *key,
    _ => panic!("expected inbox head"),
// It's very important is that we must reset the task to standby
- // It's very important is that we must reset the task to standby
+ // It's very important that we must reset the task to standby
/// **Bug demonstration.** When the overlay is at capacity and the back
/// entry is a tombstone, `push_added_item`'s `pop_back` silently drops
/// that tombstone. The merge then lets the (already-deleted) row into
/// the cache.
///
/// Layout at the moment of overflow:
///
/// ```text
/// overlay[0]             = Tombstone(seq=50)  // pre-tombstone
/// overlay[1]             = Tombstone(seq=100) // pre-tombstone
/// overlay[2..CAPACITY-1] = Add(seq=150..)     // CAPACITY-3 adds
/// overlay[CAPACITY-1]    = Tombstone(seq=500) // *r_target*
/// ```
///
/// The trigger add (seq=400) sorts at `pos = CAPACITY-1`, which is
/// `< overlay.len() == CAPACITY`, so `push_added_item` evicts the back
/// (`Tombstone(500)`) and inserts the trigger. After this, the overlay
/// holds only `[T(50), T(100), Add(150..), Add(400)]`; the tombstone
/// for `r_target` is gone.
///
/// Storage is `[r_target=seq500]` (a single row that's been deleted but
/// is still in the in-flight task's snapshot, faithful to the race the
/// invariant doc on `RefillTask` describes). The merge produces:
///
/// ```text
/// [T(50), T(100), Add(150..), Add(400), Left(seq500)]
/// ```
///
/// With the tombstone evicted there's nothing to suppress `Left(seq500)`,
/// so the merge inserts it into the cache. The drain then sees `seq500`,
/// which is the bug.
#[restate_core::test]
async fn tombstone_evicted_on_overlay_overflow_leaks_deleted_row() {
The description and the test method read as if the bug still exists, but I guess the horizon filters out the problematic entry. Maybe add a clarification that this problem is gone.
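For readers following along, here is a minimal sketch of the horizon mechanism under assumed shapes (`merge_with_horizon` is hypothetical; the PR's real merge works on key/value pairs and overlay events). The horizon is an exclusive upper bound set on overlay overflow: storage rows at or above it are dropped from the merge rather than trusted against an overlay that may have lost a tombstone, and are rediscovered on the next refill.

```rust
// Drop storage rows at or above the horizon, then merge in the overlay's
// surviving adds. Keys stand in for full queue entries.
fn merge_with_horizon(
    storage: Vec<u64>,
    overlay_adds: Vec<u64>,
    horizon: Option<u64>,
) -> Vec<u64> {
    let mut out: Vec<u64> = storage
        .into_iter()
        // Rows at or above the horizon will be rediscovered, with fresh
        // tombstone information, on the next refill.
        .filter(|&k| horizon.map_or(true, |h| k < h))
        .chain(overlay_adds)
        .collect();
    out.sort_unstable();
    out
}

fn main() {
    // r_target (seq 500) was deleted, but its tombstone was evicted from
    // the overlay and the in-flight snapshot still contains the row.
    let storage = vec![500];
    let overlay_adds = vec![150, 400];

    // Without a horizon, the deleted row leaks into the cache:
    assert_eq!(
        merge_with_horizon(storage.clone(), overlay_adds.clone(), None),
        vec![150, 400, 500]
    );

    // With the horizon at the evicted tombstone's key, it does not:
    assert_eq!(
        merge_with_horizon(storage, overlay_adds, Some(500)),
        vec![150, 400]
    );
}
```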
@tillrohrmann To answer your main question: no, I haven't run a comparative analysis against tailing iterators, but we're steering away from them for correctness reasons, so we have no real alternative.
Replaces the single-head cache with a sorted 24-entry per-queue cache.
Refills run via tokio::task::spawn_blocking when data isn't in the
block cache; in-flight notify_enqueued / notify_removed events are
buffered as an overlay (Add / Tombstone) and merged on completion.
On overlay overflow we set a horizon (exclusive upper bound) instead
of back-eviction: a popped tombstone could otherwise re-admit a
deleted row. Storage rows at or above the horizon are dropped from
the merge and rediscovered on the next refill.
Drops the tailing iterator and its workarounds. Splits VQueueCursor
into inbox (returns CursorError; WouldBlock under non-blocking opts)
and VQueueRunningCursor (sync).
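As a rough illustration of the overlay described above (a sketch with assumed names; `OverlayOp` and `apply_overlay` are hypothetical stand-ins for the PR's internals): while a refill's blocking read is in flight, enqueue/remove notifications are buffered as `Add`/`Tombstone` events and merged once it completes.

```rust
// Events buffered while the spawn_blocking read runs.
#[derive(Clone)]
enum OverlayOp {
    Add(&'static str), // notify_enqueued
    Tombstone,         // notify_removed
}

// Apply the buffered overlay to the snapshot the blocking read produced.
fn apply_overlay(
    snapshot: Vec<(u64, &'static str)>,
    overlay: Vec<(u64, OverlayOp)>,
) -> Vec<(u64, &'static str)> {
    let mut merged = snapshot;
    for (key, op) in overlay {
        match op {
            // Insert sorted, ignoring keys the snapshot already saw.
            OverlayOp::Add(v) => {
                if let Err(pos) = merged.binary_search_by_key(&key, |&(k, _)| k) {
                    merged.insert(pos, (key, v));
                }
            }
            // Suppress the row even if the snapshot still contains it.
            OverlayOp::Tombstone => {
                if let Ok(pos) = merged.binary_search_by_key(&key, |&(k, _)| k) {
                    merged.remove(pos);
                }
            }
        }
    }
    merged
}

fn main() {
    let snapshot = vec![(10, "a"), (20, "b")]; // what the blocking read saw
    let overlay = vec![
        (20, OverlayOp::Tombstone),  // removed mid-refill
        (30, OverlayOp::Add("c")),   // enqueued mid-refill
    ];
    assert_eq!(apply_overlay(snapshot, overlay), vec![(10, "a"), (30, "c")]);
}
```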
Stack created with Sapling. Best reviewed with ReviewStack.