v4.0: xdp: switch to lpm trie for routing (backport of #11465) #11538

Open
mergify[bot] wants to merge 1 commit into v4.0 from mergify/bp/v4.0/pr-11465

@mergify mergify bot commented Mar 25, 2026

The initial routing implementation in agave-xdp assumed a static routing table with only a handful of entries (one per interface + default).

DZ breaks that assumption by installing hundreds of routes (~550 on mnb today). As a result, routing does not work at all when DZ is on: the XDP thread is pegged at 100% CPU doing route lookups, which leads to skipped blocks and the XDP channel filling up.

This PR fixes the issue by reworking routing and implementing a 4-bit (nibble) trie for lookups.
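
For readers unfamiliar with the data structure, here is a minimal sketch of a nibble trie doing longest-prefix match on IPv4 addresses. It is not the code in `xdp/src/lpm.rs`: the names `NibbleTrie`, `Node` and `nibble` are made up for illustration, the trie is uncompressed and mutable (the PR describes a compressed, immutable one), and prefix lengths that are not a multiple of 4 are handled by filling every slot the prefix covers in the last node.

```rust
use std::net::Ipv4Addr;

/// One trie level: 16 child pointers plus 16 route slots, one per nibble value.
struct Node<V> {
    children: [Option<Box<Node<V>>>; 16],
    slots: [Option<(u8, V)>; 16], // (prefix length, value)
}

impl<V> Node<V> {
    fn new() -> Self {
        Node { children: Default::default(), slots: Default::default() }
    }
}

/// Extract the i-th nibble (0 = most significant) of a 32-bit address.
fn nibble(bits: u32, i: usize) -> usize {
    ((bits >> (28 - 4 * i)) & 0xF) as usize
}

/// Hypothetical, uncompressed nibble trie (illustration only).
pub struct NibbleTrie<V> {
    root: Node<V>,
}

impl<V: Clone> NibbleTrie<V> {
    pub fn new() -> Self {
        NibbleTrie { root: Node::new() }
    }

    pub fn insert(&mut self, prefix: Ipv4Addr, len: u8, value: V) {
        assert!(len <= 32, "IPv4 prefix length must be 0..=32");
        let bits = u32::from(prefix);
        // Number of whole nibbles to walk before reaching the node that stores the route.
        let depth = (len.saturating_sub(1) / 4) as usize;
        let mut node = &mut self.root;
        for i in 0..depth {
            let child = node.children[nibble(bits, i)].get_or_insert_with(|| Box::new(Node::new()));
            node = &mut **child;
        }
        // The prefix fixes `rem` bits of the last nibble, so it covers 2^(4 - rem) slots.
        let rem = len as usize - 4 * depth;
        let span = 1usize << (4 - rem);
        let base = nibble(bits, depth) & !(span - 1);
        for i in base..base + span {
            // Never let a shorter prefix overwrite a longer one in the same slot.
            let keep_existing = matches!(&node.slots[i], Some((l, _)) if *l > len);
            if !keep_existing {
                node.slots[i] = Some((len, value.clone()));
            }
        }
    }

    /// At most 8 steps per lookup, regardless of how many routes are installed.
    pub fn lookup(&self, addr: Ipv4Addr) -> Option<&V> {
        let bits = u32::from(addr);
        let mut node = &self.root;
        let mut best = None;
        for i in 0..8 {
            let nib = nibble(bits, i);
            // Deeper slots always hold longer prefixes, so the last hit wins.
            if let Some((_, v)) = &node.slots[nib] {
                best = Some(v);
            }
            match node.children[nib].as_deref() {
                Some(child) => node = child,
                None => break,
            }
        }
        best
    }
}
```

The point of the structure is that lookup cost is bounded by the address width (8 nibbles for IPv4) rather than by the number of installed routes, which is what matters once the table grows from a handful of entries to hundreds.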

before

[Screenshot 2026-03-22 at 9:28:12 pm]

after

[Screenshot 2026-03-22 at 9:26:38 pm]

This is an automatic backport of pull request #11465 done by [Mergify](https://mergify.com).

@mergify mergify bot requested a review from a team as a code owner March 25, 2026 07:27
@mergify mergify bot added the conflicts label Mar 25, 2026

mergify bot commented Mar 25, 2026

Cherry-pick of da1dcd7 has failed:

On branch mergify/bp/v4.0/pr-11465
Your branch is up to date with 'origin/v4.0'.

You are currently cherry-picking commit da1dcd767.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   xdp/src/lib.rs
	new file:   xdp/src/lpm.rs
	modified:   xdp/src/netlink.rs
	modified:   xdp/src/route.rs
	modified:   xdp/src/route_monitor.rs

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   xdp/src/xdp_retransmitter.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

* xdp: refactor router

Split RouterTables from Router. Make RouteMonitor update RouterTables
and then periodically build a Router from a snapshot of RouterTables.

This is a preparatory change before introducing a prebuilt radix tree
for routing.

* xdp: store Route<T> in router

Instead of storing RouteEntry, store a specialized Route<T> where T is
the address type - v4 or v6. Implement only v4 for now.

This allows moving address family branching to build time instead of
lookup time. It also reduces the memory used by each entry, making lookup
more cache friendly.

This change removes ipv6 routing which was only half working anyway.

* xdp: add simple (but fast) longest prefix router

Switch the router to use a compressed nibble radix tree. The tree is
immutable and optimized for lookup speed. It is periodically rebuilt
when the routing table changes.

* xdp: explicitly use the main routing table only

It was already the case that we only supported the main table. This
makes it explicit and adds the necessary plumbing to support other
tables in the future.
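
To illustrate the `Route<T>` commit in the list above, here is a rough sketch of the idea of branching on address family once at build time. Only the name `Route<T>` comes from the commit message; the fields, the `RawRoute` input type and `build_v4` are assumptions made up for this example and are not taken from `xdp/src/route.rs`.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Hypothetical family-agnostic entry, e.g. as parsed from netlink.
pub struct RawRoute {
    pub dst: IpAddr,
    pub prefix_len: u8,
    pub if_index: u32,
}

/// Compact per-family route; `T` is the address type (fields are assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Route<T> {
    pub dst: T,
    pub prefix_len: u8,
    pub if_index: u32,
}

/// The address-family match happens here, once per rebuild,
/// so the per-packet lookup only ever sees `Route<Ipv4Addr>`.
pub fn build_v4(raw: &[RawRoute]) -> Vec<Route<Ipv4Addr>> {
    raw.iter()
        .filter_map(|r| match r.dst {
            IpAddr::V4(dst) => Some(Route {
                dst,
                prefix_len: r.prefix_len,
                if_index: r.if_index,
            }),
            // IPv6 entries are dropped, mirroring the v4-only scope of the commit.
            IpAddr::V6(_) => None,
        })
        .collect()
}
```

Because `IpAddr` has to be large enough for an IPv6 address, a `Route<Ipv4Addr>` in this sketch is smaller than a `RawRoute`, which is the cache-friendliness argument the commit message makes.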
@alessandrod alessandrod force-pushed the mergify/bp/v4.0/pr-11465 branch from 767dea1 to 1c92675 Compare March 25, 2026 08:43
@codecov-commenter

Codecov Report

❌ Patch coverage is 87.82490% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.9%. Comparing base (ca034bd) to head (1c92675).

Additional details and impacted files
@@            Coverage Diff            @@
##             v4.0   #11538     +/-   ##
=========================================
- Coverage    83.0%    82.9%   -0.1%     
=========================================
  Files         838      839      +1     
  Lines      316793   317165    +372     
=========================================
- Hits       263229   263223      -6     
- Misses      53564    53942    +378     

@gregcusack gregcusack self-requested a review March 25, 2026 14:44

@gregcusack gregcusack left a comment

lgtm!

}
atomic_router.store(Arc::new(router));
self.last_publish = Instant::now();
self.dirty = false;

should we conditionally set this false on success only?

same as below, from_tables essentially never fails.

Even if it failed, from_tables(tables) with the same tables would fail in exactly the same way, unless the tables change, in which case dirty=true.
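
A small sketch of the publish logic under discussion may help. It is an illustration only: `Tables`, `Snapshot`, `Publisher` and `min_interval` are hypothetical names, and `RwLock<Arc<_>>` stands in for whatever atomic pointer the real `atomic_router` uses. The dirty flag is cleared whether or not the build succeeds, because retrying with unchanged tables would produce the same result, and any change to the tables sets it again.

```rust
use std::sync::{Arc, RwLock};
use std::time::{Duration, Instant};

/// Hypothetical mutable tables, updated as route changes arrive.
#[derive(Clone, Default)]
pub struct Tables {
    pub routes: Vec<(u32, u8)>,
}

/// Hypothetical immutable lookup structure built from a snapshot of the tables.
pub struct Snapshot {
    pub route_count: usize,
}

impl Snapshot {
    pub fn from_tables(tables: Tables) -> Result<Self, String> {
        Ok(Snapshot { route_count: tables.routes.len() })
    }
}

pub struct Publisher {
    tables: Tables,
    dirty: bool,
    last_publish: Instant,
    min_interval: Duration, // assumed rate limit between rebuilds
}

impl Publisher {
    pub fn new(min_interval: Duration) -> Self {
        Publisher {
            tables: Tables::default(),
            dirty: false,
            last_publish: Instant::now(),
            min_interval,
        }
    }

    /// Any mutation of the tables marks them dirty.
    pub fn add_route(&mut self, prefix: u32, len: u8) {
        self.tables.routes.push((prefix, len));
        self.dirty = true;
    }

    /// Returns true if a rebuild was attempted.
    pub fn maybe_publish(&mut self, shared: &RwLock<Arc<Snapshot>>) -> bool {
        if !self.dirty || self.last_publish.elapsed() < self.min_interval {
            return false;
        }
        if let Ok(snapshot) = Snapshot::from_tables(self.tables.clone()) {
            *shared.write().unwrap() = Arc::new(snapshot);
        }
        // Cleared unconditionally: the same tables would fail the same way.
        self.last_publish = Instant::now();
        self.dirty = false;
        true
    }
}
```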

Comment on lines +169 to +170
.expect("error creating RoutingTables");
let router = Router::from_tables(tables.clone()).expect("error creating Router");

I'm afraid

this can never fail

inside from_tables we do

        router.cached_default_route = match router.default_route() {
            Ok(hop) => Some(hop),
            Err(RouteError::NoRouteFound(_)) => None,
            Err(e) => return Err(io::Error::other(e)),
        };

        Ok(router)

and default_route() fails only with NoRouteFound.

I'll rework the errors in master so this gets less scary but yeah it's effectively an impossible condition.
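
To make the argument concrete, here is a self-contained sketch of the same match with stand-in types. `SketchRouter`, `from_default_hop` and the `Unsupported` variant are hypothetical; only the shape of the match and the `NoRouteFound` variant come from the snippet above. A missing default route yields `Ok` with nothing cached, and the `io::Error` arm is only reachable if `default_route()` can return some other variant.

```rust
use std::fmt;
use std::io;
use std::net::Ipv4Addr;

#[derive(Debug)]
pub enum RouteError {
    NoRouteFound(Ipv4Addr),
    // Hypothetical extra variant; never returned by default_route() below.
    Unsupported,
}

impl fmt::Display for RouteError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}", self)
    }
}

impl std::error::Error for RouteError {}

/// Stand-in for the real router: it only knows an optional default next hop.
pub struct SketchRouter {
    default_hop: Option<Ipv4Addr>,
    pub cached_default_route: Option<Ipv4Addr>,
}

impl SketchRouter {
    /// Fails only with NoRouteFound, as described for the real default_route().
    fn default_route(&self) -> Result<Ipv4Addr, RouteError> {
        self.default_hop
            .ok_or(RouteError::NoRouteFound(Ipv4Addr::UNSPECIFIED))
    }

    pub fn from_default_hop(default_hop: Option<Ipv4Addr>) -> io::Result<Self> {
        let mut router = SketchRouter { default_hop, cached_default_route: None };
        router.cached_default_route = match router.default_route() {
            Ok(hop) => Some(hop),
            Err(RouteError::NoRouteFound(_)) => None,
            // Unreachable in practice given default_route() above.
            Err(e) => return Err(io::Error::other(e)),
        };
        Ok(router)
    }
}
```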
