Skip to content

ts_runtime: fix logic for derp home region change#242

Open
dylan-tailscale wants to merge 2 commits into
mainfrom
dylan/multiderp-control-restart
Open

ts_runtime: fix logic for derp home region change#242
dylan-tailscale wants to merge 2 commits into
mainfrom
dylan/multiderp-control-restart

Conversation

@dylan-tailscale

Copy link
Copy Markdown
Collaborator

Fixes logic in Multiderp to properly deselect the current home region (if any) before selecting the new home region, and adds some resilience around missing regions in the map.

This was root cause of #26. We would perform DERP latency testing and correctly re-report the home region to the control plane when re-connecting to the netmap stream, whether the home region had changed or not. However, in Multiderp's message handler for DerpLatencyMeasurement when the home region was already set and hadn't changed, we would incorrectly de-select the existing DERP home region without ever re-selecting it. I had discovered this awhile ago, but missed it as the root cause because it was masked by the bug fixed in #239.

AI disclosure: I used Claude Code to help diagnose/prototype a fix, but rewrote all the code by hand.

Closes #26.

Fixes logic in `Multiderp` to properly deselect the current home region (if any) before selecting the new home region, and adds some resilience around missing regions in the map.

Signed-off-by: Dylan Bargatze <dylan@tailscale.com>
Comment thread ts_runtime/src/multiderp.rs Outdated
Comment on lines +335 to +346
// We now know the home derp region has changed. Sanity check - ensure that both the current
// and new home derp regions exist in the region map. Doing this ahead of time and exiting
// early avoids nasty unwind logic if we detect a missing region when we try to change it.
if !self.derps.contains_key(&new_region_id) {
tracing::error!(%new_region_id, "new home derp region missing in map, not updating");
return;
} else if let Some(cur_region_id) = self.current_home_derp
&& !self.derps.contains_key(&cur_region_id)
{
tracing::error!(%cur_region_id, "current home derp region missing in map, not updating");
return;
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to a cleaner approach here, if anyone has ideas. Without the sanity checks, in the case where the new region isn't in self.derps, the code below would have to get the current region again with self.derps.get_mut() to call .send_replace(true) and say "just kidding, you actually are the home derp region again, because the new region doesn't exist in the region map." This also runs into some problems with mutable borrows of self.derps and the lifetimes of the new/current regions.

All that said, it irks me that we're calling self.derps.contains_key() just to go ahead and get the regions with get_mut().unwrap(); missing regions should be rare/a bug.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it could be done with a couple of let Some(...) = self.derps.get[_mut](...) else { ... };, but maybe that causes multiple borrow errors?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's exactly what I ran into - multiple mut borrow of self.derps. I didn't play with ordering/scoping too much though, I definitely might have missed a better way that avoids the multiple borrows.

@npry npry left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the whole diff that's required to fix the issue?

diff --git a/ts_runtime/src/multiderp.rs b/ts_runtime/src/multiderp.rs
index 4883c44f34c..6eb65f0f6ae 100644
--- a/ts_runtime/src/multiderp.rs
+++ b/ts_runtime/src/multiderp.rs
@@ -325,7 +325,9 @@ impl Message<DerpLatencyMeasurement> for Multiderp {
             return;
         };
 
-        if let Some(home_derp) = self.current_home_derp {
+        if let Some(home_derp) = self.current_home_derp
+            && home_derp != result.id
+        {
             self.derps
                 .get_mut(&home_derp)
                 .unwrap()

It's not correct to drop the message if self.derps doesn't contain the region, the measurement is still valid and has dictated what was reported to control. self.current_home_derp is a dispositional field: "this is what is supposed to be our home derp", it's not a record of which running task is handling the home derp server if it exists. This message handler always updates that field to be consistent with our most up-to-date view of what's been reported to control and notifies the old/new home-derp tasks if each exists, the ts_control::StateUpdate handler is meant to handle reconciling task state with the derp map. I can see how this is hard to reason about though and I'm happy to add more documentation to clarify this

Signed-off-by: Dylan Bargatze <dylan@tailscale.com>
@dylan-tailscale

dylan-tailscale commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Isn't this the whole diff that's required to fix the issue?

Yes, that fixes the .send_replace(false) without a corresponding .send_replace(true) issue; however, it doesn't fix the .unwrap()s potentially causing a panic if self.derps doesn't contain the current region (very unlikely, or it would have panicked earlier), or the new region. Most of this PR is a gold-plated fix for missing regions in self.derps plus debug logging for "we checked latency but home region didn't change" - albeit yeah, quite verbose.

If you're sure self.derps.get_mut(...).unwrap() can't panic in any case, happy to drop all of this in favor of your suggestion!

It's not correct to drop the message if self.derps doesn't contain the region, the measurement is still valid and has dictated what was reported to control. self.current_home_derp is a dispositional field: "this is what is supposed to be our home derp", it's not a record of which running task is handling the home derp server if it exists. This message handler always updates that field to be consistent with our most up-to-date view of what's been reported to control and notifies the old/new home-derp tasks if each exists, the ts_control::StateUpdate handler is meant to handle reconciling task state with the derp map. I can see how this is hard to reason about though and I'm happy to add more documentation to clarify this

Ah okay, that tracks - I shouldn't have rolled setting self.current_home_derp into the send_replace() handling. Fixed, and added some comments around self.current_derp_home; I left the rest as-is pending discussion on unwrap()s. Also happy for feedback if any of my comments (existing or added) are still missing the mark!

@dylan-tailscale dylan-tailscale requested a review from npry June 17, 2026 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ts_control, ts_transport_derp: DERP connectivity lost after netmap stream restart

3 participants