ts_runtime: fix logic for derp home region change#242
Conversation
Fixes logic in `Multiderp` to properly deselect the current home region (if any) before selecting the new home region, and adds some resilience around missing regions in the map. Signed-off-by: Dylan Bargatze <dylan@tailscale.com>
| // We now know the home derp region has changed. Sanity check - ensure that both the current | ||
| // and new home derp regions exist in the region map. Doing this ahead of time and exiting | ||
| // early avoids nasty unwind logic if we detect a missing region when we try to change it. | ||
| if !self.derps.contains_key(&new_region_id) { | ||
| tracing::error!(%new_region_id, "new home derp region missing in map, not updating"); | ||
| return; | ||
| } else if let Some(cur_region_id) = self.current_home_derp | ||
| && !self.derps.contains_key(&cur_region_id) | ||
| { | ||
| tracing::error!(%cur_region_id, "current home derp region missing in map, not updating"); | ||
| return; | ||
| } |
There was a problem hiding this comment.
I'm open to a cleaner approach here, if anyone has ideas. Without the sanity checks, in the case where the new region isn't in self.derps, the code below would have to get the current region again with self.derps.get_mut() to call .send_replace(true) and say "just kidding, you actually are the home derp region again, because the new region doesn't exist in the region map." This also runs into some problems with mutable borrows of self.derps and the lifetimes of the new/current regions.
All that said, it irks me that we're calling self.derps.contains_key() just to go ahead and get the regions with get_mut().unwrap(); missing regions should be rare/a bug.
There was a problem hiding this comment.
This looks like it could be done with a couple of let Some(...) = self.derps.get[_mut](...) else { ... };, but maybe that causes multiple borrow errors?
There was a problem hiding this comment.
Yup, that's exactly what I ran into - multiple mut borrow of self.derps. I didn't play with ordering/scoping too much though, I definitely might have missed a better way that avoids the multiple borrows.
There was a problem hiding this comment.
Isn't this the whole diff that's required to fix the issue?
diff --git a/ts_runtime/src/multiderp.rs b/ts_runtime/src/multiderp.rs
index 4883c44f34c..6eb65f0f6ae 100644
--- a/ts_runtime/src/multiderp.rs
+++ b/ts_runtime/src/multiderp.rs
@@ -325,7 +325,9 @@ impl Message<DerpLatencyMeasurement> for Multiderp {
return;
};
- if let Some(home_derp) = self.current_home_derp {
+ if let Some(home_derp) = self.current_home_derp
+ && home_derp != result.id
+ {
self.derps
.get_mut(&home_derp)
.unwrap()It's not correct to drop the message if self.derps doesn't contain the region, the measurement is still valid and has dictated what was reported to control. self.current_home_derp is a dispositional field: "this is what is supposed to be our home derp", it's not a record of which running task is handling the home derp server if it exists. This message handler always updates that field to be consistent with our most up-to-date view of what's been reported to control and notifies the old/new home-derp tasks if each exists, the ts_control::StateUpdate handler is meant to handle reconciling task state with the derp map. I can see how this is hard to reason about though and I'm happy to add more documentation to clarify this
Signed-off-by: Dylan Bargatze <dylan@tailscale.com>
Yes, that fixes the If you're sure
Ah okay, that tracks - I shouldn't have rolled setting |
Fixes logic in
Multiderpto properly deselect the current home region (if any) before selecting the new home region, and adds some resilience around missing regions in the map.This was root cause of #26. We would perform DERP latency testing and correctly re-report the home region to the control plane when re-connecting to the netmap stream, whether the home region had changed or not. However, in
Multiderp's message handler forDerpLatencyMeasurementwhen the home region was already set and hadn't changed, we would incorrectly de-select the existing DERP home region without ever re-selecting it. I had discovered this awhile ago, but missed it as the root cause because it was masked by the bug fixed in #239.AI disclosure: I used Claude Code to help diagnose/prototype a fix, but rewrote all the code by hand.
Closes #26.