feat(ble): data-path liveness probe — detect + recover from data-dead links (v0.4.0)#41
feat(ble): data-path liveness probe — detect + recover from data-dead links (v0.4.0)#41torlando-tech wants to merge 3 commits into
Conversation
…d links A BLE peer can be connected at the link layer (which keeps idle links up) and still exchanging keepalives while real data silently fails -- a permanent deadlock that the reactive zombie check, _validate_spawned_interfaces, and the keepalive-write-fail reaper all miss, because the LL link is genuinely up. The peer then stays "connected" forever while no data flows, with no recovery. Add an active round-trip data-path probe in BLEInterface (protocol v0.4.0): - A 2-byte PING(0x04)/PONG(0x05) control frame sent over the real data path (driver.send), so it fails exactly when real data fails. The frames are shorter than the 5-byte fragment header, so peers that predate the probe reject them as "too short" and are unaffected. - A peer is marked probe-capable on the first PING/PONG seen (auto-negotiated, no handshake change). Only probe-capable peers are eligible for probe-driven reconnect, so older peers are never falsely reaped. - A periodic loop PINGs any link idle longer than _probe_interval. On a healthy link the peer echoes a PONG, refreshing _last_real_data -- the probe is itself the traffic that keeps a genuinely idle-but-healthy link from ever looking dead, so idle links are never churned. - If a probe-capable peer's data path is silent past _data_path_timeout, the link is data-dead; driver.disconnect() tears it down so it re-establishes. Intervals are config-tunable: data_path_probe_interval (15s), data_path_timeout (45s), data_path_probe_poll_interval (10s). Bumps the package to 0.3.0 with a CHANGELOG entry. Validated end-to-end on two Linux/BlueZ nodes: a healthy idle link is kept fresh with zero false reaps; a forced data-dead link is detected and recovers automatically once the data path is restored. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Formal spec for the v0.4.0 protocol extension: PING(0x04)/PONG(0x05) frame format, probe state machine (capability negotiation, liveness tracking, periodic sweep, asymmetric-failure handling), config knobs, backwards compatibility (incl. the dev:-prefix address normalization requirement), and per-platform implementation notes. Extends v0.3.0 / v2.2; backwards compatible. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e section Moves the BLE architecture explainer into ble-reticulum (docs/ble-architecture.md). Reframes the header as platform-agnostic — the BLEInterface/BLEPeerInterface Python layer is the lib core; the native driver (Android/Kotlin via Chaquopy, Linux BlueZ, iOS swift) is pluggable and the Android driver is used as the reference. Updates for protocol v0.4.0: - Keepalive Mechanism: clarifies keepalives prove the *link*, not the *data path*. - New "Data-Path Liveness Probe" section: PING/PONG frame table, sequence diagram, capability auto-negotiation, idle-fresh / data-dead-reconnect, tunables. Android class names (KotlinBLEBridge, BleGattClient, …) retained as the reference native-driver components. (Deeper genericization / a Columba-side stub TBD.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR adds a data-path liveness probe (PING
Confidence Score: 3/5Two logic defects in the new probe loop should be fixed before merging; the rest of the change looks solid. The probe timer's src/ble_reticulum/BLEInterface.py — specifically Important Files Changed
Sequence DiagramsequenceDiagram
participant Timer as ProbeTimer (periodic)
participant Local as BLEInterface (local)
participant Driver as driver.send / disconnect
participant Remote as BLEInterface (remote peer)
Note over Timer,Remote: Healthy idle link
Timer->>Local: _run_data_path_probes()
Local->>Local: "idle = now - _last_real_data[peer]"
alt "idle > _probe_interval AND not data-dead"
Local->>Driver: send PING
Driver-->>Remote: 2-byte PING frame
Remote->>Remote: mark probe_capable, refresh _last_real_data
Remote->>Driver: send PONG
Driver-->>Local: 2-byte PONG frame
Local->>Local: refresh _last_real_data
end
Local->>Timer: reschedule via finally
Note over Timer,Remote: Data-dead link
Timer->>Local: _run_data_path_probes()
Local->>Local: "idle > _data_path_timeout AND probe_capable"
Local->>Driver: disconnect(address)
Driver-->>Remote: BLE disconnect
Remote->>Remote: rediscover + reconnect
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
src/ble_reticulum/BLEInterface.py:792-793
**Zombie probe timer after `detach()` — missing `online` guard**
`_run_data_path_probes` unconditionally reschedules itself via `finally: self._start_probe_timer()`, with no `self.online` check. If `detach()` cancels `self._probe_timer` while the loop is already mid-execution, the `finally` block creates a *new* `threading.Timer` that `detach()` never cancels — leaving a background probe loop that keeps allocating daemon timers for the lifetime of the process. The companion `_periodic_cleanup_task` has the exact guard needed (`if not self.online: return`) before it reschedules itself; the probe loop should mirror it.
### Issue 2 of 3
src/ble_reticulum/BLEInterface.py:739-745
**PONG is sent even when the peer address is unmapped — log says "dropping" but the frame is not dropped**
When `peer_identity` lookup fails, the method logs `"dropping"` but then falls through to `if data[0] == self._probe_ping:` and replies with a PONG anyway. Any frame matching the 2-byte probe format received on an unmapped address gets a PONG while leaving `_last_real_data` and `_probe_capable` unmodified for that peer. The result is an asymmetric liveness view: the remote side's timer is refreshed by our PONG (it never detects a dead path toward us), while our side silently fails to track the peer. The PONG send should be placed inside the `if peer_identity:` branch.
### Issue 3 of 3
src/ble_reticulum/BLEInterface.py:778-781
When `idle > _data_path_timeout`, both a PING and a `driver.disconnect()` are triggered in the same sweep iteration. The PING is immediately followed by a disconnect — the PONG will never arrive and the probe traffic is wasted. The probe send should be skipped when a disconnect is already being issued.
```suggestion
idle = now - self._last_real_data.get(identity_hash, now)
is_data_dead = self._probe_capable.get(identity_hash) and idle > self._data_path_timeout
if idle > self._probe_interval and not is_data_dead:
self._send_probe(address, self._probe_ping, int(now))
if is_data_dead:
```
Reviews (1): Last reviewed commit: "docs(ble): migrate BLE architecture expl..." | Re-trigger Greptile |
| finally: | ||
| self._start_probe_timer() |
There was a problem hiding this comment.
Zombie probe timer after
detach() — missing online guard
_run_data_path_probes unconditionally reschedules itself via finally: self._start_probe_timer(), with no self.online check. If detach() cancels self._probe_timer while the loop is already mid-execution, the finally block creates a new threading.Timer that detach() never cancels — leaving a background probe loop that keeps allocating daemon timers for the lifetime of the process. The companion _periodic_cleanup_task has the exact guard needed (if not self.online: return) before it reschedules itself; the probe loop should mirror it.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ble_reticulum/BLEInterface.py
Line: 792-793
Comment:
**Zombie probe timer after `detach()` — missing `online` guard**
`_run_data_path_probes` unconditionally reschedules itself via `finally: self._start_probe_timer()`, with no `self.online` check. If `detach()` cancels `self._probe_timer` while the loop is already mid-execution, the `finally` block creates a *new* `threading.Timer` that `detach()` never cancels — leaving a background probe loop that keeps allocating daemon timers for the lifetime of the process. The companion `_periodic_cleanup_task` has the exact guard needed (`if not self.online: return`) before it reschedules itself; the probe loop should mirror it.
How can I resolve this? If you propose a fix, please make it concise.| else: | ||
| RNS.log(f"{self} data-path probe from unmapped address {address}, dropping", RNS.LOG_EXTREME) | ||
| if data[0] == self._probe_ping: | ||
| self._send_probe(address, self._probe_pong, data[1]) | ||
| RNS.log(f"{self} data-path PING from {address}, replied PONG", RNS.LOG_EXTREME) | ||
| else: | ||
| RNS.log(f"{self} data-path PONG from {address}", RNS.LOG_EXTREME) |
There was a problem hiding this comment.
PONG is sent even when the peer address is unmapped — log says "dropping" but the frame is not dropped
When peer_identity lookup fails, the method logs "dropping" but then falls through to if data[0] == self._probe_ping: and replies with a PONG anyway. Any frame matching the 2-byte probe format received on an unmapped address gets a PONG while leaving _last_real_data and _probe_capable unmodified for that peer. The result is an asymmetric liveness view: the remote side's timer is refreshed by our PONG (it never detects a dead path toward us), while our side silently fails to track the peer. The PONG send should be placed inside the if peer_identity: branch.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ble_reticulum/BLEInterface.py
Line: 739-745
Comment:
**PONG is sent even when the peer address is unmapped — log says "dropping" but the frame is not dropped**
When `peer_identity` lookup fails, the method logs `"dropping"` but then falls through to `if data[0] == self._probe_ping:` and replies with a PONG anyway. Any frame matching the 2-byte probe format received on an unmapped address gets a PONG while leaving `_last_real_data` and `_probe_capable` unmodified for that peer. The result is an asymmetric liveness view: the remote side's timer is refreshed by our PONG (it never detects a dead path toward us), while our side silently fails to track the peer. The PONG send should be placed inside the `if peer_identity:` branch.
How can I resolve this? If you propose a fix, please make it concise.| idle = now - self._last_real_data.get(identity_hash, now) | ||
| if idle > self._probe_interval: | ||
| self._send_probe(address, self._probe_ping, int(now)) | ||
| if self._probe_capable.get(identity_hash) and idle > self._data_path_timeout: |
There was a problem hiding this comment.
When
idle > _data_path_timeout, both a PING and a driver.disconnect() are triggered in the same sweep iteration. The PING is immediately followed by a disconnect — the PONG will never arrive and the probe traffic is wasted. The probe send should be skipped when a disconnect is already being issued.
| idle = now - self._last_real_data.get(identity_hash, now) | |
| if idle > self._probe_interval: | |
| self._send_probe(address, self._probe_ping, int(now)) | |
| if self._probe_capable.get(identity_hash) and idle > self._data_path_timeout: | |
| idle = now - self._last_real_data.get(identity_hash, now) | |
| is_data_dead = self._probe_capable.get(identity_hash) and idle > self._data_path_timeout | |
| if idle > self._probe_interval and not is_data_dead: | |
| self._send_probe(address, self._probe_ping, int(now)) | |
| if is_data_dead: |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ble_reticulum/BLEInterface.py
Line: 778-781
Comment:
When `idle > _data_path_timeout`, both a PING and a `driver.disconnect()` are triggered in the same sweep iteration. The PING is immediately followed by a disconnect — the PONG will never arrive and the probe traffic is wasted. The probe send should be skipped when a disconnect is already being issued.
```suggestion
idle = now - self._last_real_data.get(identity_hash, now)
is_data_dead = self._probe_capable.get(identity_hash) and idle > self._data_path_timeout
if idle > self._probe_interval and not is_data_dead:
self._send_probe(address, self._probe_ping, int(now))
if is_data_dead:
```
How can I resolve this? If you propose a fix, please make it concise.
Summary
Adds a data-path liveness probe so the BLE interface detects and recovers from "connected but data-dead" links — where the link layer stays up (and keepalives keep passing) but real data silently stops flowing. Today nothing catches this: the reactive zombie checks only fire on a real disconnect or a write failure, neither of which happens while the link is LL-connected but data-dead.
How it works
PING (0x04)/PONG (0x05)round-trip over the real data path (viadriver.send). A peer is marked probe-capable on the first PING/PONG seen — auto-negotiated, no version handshake.data_path_probe_interval(the probe is itself the keep-fresh traffic, so a genuinely idle-but-healthy link is never falsely reaped), and reconnects a probe-capable peer whose last real data is stale pastdata_path_timeout(driver.disconnect-> rediscover).data_path_probe_interval,data_path_timeout,data_path_probe_poll_interval).Also in this PR
BLE_PROTOCOL_v0.4.0.md— formal spec (frame format, capability state machine, config, backwards-compat /dev:-prefix normalization).docs/ble-architecture.md— migrated the BLE architecture explainer in from the app repo and added a data-path-probe section.0.2.2->0.3.0+ CHANGELOG entry.Validation
Validated end-to-end on two Linux/BlueZ peers via fault injection (1-byte keepalives pass, larger fragments dropped): healthy = stable, no idle churn; data-dead = probe fires + reconnects on both peers; cleared = re-peers automatically. Mirrored to the swift and kotlin Reticulum ports.
🤖 Generated with Claude Code