Skip to content

feat(ble): data-path liveness probe — detect + recover from data-dead links (v0.4.0)#41

Open
torlando-tech wants to merge 3 commits into
mainfrom
feat/data-path-liveness-probe
Open

feat(ble): data-path liveness probe — detect + recover from data-dead links (v0.4.0)#41
torlando-tech wants to merge 3 commits into
mainfrom
feat/data-path-liveness-probe

Conversation

@torlando-tech

Copy link
Copy Markdown
Owner

Summary

Adds a data-path liveness probe so the BLE interface detects and recovers from "connected but data-dead" links — where the link layer stays up (and keepalives keep passing) but real data silently stops flowing. Today nothing catches this: the reactive zombie checks only fire on a real disconnect or a write failure, neither of which happens while the link is LL-connected but data-dead.

How it works

  • A 2-byte PING (0x04) / PONG (0x05) round-trip over the real data path (via driver.send). A peer is marked probe-capable on the first PING/PONG seen — auto-negotiated, no version handshake.
  • A periodic sweep PINGs any link with no real data past data_path_probe_interval (the probe is itself the keep-fresh traffic, so a genuinely idle-but-healthy link is never falsely reaped), and reconnects a probe-capable peer whose last real data is stale past data_path_timeout (driver.disconnect -> rediscover).
  • The 2-byte frames are shorter than the 5-byte fragment header, so peers predating the probe reject them as "too short" -> fully backwards compatible.
  • Probe interval / timeout / poll interval are config-tunable (data_path_probe_interval, data_path_timeout, data_path_probe_poll_interval).

Also in this PR

  • BLE_PROTOCOL_v0.4.0.md — formal spec (frame format, capability state machine, config, backwards-compat / dev:-prefix normalization).
  • docs/ble-architecture.md — migrated the BLE architecture explainer in from the app repo and added a data-path-probe section.
  • Version bump 0.2.2 -> 0.3.0 + CHANGELOG entry.

Validation

Validated end-to-end on two Linux/BlueZ peers via fault injection (1-byte keepalives pass, larger fragments dropped): healthy = stable, no idle churn; data-dead = probe fires + reconnects on both peers; cleared = re-peers automatically. Mirrored to the swift and kotlin Reticulum ports.

🤖 Generated with Claude Code

torlando-tech and others added 3 commits June 10, 2026 11:27
…d links

A BLE peer can be connected at the link layer (which keeps idle links up) and
still exchanging keepalives while real data silently fails -- a permanent
deadlock that the reactive zombie check, _validate_spawned_interfaces, and the
keepalive-write-fail reaper all miss, because the LL link is genuinely up. The
peer then stays "connected" forever while no data flows, with no recovery.

Add an active round-trip data-path probe in BLEInterface (protocol v0.4.0):

- A 2-byte PING(0x04)/PONG(0x05) control frame sent over the real data path
  (driver.send), so it fails exactly when real data fails. The frames are
  shorter than the 5-byte fragment header, so peers that predate the probe
  reject them as "too short" and are unaffected.
- A peer is marked probe-capable on the first PING/PONG seen (auto-negotiated,
  no handshake change). Only probe-capable peers are eligible for probe-driven
  reconnect, so older peers are never falsely reaped.
- A periodic loop PINGs any link idle longer than _probe_interval. On a healthy
  link the peer echoes a PONG, refreshing _last_real_data -- the probe is itself
  the traffic that keeps a genuinely idle-but-healthy link from ever looking
  dead, so idle links are never churned.
- If a probe-capable peer's data path is silent past _data_path_timeout, the
  link is data-dead; driver.disconnect() tears it down so it re-establishes.

Intervals are config-tunable: data_path_probe_interval (15s),
data_path_timeout (45s), data_path_probe_poll_interval (10s). Bumps the package
to 0.3.0 with a CHANGELOG entry.

Validated end-to-end on two Linux/BlueZ nodes: a healthy idle link is kept fresh
with zero false reaps; a forced data-dead link is detected and recovers
automatically once the data path is restored.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Formal spec for the v0.4.0 protocol extension: PING(0x04)/PONG(0x05) frame
format, probe state machine (capability negotiation, liveness tracking,
periodic sweep, asymmetric-failure handling), config knobs, backwards
compatibility (incl. the dev:-prefix address normalization requirement), and
per-platform implementation notes. Extends v0.3.0 / v2.2; backwards compatible.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e section

Moves the BLE architecture explainer into ble-reticulum (docs/ble-architecture.md).
Reframes the header as platform-agnostic — the BLEInterface/BLEPeerInterface Python
layer is the lib core; the native driver (Android/Kotlin via Chaquopy, Linux BlueZ,
iOS swift) is pluggable and the Android driver is used as the reference.

Updates for protocol v0.4.0:
- Keepalive Mechanism: clarifies keepalives prove the *link*, not the *data path*.
- New "Data-Path Liveness Probe" section: PING/PONG frame table, sequence diagram,
  capability auto-negotiation, idle-fresh / data-dead-reconnect, tunables.

Android class names (KotlinBLEBridge, BleGattClient, …) retained as the reference
native-driver components. (Deeper genericization / a Columba-side stub TBD.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 62 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/ble_reticulum/BLEInterface.py 0.00% 62 Missing ⚠️

📢 Thoughts on this report? Let us know!

@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds a data-path liveness probe (PING 0x04 / PONG 0x05) to BLEInterface that periodically sends a 2-byte round-trip over the real data path to detect "connected but data-dead" BLE links and force a reconnect. Capability is auto-negotiated and the 2-byte frame is shorter than the fragment header, preserving backwards compatibility with pre-v0.4.0 peers.

  • Core logic (BLEInterface.py): new _send_probe, _handle_probe_frame, _start_probe_timer, and _run_data_path_probes methods; probe timer cancelled in detach(); _probe_capable cleaned up on peer disconnect.
  • Spec + docs (BLE_PROTOCOL_v0.4.0.md, docs/ble-architecture.md): formal protocol spec and updated architecture doc.
  • Version bump (pyproject.toml, CHANGELOG.md): 0.2.2 → 0.3.0.

Confidence Score: 3/5

Two logic defects in the new probe loop should be fixed before merging; the rest of the change looks solid.

The probe timer's finally: _start_probe_timer() block has no self.online guard, unlike the existing cleanup timer which checks if not self.online: return before rescheduling. If detach() fires while _run_data_path_probes is already executing, a new timer is created after detach() already cancelled the old one. Separately, _handle_probe_frame logs 'dropping' for an unmapped address but still falls through to send a PONG, creating an asymmetric liveness view.

src/ble_reticulum/BLEInterface.py — specifically _run_data_path_probes (reschedule guard) and _handle_probe_frame (unmapped-address PONG control flow).

Important Files Changed

Filename Overview
src/ble_reticulum/BLEInterface.py Adds the data-path liveness probe (PING/PONG loop, timer, detach cleanup); two logic bugs: probe timer can escape detach() and PONG reply is sent even for unmapped addresses.
BLE_PROTOCOL_v0.4.0.md New protocol spec documenting the v0.4.0 PING/PONG probe — well-structured and consistent with the implementation.
CHANGELOG.md Adds 0.3.0 changelog entry describing the liveness probe; accurate and complete.
pyproject.toml Version bumped from 0.2.2 to 0.3.0 to reflect the new feature.
docs/ble-architecture.md New architecture doc migrated from app repo with a data-path probe section added; documentation only.

Sequence Diagram

sequenceDiagram
    participant Timer as ProbeTimer (periodic)
    participant Local as BLEInterface (local)
    participant Driver as driver.send / disconnect
    participant Remote as BLEInterface (remote peer)

    Note over Timer,Remote: Healthy idle link

    Timer->>Local: _run_data_path_probes()
    Local->>Local: "idle = now - _last_real_data[peer]"
    alt "idle > _probe_interval AND not data-dead"
        Local->>Driver: send PING
        Driver-->>Remote: 2-byte PING frame
        Remote->>Remote: mark probe_capable, refresh _last_real_data
        Remote->>Driver: send PONG
        Driver-->>Local: 2-byte PONG frame
        Local->>Local: refresh _last_real_data
    end
    Local->>Timer: reschedule via finally

    Note over Timer,Remote: Data-dead link

    Timer->>Local: _run_data_path_probes()
    Local->>Local: "idle > _data_path_timeout AND probe_capable"
    Local->>Driver: disconnect(address)
    Driver-->>Remote: BLE disconnect
    Remote->>Remote: rediscover + reconnect
Loading

Fix All in Claude Code

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
src/ble_reticulum/BLEInterface.py:792-793
**Zombie probe timer after `detach()` — missing `online` guard**

`_run_data_path_probes` unconditionally reschedules itself via `finally: self._start_probe_timer()`, with no `self.online` check. If `detach()` cancels `self._probe_timer` while the loop is already mid-execution, the `finally` block creates a *new* `threading.Timer` that `detach()` never cancels — leaving a background probe loop that keeps allocating daemon timers for the lifetime of the process. The companion `_periodic_cleanup_task` has the exact guard needed (`if not self.online: return`) before it reschedules itself; the probe loop should mirror it.

### Issue 2 of 3
src/ble_reticulum/BLEInterface.py:739-745
**PONG is sent even when the peer address is unmapped — log says "dropping" but the frame is not dropped**

When `peer_identity` lookup fails, the method logs `"dropping"` but then falls through to `if data[0] == self._probe_ping:` and replies with a PONG anyway. Any frame matching the 2-byte probe format received on an unmapped address gets a PONG while leaving `_last_real_data` and `_probe_capable` unmodified for that peer. The result is an asymmetric liveness view: the remote side's timer is refreshed by our PONG (it never detects a dead path toward us), while our side silently fails to track the peer. The PONG send should be placed inside the `if peer_identity:` branch.

### Issue 3 of 3
src/ble_reticulum/BLEInterface.py:778-781
When `idle > _data_path_timeout`, both a PING and a `driver.disconnect()` are triggered in the same sweep iteration. The PING is immediately followed by a disconnect — the PONG will never arrive and the probe traffic is wasted. The probe send should be skipped when a disconnect is already being issued.

```suggestion
                idle = now - self._last_real_data.get(identity_hash, now)
                is_data_dead = self._probe_capable.get(identity_hash) and idle > self._data_path_timeout
                if idle > self._probe_interval and not is_data_dead:
                    self._send_probe(address, self._probe_ping, int(now))
                if is_data_dead:
```

Reviews (1): Last reviewed commit: "docs(ble): migrate BLE architecture expl..." | Re-trigger Greptile

Comment on lines +792 to +793
finally:
self._start_probe_timer()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Zombie probe timer after detach() — missing online guard

_run_data_path_probes unconditionally reschedules itself via finally: self._start_probe_timer(), with no self.online check. If detach() cancels self._probe_timer while the loop is already mid-execution, the finally block creates a new threading.Timer that detach() never cancels — leaving a background probe loop that keeps allocating daemon timers for the lifetime of the process. The companion _periodic_cleanup_task has the exact guard needed (if not self.online: return) before it reschedules itself; the probe loop should mirror it.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ble_reticulum/BLEInterface.py
Line: 792-793

Comment:
**Zombie probe timer after `detach()` — missing `online` guard**

`_run_data_path_probes` unconditionally reschedules itself via `finally: self._start_probe_timer()`, with no `self.online` check. If `detach()` cancels `self._probe_timer` while the loop is already mid-execution, the `finally` block creates a *new* `threading.Timer` that `detach()` never cancels — leaving a background probe loop that keeps allocating daemon timers for the lifetime of the process. The companion `_periodic_cleanup_task` has the exact guard needed (`if not self.online: return`) before it reschedules itself; the probe loop should mirror it.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Claude Code

Comment on lines +739 to +745
else:
RNS.log(f"{self} data-path probe from unmapped address {address}, dropping", RNS.LOG_EXTREME)
if data[0] == self._probe_ping:
self._send_probe(address, self._probe_pong, data[1])
RNS.log(f"{self} data-path PING from {address}, replied PONG", RNS.LOG_EXTREME)
else:
RNS.log(f"{self} data-path PONG from {address}", RNS.LOG_EXTREME)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 PONG is sent even when the peer address is unmapped — log says "dropping" but the frame is not dropped

When peer_identity lookup fails, the method logs "dropping" but then falls through to if data[0] == self._probe_ping: and replies with a PONG anyway. Any frame matching the 2-byte probe format received on an unmapped address gets a PONG while leaving _last_real_data and _probe_capable unmodified for that peer. The result is an asymmetric liveness view: the remote side's timer is refreshed by our PONG (it never detects a dead path toward us), while our side silently fails to track the peer. The PONG send should be placed inside the if peer_identity: branch.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ble_reticulum/BLEInterface.py
Line: 739-745

Comment:
**PONG is sent even when the peer address is unmapped — log says "dropping" but the frame is not dropped**

When `peer_identity` lookup fails, the method logs `"dropping"` but then falls through to `if data[0] == self._probe_ping:` and replies with a PONG anyway. Any frame matching the 2-byte probe format received on an unmapped address gets a PONG while leaving `_last_real_data` and `_probe_capable` unmodified for that peer. The result is an asymmetric liveness view: the remote side's timer is refreshed by our PONG (it never detects a dead path toward us), while our side silently fails to track the peer. The PONG send should be placed inside the `if peer_identity:` branch.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Claude Code

Comment on lines +778 to +781
idle = now - self._last_real_data.get(identity_hash, now)
if idle > self._probe_interval:
self._send_probe(address, self._probe_ping, int(now))
if self._probe_capable.get(identity_hash) and idle > self._data_path_timeout:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 When idle > _data_path_timeout, both a PING and a driver.disconnect() are triggered in the same sweep iteration. The PING is immediately followed by a disconnect — the PONG will never arrive and the probe traffic is wasted. The probe send should be skipped when a disconnect is already being issued.

Suggested change
idle = now - self._last_real_data.get(identity_hash, now)
if idle > self._probe_interval:
self._send_probe(address, self._probe_ping, int(now))
if self._probe_capable.get(identity_hash) and idle > self._data_path_timeout:
idle = now - self._last_real_data.get(identity_hash, now)
is_data_dead = self._probe_capable.get(identity_hash) and idle > self._data_path_timeout
if idle > self._probe_interval and not is_data_dead:
self._send_probe(address, self._probe_ping, int(now))
if is_data_dead:
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ble_reticulum/BLEInterface.py
Line: 778-781

Comment:
When `idle > _data_path_timeout`, both a PING and a `driver.disconnect()` are triggered in the same sweep iteration. The PING is immediately followed by a disconnect — the PONG will never arrive and the probe traffic is wasted. The probe send should be skipped when a disconnect is already being issued.

```suggestion
                idle = now - self._last_real_data.get(identity_hash, now)
                is_data_dead = self._probe_capable.get(identity_hash) and idle > self._data_path_timeout
                if idle > self._probe_interval and not is_data_dead:
                    self._send_probe(address, self._probe_ping, int(now))
                if is_data_dead:
```

How can I resolve this? If you propose a fix, please make it concise.

Fix in Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant