diff --git a/BLE_PROTOCOL_v0.4.0.md b/BLE_PROTOCOL_v0.4.0.md new file mode 100644 index 0000000..5e7d027 --- /dev/null +++ b/BLE_PROTOCOL_v0.4.0.md @@ -0,0 +1,182 @@ +# BLE-Reticulum Protocol Specification v0.4.0 + +**Version**: 0.4.0 +**Date**: June 2026 +**Status**: Draft +**Backwards Compatible With**: v0.3.0, v2.2 + +## 1. Overview + +This document specifies the v0.4.0 extension to the BLE-Reticulum protocol. This +version adds a **data-path liveness probe** to detect and recover from "connected +but data-dead" BLE links. + +### 1.1 Problem Statement + +A BLE connection can remain established at the link layer while application data +silently stops flowing: + +- The Bluetooth **link layer keeps an idle connection alive** indefinitely (empty + PDUs at each connection event). It only drops on supervision timeout — i.e. radio + loss — not on application silence. +- Under RF degradation a link can pass small writes (such as the 1-byte keepalive + used to defeat Android's app-inactivity timeout) while **larger data fragments + fail**: keepalives succeed, real data does not. + +In this state the link is genuinely "up", so every existing liveness mechanism +misses it: + +- the reactive zombie check (`_last_real_data`) is only consulted when a *new* + connection arrives — it is never swept; +- `_validate_spawned_interfaces` reconciles against the driver's connected-peer + set, which still lists the peer; +- the keepalive-write-failure reaper never fires, because keepalive writes still + succeed. + +The peer therefore stays "connected" forever while no data flows, with no detection +and no recovery — a permanent deadlock. (Empirically reproduced between two +Linux/BlueZ nodes.) + +### 1.2 Solution + +v0.4.0 introduces an **active round-trip probe over the real data path**. Each node +periodically sends a small `PING` that the peer echoes as a `PONG`. Because the +probe traverses the same data path as real fragments, it fails exactly when real +data fails. 
A link that round-trips the probe is proven alive; a link that stops +round-tripping it while still connected at the link layer is data-dead and is torn +down so it re-establishes. + +Crucially the probe **is** the keep-fresh traffic: a genuinely idle-but-healthy +link is kept alive by the probe's own round-trips, so idle links are never falsely +reaped. + +## 2. Frame Format + +v0.4.0 defines two new 2-byte control frames, sent on the same RX characteristic / +notification path as data fragments: + +| Frame | Byte 0 (type) | Byte 1 | Meaning | +|-------|---------------|---------|----------------------------------------| +| PING | `0x04` | nonce | Liveness request | +| PONG | `0x05` | nonce | Liveness reply (echoes the PING nonce) | + +The `nonce` is an opaque 1-byte value chosen by the sender; the responder copies it +verbatim into the PONG. It exists for future round-trip correlation and is not +currently interpreted. + +These type bytes do not collide with the fragment header (`0x01`=START, +`0x02`=CONTINUE, `0x03`=END) or the 1-byte `0x00` keepalive. + +## 3. Probe State Machine + +State is tracked **per peer, keyed by stable identity** (not by BLE address, which +rotates). + +### 3.1 Capability Negotiation + +A peer is considered **probe-capable** once a PING or PONG has been received from +it. No handshake change is required — capability is inferred from observed probe +traffic. Peers that never emit probe frames (pre-v0.4.0) are never marked capable. + +### 3.2 Liveness Tracking + +Receiving any inbound traffic that proves the data path — a real data fragment, a +PING, or a PONG — updates the peer's `last_real_data` timestamp. The 1-byte +keepalive does **not**, by design: it proves only the link, not the data path. + +### 3.3 Periodic Sweep + +Every `data_path_probe_poll_interval`, for each established peer: + +1. If the link has had no real data for longer than `data_path_probe_interval`, + send a PING. 
A healthy peer echoes a PONG, refreshing `last_real_data`. +2. If the peer is probe-capable **and** `last_real_data` is older than + `data_path_timeout`, the data path is dead: disconnect the peer + (`driver.disconnect`) so the connection re-establishes and re-handshakes. + +A non-probe-capable peer is never reaped by this mechanism; it falls through to the +existing reactive checks. + +### 3.4 PING Handling + +On receiving a PING, a node immediately replies with a PONG echoing the nonce, then +treats the inbound PING itself as proof of data-path liveness. + +### 3.5 Asymmetric Failures + +Because both peers probe independently, each detects the death of its own **inbound** +direction (it stops receiving the other's PINGs/PONGs). If only A→B fails, B sees no +inbound from A, declares the path dead, and reconnects — re-establishing both +directions. One side detecting is sufficient. + +## 4. Configuration + +| Key | Default | Meaning | +|----------------------------------|---------|--------------------------------------------------| +| `data_path_probe_interval` | 15 s | PING a link that has had no real data this long | +| `data_path_timeout` | 45 s | Reconnect a probe-capable peer silent this long | +| `data_path_probe_poll_interval` | 10 s | How often the sweep runs | + +The defaults give roughly three probe attempts before a reconnect and keep an idle +link refreshed well inside the timeout. + +## 5. Backwards Compatibility + +### 5.1 Compatibility Matrix + +| Peers | Behavior | +|--------------------|--------------------------------------------------------------------------------| +| v0.4.0 ↔ v0.4.0 | Full probe + data-dead recovery, both directions. | +| v0.4.0 ↔ older | v0.4.0 still PINGs; the 2-byte frame is shorter than the 5-byte fragment header, so the older peer's reassembler rejects it as "too short" and ignores it. The older peer never replies, never becomes probe-capable, and is never reaped by the probe — it retains pre-v0.4.0 behavior. 
| + +The probe is therefore safe to deploy incrementally. + +### 5.2 Address Normalization + +On a dual-role (connection-collision) link a peer may deliver a frame under its +`dev:`-prefixed peripheral address while its identity was learned under the plain +MAC via the central-path handshake. Implementations **MUST** normalize (strip the +`dev:` prefix, and try both forms) when resolving a probe frame's identity, or the +frame will fail to attribute and capability will never be established. + +## 6. GATT Service (Unchanged from v2.2) + +The probe reuses the existing RX characteristic (central → peripheral write) and the +notification path (peripheral → central). No new characteristics are added. + +## 7. Implementation Notes + +### 7.1 Python (BlueZ/Bleak) — reference + +`BLEInterface.py`: `_send_probe`, `_handle_probe_frame`, and `_run_data_path_probes` +on a `threading.Timer`. Probe frames are intercepted immediately after the keepalive +filter in both the central (`_handle_ble_data`) and peripheral receive paths, before +reassembly. + +### 7.2 Android (Kotlin driver) + +Android Columba bundles this `BLEInterface.py` via Chaquopy, so it inherits the probe +unchanged. The Kotlin driver must deliver 2-byte writes/notifications unfragmented +(it already does for the 1-byte keepalive). + +### 7.3 swift (CoreBluetooth) — TODO + +reticulum-swift's `BLEInterface` must mirror the probe, plus handle the +CoreBluetooth-specific case of a probe-driven disconnect of a **peripheral-role** +peer: CoreBluetooth cannot force-disconnect a subscribed central, so the app layer +must drop the central and let it reconnect. + +## 8. 
Version History + +| Version | Date | Change | +|---------|----------|--------------------------------------------------------------------| +| v2.2 | Nov 2025 | Base protocol (MAC sorting, identity handshake, fragmentation, keepalive) | +| v0.3.0 | Dec 2025 | Capability advertisement (peripheral-only devices) | +| v0.4.0 | Jun 2026 | Data-path liveness probe (this document) | + +## 9. References + +- `BLE_PROTOCOL_v2.2.md` — base protocol +- `BLE_PROTOCOL_v0.3.0.md` — capability advertisement +- `docs/ble-architecture.md` — architecture explainer +- `CHANGELOG.md` — 0.3.0 release entry diff --git a/CHANGELOG.md b/CHANGELOG.md index 881a6be..dba6d54 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +## [0.3.0] - 2026-06-10 + +### Added +- **Data-path liveness probe (protocol v0.4.0)** — detects and recovers from "connected + but data-dead" BLE links. A link can stay up at the link layer (which keeps idle + connections alive) and keep passing 1-byte keepalives while larger real data silently + fails; the existing reactive zombie check, `_validate_spawned_interfaces`, and the + keepalive-write-fail reaper all miss this because the link is genuinely up. The probe + sends a 2-byte `PING`(0x04)/`PONG`(0x05) round-trip over the real data path: a healthy + idle link is kept fresh by the probe itself (no churn), while a probe-capable peer + whose data path goes silent past `data_path_timeout` is torn down so it reconnects. + Capability is auto-negotiated (a peer becomes probe-capable on its first PING/PONG); + the 2-byte frames are shorter than the fragment header so older peers reject them + harmlessly. New config keys: `data_path_probe_interval` (default 15s), + `data_path_timeout` (default 45s), `data_path_probe_poll_interval` (default 10s). + Validated end-to-end on two Linux/BlueZ nodes. 
+ ## [0.2.2] - 2025-11-15 ### Added diff --git a/docs/ble-architecture.md b/docs/ble-architecture.md new file mode 100644 index 0000000..4d828ee --- /dev/null +++ b/docs/ble-architecture.md @@ -0,0 +1,803 @@ +# BLE-Reticulum Architecture + +This document describes the Bluetooth Low Energy (BLE) architecture of `ble-reticulum` — the +`RNS.Interface` that carries Reticulum traffic over BLE. The protocol logic and the +`BLEInterface` / `BLEPeerInterface` Python layer are **platform-agnostic**; the native +**driver** beneath them is pluggable (the `BLEDriverInterface` contract, see +`REFACTORING_GUIDE.md`). + +This document uses the **Android driver** (Columba's Chaquopy → Kotlin bridge) as the +reference for the native layer, because it is the most fully featured. The Linux reference +driver (`linux_bluetooth_driver.py`, BlueZ/Bleak) and an iOS/swift driver implement the same +contract. Where you see Android class names below (`KotlinBLEBridge`, `BleGattClient`, +`BleScanner`, …), read them as "the native driver's component". + +> The normative wire protocol lives in `BLE_PROTOCOL_v2.2.md` (base), +> `BLE_PROTOCOL_v0.3.0.md` (capability advertisement) and `BLE_PROTOCOL_v0.4.0.md` +> (data-path liveness probe). This file is the architectural companion to those specs. + +## Architecture Overview + +The BLE implementation follows a layered architecture with clear separation of concerns: + +```mermaid +flowchart TB + subgraph Python["Python Layer (ble-reticulum)"] + BLEInterface["BLEInterface
Protocol handler, fragmentation,
peer lifecycle"] + BLEPeerInterface["BLEPeerInterface
Per-peer Reticulum interface"] + AndroidDriver["AndroidBLEDriver
Chaquopy bridge to Kotlin"] + end + + subgraph Kotlin["Kotlin Native Layer"] + Bridge["KotlinBLEBridge
Main entry point,
PeerInfo tracking,
deduplication"] + Scanner["BleScanner
Adaptive intervals,
service filtering"] + Advertiser["BleAdvertiser
Identity naming,
proactive refresh"] + GattClient["BleGattClient
Central mode,
4-step handshake"] + GattServer["BleGattServer
Peripheral mode,
GATT service"] + OpQueue["BleOperationQueue
Serialized GATT ops"] + end + + subgraph Android["Android BLE Stack"] + BluetoothAdapter["BluetoothAdapter"] + BluetoothLeScanner["BluetoothLeScanner"] + BluetoothLeAdvertiser["BluetoothLeAdvertiser"] + BluetoothGatt["BluetoothGatt"] + BluetoothGattServer["BluetoothGattServer"] + end + + BLEInterface --> BLEPeerInterface + BLEInterface --> AndroidDriver + AndroidDriver -->|Chaquopy| Bridge + Bridge --> Scanner + Bridge --> Advertiser + Bridge --> GattClient + Bridge --> GattServer + GattClient --> OpQueue + Scanner --> BluetoothLeScanner + Advertiser --> BluetoothLeAdvertiser + GattClient --> BluetoothGatt + GattServer --> BluetoothGattServer +``` + +### Layer Responsibilities + +| Layer | Component | Responsibility | +|-------|-----------|----------------| +| Python | `BLEInterface` | Reticulum interface, packet fragmentation/reassembly, peer lifecycle | +| Python | `BLEPeerInterface` | Per-peer Reticulum routing interface | +| Python | `AndroidBLEDriver` | Bridge to Kotlin, callback routing | +| Kotlin | `KotlinBLEBridge` | Entry point for Python, connection tracking, deduplication | +| Kotlin | `BleScanner` | Device discovery with adaptive intervals | +| Kotlin | `BleAdvertiser` | Peripheral advertising with identity | +| Kotlin | `BleGattClient` | Central mode GATT operations | +| Kotlin | `BleGattServer` | Peripheral mode GATT service | +| Kotlin | `BleOperationQueue` | Serialized GATT operations (Android limitation) | + +--- + +## GATT Service Structure + +The Reticulum BLE service follows Protocol v2.2 specification: + +```mermaid +classDiagram + class ReticulumService { + UUID: 37145b00-442d-4a94-917f-8f42c5da28e3 + Type: PRIMARY + } + + class RXCharacteristic { + UUID: 37145b00-442d-4a94-917f-8f42c5da28e5 + Properties: WRITE, WRITE_NO_RESPONSE + Permissions: WRITE + Purpose: Centrals write data here + } + + class TXCharacteristic { + UUID: 37145b00-442d-4a94-917f-8f42c5da28e4 + Properties: READ, NOTIFY, INDICATE + Permissions: READ + Purpose: Peripherals 
notify data here + } + + class IdentityCharacteristic { + UUID: 37145b00-442d-4a94-917f-8f42c5da28e6 + Properties: READ + Permissions: READ + Purpose: 16-byte transport identity + } + + class CCCDDescriptor { + UUID: 00002902-0000-1000-8000-00805f9b34fb + Purpose: Enable/disable notifications + } + + ReticulumService --> RXCharacteristic + ReticulumService --> TXCharacteristic + ReticulumService --> IdentityCharacteristic + TXCharacteristic --> CCCDDescriptor +``` + +### Characteristic Details + +| Characteristic | UUID Suffix | Direction | Purpose | +|----------------|-------------|-----------|---------| +| RX | `...28e5` | Central → Peripheral | Data and identity handshake writes | +| TX | `...28e4` | Peripheral → Central | Notifications for outbound data | +| Identity | `...28e6` | Read-only | Provides 16-byte transport identity hash | + +--- + +## Connection Flows + +### Central Mode Connection Sequence + +When this device discovers and connects to a peripheral: + +```mermaid +sequenceDiagram + participant Scan as BleScanner + participant Bridge as KotlinBLEBridge + participant Client as BleGattClient + participant Peer as Remote Peripheral + participant Python as AndroidBLEDriver + + Scan->>Bridge: onDeviceDiscovered(address, rssi) + Bridge->>Bridge: shouldConnect(address)? + Note over Bridge: MAC comparison:
our MAC < peer MAC = connect + Bridge->>Client: connect(address) + + rect rgb(230, 245, 255) + Note over Client,Peer: 4-Step GATT Handshake + Client->>Peer: 1. connectGatt() + Peer-->>Client: onConnectionStateChange(CONNECTED) + Client->>Peer: 2. discoverServices() + Peer-->>Client: onServicesDiscovered() + + Client->>Peer: Read Identity Characteristic + Peer-->>Client: 16-byte identity hash + Client->>Bridge: onIdentityReceived(address, hash) + + Client->>Peer: 3. requestMtu(517) + Peer-->>Client: onMtuChanged(negotiated_mtu) + + Client->>Peer: 4. Enable CCCD notifications + Peer-->>Client: onDescriptorWrite(success) + + Client->>Peer: Write our identity to RX + Peer-->>Client: onCharacteristicWrite(success) + end + + Client->>Bridge: onConnected(address, mtu, identity) + Bridge->>Python: onConnected callback + Python->>Python: Spawn BLEPeerInterface +``` + +### Peripheral Mode Connection Sequence + +When a remote central connects to us: + +```mermaid +sequenceDiagram + participant Central as Remote Central + participant Server as BleGattServer + participant Bridge as KotlinBLEBridge + participant Python as AndroidBLEDriver + + Central->>Server: connectGatt() + Server->>Server: onConnectionStateChange(CONNECTED) + Server->>Bridge: onCentralConnected(address, MIN_MTU) + Note over Bridge: Track as pending connection
(identity not yet received) + + Central->>Server: discoverServices() + Central->>Server: Read Identity Characteristic + Server-->>Central: Our 16-byte identity + + Central->>Server: requestMtu() + Server->>Server: onMtuChanged() + Server->>Bridge: onMtuChanged(address, mtu) + + Central->>Server: Enable CCCD notifications + + rect rgb(255, 245, 230) + Note over Central,Server: Identity Handshake + Central->>Server: Write 16 bytes to RX + Server->>Server: Detect: len=16, no existing identity + Server->>Bridge: onIdentityReceived(address, hash) + Server->>Bridge: onDataReceived(address, identity_bytes) + end + + Bridge->>Bridge: Complete connection with identity + Bridge->>Python: onConnected(address, mtu, "peripheral", identity) + Bridge->>Python: onIdentityReceived(address, hash) + Python->>Python: Spawn BLEPeerInterface +``` + +### Defensive Recovery for Missed onConnectionStateChange + +Android's `onConnectionStateChange` callback is unreliable and sometimes doesn't fire, even when a BLE connection is established. When this happens, the connection is "orphaned": data arrives from the peer, but nothing can be sent back because the address was never registered. + +The fix: when `handleCharacteristicWriteRequest` receives data from an address not in `connectedCentrals`, it retroactively registers the connection: + +```mermaid +sequenceDiagram + participant Central as Remote Central + participant Server as BleGattServer + participant Bridge as KotlinBLEBridge + participant Python as AndroidBLEDriver + + Central->>Server: connectGatt() + Note over Server: ⚠️ onConnectionStateChange NOT called
(Android BLE bug) + Note over Server: connectedCentrals is empty + + Central->>Server: Write data to RX characteristic + Server->>Server: handleCharacteristicWriteRequest + Server->>Server: Check: address in connectedCentrals? + + rect rgb(255, 230, 230) + Note over Server: DEFENSIVE RECOVERY + Server->>Server: Address NOT found!
Log warning + Server->>Server: Add to connectedCentrals + Server->>Server: Set MTU = MIN_MTU + Server->>Bridge: onCentralConnected(address, mtu) + Bridge->>Bridge: Add to connectedPeers + end + + Server->>Bridge: onDataReceived(address, data) + Note over Server,Python: Connection now properly tracked +``` + +**Key log message**: `"DEFENSIVE RECOVERY: Data received from {address} but onConnectionStateChange was never called!"` + +--- + +## Identity Protocol (v2.2) + +### Purpose + +Android randomizes MAC addresses for privacy. The identity protocol provides stable peer identification across MAC rotations. + +### Handshake Sequence (Central → Peripheral) + +```mermaid +sequenceDiagram + participant C as Central + participant P as Peripheral + + Note over C: Connect as GATT client + C->>P: Read Identity Characteristic + P-->>C: Peripheral's 16-byte identity + Note over C: Store: address → identity + + C->>P: Write 16 bytes to RX Characteristic + Note over P: Detect identity handshake:
exactly 16 bytes, no existing identity + Note over P: Store: address → identity + + Note over C,P: Both sides now have
identity ↔ address mapping +``` + +### Identity Tracking Data Structures + +```mermaid +flowchart LR + subgraph Python["Python (BLEInterface)"] + P_A2I["address_to_identity
MAC → 16-byte identity"] + P_I2A["identity_to_address
hash → MAC"] + P_SI["spawned_interfaces
hash → BLEPeerInterface"] + P_Cache["_identity_cache
MAC → (identity, timestamp)
TTL: 60s"] + end + + subgraph Kotlin["Kotlin (KotlinBLEBridge)"] + K_A2I["addressToIdentity
MAC → 32-char hex"] + K_I2A["identityToAddress
hex → MAC"] + K_Peers["connectedPeers
MAC → PeerConnection"] + K_Pending["pendingConnections
MAC → PendingConnection"] + end + + P_A2I -.->|sync| K_A2I + P_I2A -.->|sync| K_I2A +``` + +### MAC Rotation Handling + +When a peer reconnects with a new MAC address, the handling differs by connection mode: + +#### Overview + +```mermaid +flowchart TD + A[New connection from MAC_NEW] --> B{Identity received?} + B -->|Yes| C[Compute identity_hash] + C --> D{identity_hash in identityToAddress?} + D -->|Yes, points to MAC_OLD| E[MAC Rotation Detected] + E --> F{Is MAC_OLD still connected?} + F -->|No| G[Clean up stale mappings] + G --> H[Update: identity → MAC_NEW] + F -->|Yes| I[Dual connection - deduplicate] + D -->|No| J[New identity - normal flow] + B -->|No, peripheral| K[Wait for handshake] +``` + +#### Central Mode Flow (We Connect to Them) + +Identity is received via GATT read of Identity Characteristic, then processed in Kotlin's `handleIdentityReceived`: + +```mermaid +flowchart TD + A[We connect to MAC_NEW
Read Identity Characteristic] --> B[Kotlin: handleIdentityReceived
Gets 16-byte identity from GATT read] + B --> C{Kotlin: onDuplicateIdentityDetected?
Calls Python callback if set} + C -->|Callback returns True
identity already at different MAC| D[Reject: disconnect MAC_NEW
Log: Duplicate identity rejected] + C -->|Callback returns False
new identity or same MAC| E[Allow connection to proceed] + C -->|No callback set| E + + E --> F[Kotlin: Store addressToIdentity‹MAC_NEW›] + F --> G{Kotlin: identityToAddress‹hash› exists?} + G -->|"No (new identity)"| H[Store identityToAddress‹hash› = MAC_NEW
Notify Python: onConnected] + G -->|"Yes, = MAC_OLD"| I[Keep MAC_OLD as primary in identityToAddress
Still store addressToIdentity‹MAC_NEW›
Notify Python: onConnected] +``` + +**Key code reference**: `KotlinBLEBridge.handleIdentityReceived()` (duplicate identity check requires `onDuplicateIdentityDetected` callback) + +#### Peripheral Mode Flow (They Connect to Us) + +Identity is received via 16-byte write to RX characteristic, detected in Python's `_handle_identity_handshake`: + +```mermaid +flowchart TD + A[MAC_NEW connects to us
Writes 16-byte identity to RX] --> B{Python: _handle_identity_handshake
Entry check: len=16 AND
no address_to_identity‹MAC_NEW›} + B -->|Check fails| Z[Not a handshake, pass to data handler] + B -->|Check passes| C{Python: _check_duplicate_identity
Returns: True if duplicate, False otherwise} + + C -->|"Returns True
(identity_to_address‹hash› = MAC_OLD
AND MAC_OLD ≠ MAC_NEW)"| D[Reject: driver.disconnect‹MAC_NEW›
Log: duplicate identity rejected
Return True: handshake consumed] + C -->|"Returns False
(new identity OR same MAC)"| E[Allow: continue processing] + + E --> F[Store address_to_identity‹MAC_NEW› = identity
Store identity_to_address‹hash› = MAC_NEW] + F --> G{spawned_interfaces‹hash› exists?} + G -->|No| H[Create new BLEPeerInterface
Store in spawned_interfaces‹hash›] + G -->|Yes| I{existing.peer_address ≠ MAC_NEW?} + I -->|Yes| J[Update existing interface:
peer_address = MAC_NEW
address_to_interface‹MAC_NEW› = interface] + I -->|No| K[No update needed, same address] +``` + +**Key code reference**: `BLEInterface._handle_identity_handshake()` at lines 1108-1200 + +#### Return Value Clarification + +The `_check_duplicate_identity` function returns a **boolean**, not a MAC address: + +| Condition | Return Value | Meaning | +|-----------|--------------|---------| +| `identity_to_address[hash]` not found | `False` | New identity, allow | +| `identity_to_address[hash]` = MAC_NEW | `False` | Same MAC, allow | +| `identity_to_address[hash]` = MAC_OLD (≠ MAC_NEW) | `True` | Duplicate, reject | + +--- + +## Deduplication State Machine + +When the same identity is connected via both central and peripheral paths: + +```mermaid +stateDiagram-v2 + [*] --> NONE: Initial state + + NONE --> DualDetected: Same identity on both paths + + DualDetected --> DecisionPoint: Determine which to keep + + DecisionPoint --> CLOSING_CENTRAL: Keep peripheral
(our MAC > peer MAC) + DecisionPoint --> CLOSING_PERIPHERAL: Keep central
(our MAC < peer MAC) + + CLOSING_CENTRAL --> NONE: Central disconnected + CLOSING_PERIPHERAL --> NONE: Peripheral disconnected + + note right of DecisionPoint + Decision based on MAC comparison: + - Lower MAC = central role + - Higher MAC = peripheral role + end note +``` + +### DeduplicationState Enum + +```kotlin +enum class DeduplicationState { + NONE, // Normal - use actual isCentral/isPeripheral + CLOSING_CENTRAL, // Keeping peripheral, central disconnect pending + CLOSING_PERIPHERAL // Keeping central, peripheral disconnect pending +} +``` + +### Deduplication Flow + +```mermaid +sequenceDiagram + participant Bridge as KotlinBLEBridge + participant Client as BleGattClient + participant Server as BleGattServer + participant Python as AndroidBLEDriver + + Note over Bridge: Dual connection detected
Same identity on both paths + + Bridge->>Bridge: Compare MAC addresses + alt Our MAC < Peer MAC (we should be central) + Bridge->>Bridge: Set state = CLOSING_PERIPHERAL + Bridge->>Server: disconnectCentral(address) + Bridge->>Python: onAddressChanged(peripheral_addr, central_addr, identity) + else Our MAC > Peer MAC (we should be peripheral) + Bridge->>Bridge: Set state = CLOSING_CENTRAL + Bridge->>Client: disconnect(address) + Bridge->>Python: onAddressChanged(central_addr, peripheral_addr, identity) + end + + Note over Python: Update address mappings
Migrate fragmenter keys + + Bridge->>Bridge: Set state = NONE +``` + +--- + +## Data Flow + +### Sending Data (Python → BLE) + +```mermaid +flowchart TB + subgraph Python["Python Layer"] + A[BLEPeerInterface.process_outgoing] --> B[Get fragmenter by identity_key] + B --> C[BLEFragmenter.fragment] + C --> D["Fragments with header:
type(1) + seq(2) + total(2)"] + D --> E[AndroidBLEDriver.send] + end + + subgraph Kotlin["Kotlin Layer"] + E --> F[KotlinBLEBridge.sendAsync] + F --> G{Check deduplicationState} + G -->|NONE| H{isCentral?} + G -->|CLOSING_*| I[Block send - in transition] + H -->|Yes| J[GattClient.sendData] + H -->|No| K[GattServer.notifyCentrals] + J --> L[Write to RX characteristic] + K --> M[Notify via TX characteristic] + end + + L --> N[Remote peripheral receives] + M --> O[Remote central receives] +``` + +### Receiving Data (BLE → Python) + +```mermaid +flowchart TB + subgraph BLE["BLE Stack"] + A[Notification/Write received] + end + + subgraph Kotlin["Kotlin Layer"] + A --> B{Is central or peripheral?} + B -->|Central| C[onCharacteristicChanged] + B -->|Peripheral| D[onCharacteristicWriteRequest] + C --> E[Bridge.handleDataReceived] + D --> E + E --> F{Exactly 16 bytes, no identity?} + F -->|Yes| G[Identity handshake - store] + F -->|No| H[Forward to Python] + end + + subgraph Python["Python Layer"] + H --> I[AndroidBLEDriver._handle_data_received] + I --> J{Check identity handshake} + J -->|Yes, 16 bytes| K[_handle_identity_handshake] + J -->|No| L[_handle_ble_data] + L --> M[Get reassembler by identity_key] + M --> N[BLEReassembler.add_fragment] + N --> O{Complete packet?} + O -->|Yes| P[BLEPeerInterface.process_incoming] + O -->|No| Q[Wait for more fragments] + end +``` + +--- + +## Keepalive Mechanism + +Android BLE connections time out after 20-30 seconds of inactivity. Both layers implement keepalives: + +```mermaid +sequenceDiagram + participant Client as BleGattClient + participant Timer as Keepalive Timer
(15s interval) + participant Peer as Remote Peripheral + + Note over Client: Connection established + Client->>Timer: Start keepalive job + + loop Every 15 seconds + Timer->>Client: Send keepalive + Client->>Peer: Write 0x00 to RX + alt Success + Peer-->>Client: Write confirmed + Client->>Timer: Reset failure counter + else Failure + Client->>Timer: Increment failures + alt failures >= 3 + Timer->>Client: Connection dead + Client->>Client: disconnect() + end + end + end +``` + +### Keepalive Configuration + +| Parameter | Value | Source | +|-----------|-------|--------| +| Interval | 15 seconds | `BleConstants.CONNECTION_KEEPALIVE_INTERVAL_MS` | +| Max failures | 3 | `BleConstants.MAX_CONNECTION_FAILURES` | +| Packet | `0x00` (1 byte) | Minimal overhead | + +Both `BleGattClient` (central) and `BleGattServer` (peripheral) maintain independent keepalive mechanisms. + +> **Keepalives prove the *link*, not the *data path*.** A 1-byte keepalive write succeeds as +> long as the connection exists at the BLE link layer — even when larger data fragments are +> silently failing (RF degradation) and even when the application has stopped sending. Detecting +> a *data-dead* link is the job of the data-path liveness probe (below), not the keepalive. + +--- + +## Data-Path Liveness Probe (protocol v0.4.0) + +A BLE link can be connected at the link layer (which keeps idle connections alive with empty +PDUs and only drops on radio-loss supervision timeout) while real data silently stops flowing — +keepalives still succeed, but larger fragments fail. Every *reactive* liveness check misses +this because the link is genuinely "up": the peer looks "connected" forever while no data moves, +with no detection and no recovery. 
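
The difference between proving the link and proving the data path can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the `BLEInterface` code: `PeerLiveness`, `note_inbound` and `sweep_action` are hypothetical names introduced here, while the frame type bytes and the 15 s / 45 s defaults are taken from `BLE_PROTOCOL_v0.4.0.md`. For brevity the sweep returns a single action per call, whereas the spec evaluates the PING and disconnect conditions as two separate steps.

```python
from dataclasses import dataclass

# Frame type bytes as listed in BLE_PROTOCOL_v0.4.0.md.
KEEPALIVE = 0x00                     # 1-byte keepalive: proves the link only
FRAGMENT_TYPES = {0x01, 0x02, 0x03}  # START / CONTINUE / END
PING = 0x04
PONG = 0x05


@dataclass
class PeerLiveness:
    """Hypothetical per-peer record; not the real BLEInterface state."""
    last_real_data: float
    probe_capable: bool = False


def note_inbound(peer: PeerLiveness, frame: bytes, now: float) -> None:
    """Refresh last_real_data only for traffic that proves the data path."""
    if not frame:
        return
    if len(frame) == 1 and frame[0] == KEEPALIVE:
        return  # link is up, but this says nothing about real data
    kind = frame[0]
    if len(frame) == 2 and kind in (PING, PONG):
        peer.probe_capable = True    # capability inferred from probe traffic
        peer.last_real_data = now
    elif kind in FRAGMENT_TYPES:
        # Header validation is omitted in this sketch.
        peer.last_real_data = now


def sweep_action(peer: PeerLiveness, now: float,
                 probe_interval: float = 15.0,
                 timeout: float = 45.0) -> str:
    """Return 'disconnect', 'ping' or 'ok' for one sweep of one peer."""
    idle = now - peer.last_real_data
    if peer.probe_capable and idle > timeout:
        return "disconnect"          # data-dead: tear down so it reconnects
    if idle > probe_interval:
        return "ping"                # older peers are pinged but never reaped
    return "ok"
```

In this sketch a stream of `0x00` keepalives leaves `last_real_data` untouched, so a probe-capable peer that only ever delivers keepalives is eventually reported as `"disconnect"`, while a peer that has never sent a PING or PONG only ever yields `"ping"`.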
+ +`BLEInterface` adds an **active round-trip probe over the real data path**: + +| Frame | Bytes | Meaning | +|-------|----------------|-----------------------------------------------| +| PING | `0x04` + nonce | Liveness request (sent via `driver.send`) | +| PONG | `0x05` + nonce | Echoed reply | + +```mermaid +sequenceDiagram + participant A as Local + participant B as Peer (probe-capable) + Note over A: idle > data_path_probe_interval (15s) + A->>B: PING (0x04, nonce) [real data path] + alt data path alive + B-->>A: PONG (0x05, nonce) + Note over A: last_real_data refreshed → link stays fresh + else data path dead (PING never arrives) + Note over A: no PONG; last_real_data goes stale + Note over A: stale > data_path_timeout (45s) →
driver.disconnect() → reconnect + re-handshake + end +``` + +Key properties: + +- **The probe is the keep-fresh traffic.** On a genuinely idle-but-healthy link the PING/PONG + round-trip refreshes `last_real_data`, so idle links are *never* falsely reaped. +- **Capability is auto-negotiated.** A peer becomes "probe-capable" on the first PING/PONG seen; + only probe-capable peers are reaped on a dead path. The 2-byte frames are shorter than the + 5-byte fragment header, so peers that predate the probe reject them as "too short" and are + unaffected (and never reaped by the probe). +- **Asymmetric failures** are covered: each side detects death of its own *inbound* direction; + one side reconnecting re-establishes both. +- **Tunable** via `data_path_probe_interval` (15s), `data_path_timeout` (45s), + `data_path_probe_poll_interval` (10s). + +See `BLE_PROTOCOL_v0.4.0.md` for the normative spec. + +--- + +## Scanning and Advertising + +### Adaptive Scanning + +```mermaid +stateDiagram-v2 + [*] --> Active: Start scanning + + Active --> Active: New device discovered + Active --> Idle: 3 scans without new devices + + Idle --> Active: New device discovered + Idle --> Idle: No new devices + + note right of Active + Interval: 5s + Mode: BALANCED or LOW_LATENCY + end note + + note right of Idle + Interval: 30s + Mode: LOW_POWER + end note +``` + +### Scan Configuration + +| Parameter | Active | Idle | +|-----------|--------|------| +| Interval | 5 seconds | 30 seconds | +| Duration | 10 seconds | 10 seconds | +| Mode | `SCAN_MODE_BALANCED` | `SCAN_MODE_LOW_POWER` | +| Threshold | 3 devices | 3 empty scans | + +### Advertising with Proactive Refresh + +```mermaid +sequenceDiagram + participant Adv as BleAdvertiser + participant Timer as Refresh Timer
(60s interval) + participant Android as Android BLE + + Adv->>Android: startAdvertising() + Android-->>Adv: onStartSuccess() + Adv->>Timer: Start refresh job + + loop Every 60 seconds + Timer->>Adv: Proactive refresh + Adv->>Android: stopAdvertising() + Adv->>Android: startAdvertising() + Note over Adv: Ensures advertising persists
after screen off/background + end +``` + +### Advertisement Data Structure + +``` +Advertising Data (31 bytes max): +├── Flags (3 bytes) +└── Service UUID (19 bytes for 128-bit UUID) + +Scan Response (31 bytes separate budget): +└── (empty — device name not advertised) +``` + +--- + +## Address/Identity Mapping Summary + +### Python Layer (`BLEInterface`) + +| Dictionary | Key | Value | Purpose | +|------------|-----|-------|---------| +| `address_to_identity` | MAC address | 16-byte identity | MAC → identity lookup | +| `identity_to_address` | 16-char hash | MAC address | Identity → current MAC | +| `spawned_interfaces` | 16-char hash | BLEPeerInterface | Identity → interface | +| `address_to_interface` | MAC address | BLEPeerInterface | Fallback cleanup | +| `_identity_cache` | MAC address | (identity, timestamp) | Reconnection cache (60s TTL) | +| `_pending_identity_connections` | MAC address | timestamp | Timeout tracking | +| `_pending_detach` | 16-char hash | timestamp | Grace period detach | +| `pending_mtu` | MAC address | MTU value | MTU/identity race handling | +| `fragmenters` | identity_key | BLEFragmenter | Per-identity fragmentation | +| `reassemblers` | identity_key | BLEReassembler | Per-identity reassembly | + +### Kotlin Layer (`KotlinBLEBridge`) + +| Map | Key | Value | Purpose | +|-----|-----|-------|---------| +| `addressToIdentity` | MAC address | 32-char hex | MAC → identity | +| `identityToAddress` | 32-char hex | MAC address | Identity → MAC | +| `connectedPeers` | MAC address | PeerConnection | Active connections | +| `pendingConnections` | MAC address | PendingConnection | Awaiting identity | +| `pendingCentralConnections` | Set | - | In-progress central connects | +| `processedIdentityCallbacks` | Set | - | Prevent duplicate notifications | + +--- + +## Potential Issues & Recommendations + +### 1. 
GATT Operation Timeout (5s default) + +**Issue**: The default 5-second timeout in `BleOperationQueue` may be too short for slow or congested BLE environments. + +**Impact**: GATT operations may fail prematurely on: +- Older devices with slower BLE stacks +- Environments with high 2.4GHz interference +- During rapid connection/disconnection cycles + +**Recommendation**: Consider adaptive timeouts based on operation type and historical success rates. + +### 2. Advertising Refresh Interval (60s) + +**Issue**: The 60-second advertising refresh may miss discovery windows. + +**Impact**: If Android silently stops advertising immediately after screen-off, devices may be undiscoverable for up to 60 seconds. + +**Recommendation**: +- Reduce to 30 seconds when battery is not a concern +- Add `BroadcastReceiver` for `ACTION_SCREEN_OFF` to trigger immediate refresh + +### 3. Identity Cache Coherence + +**Issue**: The 60-second identity cache in Python may become stale if not properly synchronized with Kotlin state. + +**Impact**: Race conditions during rapid reconnection cycles could cause identity mismatches. + +**Recommendation**: Add explicit cache invalidation when Kotlin detects MAC rotation or deduplication. + +### 4. Fragmenter Key Complexity + +**Issue**: Fragmenter keys use `_get_fragmenter_key(identity, address)` but the address parameter is unused. + +**Current code**: +```python +def _get_fragmenter_key(self, peer_identity, address): + # Address unused - key is identity-based for MAC rotation immunity + return self._compute_identity_hash(peer_identity) +``` + +**Recommendation**: Remove unused `address` parameter to avoid confusion. + +### 5. Double Identity Callback Processing + +**Issue**: Both Kotlin (`onIdentityReceived`) and Python (`_handle_identity_handshake`) detect and process identity handshakes. + +**Impact**: Additional complexity and potential for desynchronization. 
+ +**Recommendation**: Single point of identity detection (Kotlin) with Python purely as a consumer. + +### 6. Grace Period Timing + +**Issue**: The 2-second detach grace period (`_pending_detach_grace_period`) is hardcoded. + +**Impact**: May not be sufficient for slow network conditions or concurrent reconnection attempts. + +**Recommendation**: Make configurable via interface parameters, with a suggested default of 3-5 seconds. + +--- + +## Key Constants Reference + +### UUIDs (BleConstants.kt) + +| Constant | Value | +|----------|-------| +| `SERVICE_UUID` | `37145b00-442d-4a94-917f-8f42c5da28e3` | +| `CHARACTERISTIC_RX_UUID` | `37145b00-442d-4a94-917f-8f42c5da28e5` | +| `CHARACTERISTIC_TX_UUID` | `37145b00-442d-4a94-917f-8f42c5da28e4` | +| `CHARACTERISTIC_IDENTITY_UUID` | `37145b00-442d-4a94-917f-8f42c5da28e6` | +| `CCCD_UUID` | `00002902-0000-1000-8000-00805f9b34fb` | + +### Timing Constants + +| Constant | Value | Location | +|----------|-------|----------| +| `CONNECTION_TIMEOUT_MS` | 30,000 ms | BleConstants | +| `CONNECTION_KEEPALIVE_INTERVAL_MS` | 15,000 ms | BleConstants | +| `DISCOVERY_INTERVAL_MS` | 5,000 ms | BleConstants | +| `DISCOVERY_INTERVAL_IDLE_MS` | 30,000 ms | BleConstants | +| `SCAN_DURATION_MS` | 10,000 ms | BleConstants | +| `ADVERTISING_REFRESH_INTERVAL_MS` | 60,000 ms | BleAdvertiser | +| `_identity_cache_ttl` | 60 s | BLEInterface | +| `_pending_detach_grace_period` | 2.0 s | BLEInterface | + + +### MTU Constants + +| Constant | Value | Meaning | +|----------|-------|---------| +| `MIN_MTU` | 23 | BLE 4.0 minimum | +| `DEFAULT_MTU` | 185 | Reasonable default | +| `MAX_MTU` | 517 | BLE 5.0 maximum | +| `HW_MTU` | 500 | Reticulum standard | + +--- + +## File Locations + +| Component | Path | +|-----------|------| +| BLEInterface.py | `app/build/python/pip/release/common/ble_reticulum/BLEInterface.py` | +| AndroidBLEDriver | `python/ble_modules/android_ble_driver.py` | +| KotlinBLEBridge | 
`reticulum/src/main/java/network.columba.app/reticulum/ble/bridge/KotlinBLEBridge.kt` | +| BleGattClient | `reticulum/src/main/java/network.columba.app/reticulum/ble/client/BleGattClient.kt` | +| BleGattServer | `reticulum/src/main/java/network.columba.app/reticulum/ble/server/BleGattServer.kt` | +| BleScanner | `reticulum/src/main/java/network.columba.app/reticulum/ble/client/BleScanner.kt` | +| BleAdvertiser | `reticulum/src/main/java/network.columba.app/reticulum/ble/server/BleAdvertiser.kt` | +| BleOperationQueue | `reticulum/src/main/java/network.columba.app/reticulum/ble/util/BleOperationQueue.kt` | +| BleConstants | `reticulum/src/main/java/network.columba.app/reticulum/ble/model/BleConstants.kt` | diff --git a/pyproject.toml b/pyproject.toml index aa24bc4..93663b5 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "ble-reticulum" -version = "0.2.2" +version = "0.3.0" description = "Bluetooth Low Energy (BLE) interface for Reticulum Network Stack" readme = "README.md" requires-python = ">=3.8" diff --git a/src/ble_reticulum/BLEInterface.py b/src/ble_reticulum/BLEInterface.py index 49ed348..4973fcb 100644 --- a/src/ble_reticulum/BLEInterface.py +++ b/src/ble_reticulum/BLEInterface.py @@ -399,6 +399,27 @@ def __init__(self, owner, configuration): self._last_real_data = {} self._zombie_timeout = 30.0 # seconds - connection is zombie if no real data for this long + # Data-path liveness probe (protocol v0.4.0). A small PING(0x04)/PONG(0x05) + # round-trip over the REAL data path detects a "connected but data-dead" link + # that neither the link layer (it keeps idle links up) nor keepalives (1-byte + # writes succeed while larger data fails) can catch -- then forces a reconnect. + # The probe IS the traffic, so a healthy IDLE link is kept fresh (no churn) + # while a genuinely dead data path goes stale and is reaped. 
Capability is + # auto-negotiated: a peer is marked probe-capable on the first PING/PONG seen, + # and only probe-capable peers are reaped on a dead path. The 2-byte frames are + # < the 5-byte fragment header, so peers that predate the probe reject them as + # "too short" and are never falsely reaped. Intervals are config-tunable. + self._probe_ping = 0x04 + self._probe_pong = 0x05 + # PING a link that has had no real data for this many seconds. + self._probe_interval = float(c.get("data_path_probe_interval", 15.0)) + # Reconnect a probe-capable peer whose data path has been silent this long. + self._data_path_timeout = float(c.get("data_path_timeout", 45.0)) + # How often the probe/detect loop runs. + self._probe_poll_interval = float(c.get("data_path_probe_poll_interval", 10.0)) + self._probe_capable = {} # identity_hash -> True (peer speaks the probe) + self._probe_timer = None + # Fragmentation self.fragmenters = {} # address -> BLEFragmenter (per MTU) self.reassemblers = {} # address -> BLEReassembler @@ -473,6 +494,9 @@ def __init__(self, owner, configuration): self.cleanup_timer = None self._start_cleanup_timer() + # Start the data-path liveness probe loop (PING/PONG -> detect data-dead -> reconnect) + self._start_probe_timer() + # Start the interface self.start() @@ -683,6 +707,91 @@ def _clear_stale_ble_paths(self): except Exception as e: RNS.log(f"{self} Error during stale path cleanup (non-fatal): {e}", RNS.LOG_WARNING) + def _send_probe(self, address, ptype, nonce): + """Send a 2-byte data-path probe frame (PING/PONG) over the real data path.""" + try: + self.driver.send(address, bytes([ptype, nonce & 0xFF])) + except Exception as e: + RNS.log(f"{self} data-path probe send to {address} failed: {e}", RNS.LOG_DEBUG) + + def _handle_probe_frame(self, address, data): + """ + Handle an inbound data-path liveness frame. Returns True if `data` was a + probe frame (and is now consumed), False otherwise. 
+ + Receiving ANY probe frame proves the inbound data path is alive and that the + peer speaks the probe (so it is marked probe-capable). A PING is echoed as a + PONG so the sender's round-trip completes. + """ + if len(data) != 2 or data[0] not in (self._probe_ping, self._probe_pong): + return False + # A peer can deliver a frame under its "dev:"-prefixed peripheral address + # while the central-path handshake stored its identity under the plain MAC + # (dual-role connection). Normalize so the identity resolves either way. + plain = address[4:] if address.startswith("dev:") else address + peer_identity = (self.address_to_identity.get(address) + or self.address_to_identity.get(plain) + or self.address_to_identity.get("dev:" + plain)) + if peer_identity: + identity_hash = self._compute_identity_hash(peer_identity) + self._last_real_data[identity_hash] = time.time() + self._probe_capable[identity_hash] = True + else: + RNS.log(f"{self} data-path probe from unmapped address {address}, dropping", RNS.LOG_EXTREME) + if data[0] == self._probe_ping: + self._send_probe(address, self._probe_pong, data[1]) + RNS.log(f"{self} data-path PING from {address}, replied PONG", RNS.LOG_EXTREME) + else: + RNS.log(f"{self} data-path PONG from {address}", RNS.LOG_EXTREME) + return True + + def _start_probe_timer(self): + """Start/restart the periodic data-path probe + dead-path detection loop.""" + if self._probe_timer: + self._probe_timer.cancel() + self._probe_timer = threading.Timer(self._probe_poll_interval, self._run_data_path_probes) + self._probe_timer.daemon = True + self._probe_timer.start() + + def _run_data_path_probes(self): + """ + Periodic data-path liveness sweep over the spawned peers. + + For each peer: + - If the link has been idle longer than _probe_interval, send a PING. On a + healthy link the peer echoes a PONG, which refreshes _last_real_data -- so + the probe is itself the traffic that keeps a genuinely idle-but-healthy link + from ever looking dead. 
Idle links are therefore never reaped. + - If a probe-capable peer's data path has been silent past _data_path_timeout, + the link is "connected but data-dead" (the link layer keeps the connection + up while real data silently fails); tear it down so it re-establishes. + + Peers that have never spoken the probe are not probe-capable and are left to + the existing reactive checks, so older peers are never falsely reaped. + """ + try: + now = time.time() + for identity_hash in list(self.spawned_interfaces.keys()): + address = self.identity_to_address.get(identity_hash) + if not address: + continue + idle = now - self._last_real_data.get(identity_hash, now) + if idle > self._probe_interval: + self._send_probe(address, self._probe_ping, int(now)) + if self._probe_capable.get(identity_hash) and idle > self._data_path_timeout: + RNS.log(f"{self} data-path dead for {identity_hash[:8]} " + f"(no real data for {idle:.0f}s > {self._data_path_timeout:.0f}s) -- reconnecting", + RNS.LOG_WARNING) + self._probe_capable.pop(identity_hash, None) + try: + self.driver.disconnect(address) + except Exception as e: + RNS.log(f"{self} probe-driven disconnect of {address} failed: {e}", RNS.LOG_DEBUG) + except Exception as e: + RNS.log(f"{self} data-path probe loop error: {e}", RNS.LOG_ERROR) + finally: + self._start_probe_timer() + def _start_cleanup_timer(self): """ Start the periodic cleanup timer. 
@@ -814,6 +923,7 @@ def _process_pending_detaches(self): # Clean up zombie detection tracking if identity_hash in self._last_real_data: del self._last_real_data[identity_hash] + self._probe_capable.pop(identity_hash, None) # Clean up fragmenter/reassembler now that interface is fully detached if peer_identity: frag_key = self._get_fragmenter_key(peer_identity, "") # Address unused in key computation @@ -1970,6 +2080,10 @@ def _handle_ble_data(self, peer_address, data): RNS.log(f"{self} received keep-alive from peer {peer_address}, ignoring", RNS.LOG_EXTREME) return + # Data-path liveness probe (PING/PONG round-trip over the real data path) + if self._handle_probe_frame(peer_address, data): + return + # Look up peer identity to compute fragmenter key peer_identity = self.address_to_identity.get(peer_address) if not peer_identity: @@ -2100,6 +2214,10 @@ def handle_peripheral_data(self, data, sender_address): RNS.log(f"{self} received keep-alive from central {sender_address}, ignoring", RNS.LOG_EXTREME) return + # Data-path liveness probe (PING/PONG round-trip over the real data path) + if self._handle_probe_frame(sender_address, data): + return + # Check if we have peer identity peer_identity = self.address_to_identity.get(sender_address) @@ -2342,6 +2460,11 @@ def detach(self): self.cleanup_timer.cancel() self.cleanup_timer = None + # Cancel data-path probe timer + if self._probe_timer: + self._probe_timer.cancel() + self._probe_timer = None + # Detach spawned interfaces for peer_if in list(self.spawned_interfaces.values()): peer_if.detach()
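
Review note: the probe logic added in the hunks above can be exercised in isolation. The following is a minimal sketch, not the code in `BLEInterface.py`. The helper names `make_probe`, `classify_frame`, and `sweep_actions` are hypothetical and exist only for illustration. The type bytes (`0x04`/`0x05`), the 2-byte frame length, and the default intervals (15 s and 45 s) are taken from the diff.

```python
# Hypothetical standalone helpers mirroring the PING/PONG logic in this diff.
PROBE_PING = 0x04  # liveness request type byte (from the diff)
PROBE_PONG = 0x05  # liveness reply type byte (from the diff)


def make_probe(ptype: int, nonce: int) -> bytes:
    """Build a 2-byte probe frame; the nonce is truncated to one byte."""
    return bytes([ptype, nonce & 0xFF])


def classify_frame(data: bytes):
    """Return ("ping" | "pong", nonce) for a probe frame, else None.

    Only exactly-2-byte frames whose first byte is a probe type qualify,
    so the 1-byte keepalive and longer data fragments are not matched.
    """
    if len(data) != 2 or data[0] not in (PROBE_PING, PROBE_PONG):
        return None
    kind = "ping" if data[0] == PROBE_PING else "pong"
    return kind, data[1]


def sweep_actions(idle: float, probe_capable: bool,
                  probe_interval: float = 15.0,
                  data_path_timeout: float = 45.0) -> list:
    """Return the actions one sweep iteration would take for a single peer.

    The two checks are independent, as in the diff: a stale peer is pinged,
    and a probe-capable peer past the timeout is also disconnected.
    """
    actions = []
    if idle > probe_interval:
        actions.append("ping")
    if probe_capable and idle > data_path_timeout:
        actions.append("disconnect")
    return actions


if __name__ == "__main__":
    print(classify_frame(make_probe(PROBE_PING, 42)))
    print(sweep_actions(idle=50.0, probe_capable=True))
    print(sweep_actions(idle=50.0, probe_capable=False))
```

Keeping the frame parsing and the sweep decision as pure functions like this would make the reaping rule unit-testable without a BLE driver. Whether that refactor fits the interface is a judgment call for the author.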