Skip to content

fix: verify connection is still alive before rejecting duplicate identity#38

Merged
torlando-tech merged 8 commits into
mainfrom
fix/stale-identity-to-address-rejection
Jan 18, 2026
Merged

fix: verify connection is still alive before rejecting duplicate identity#38
torlando-tech merged 8 commits into
mainfrom
fix/stale-identity-to-address-rejection

Conversation

@torlando-tech

Copy link
Copy Markdown
Owner

Summary

Fixes a bug where legitimate reconnections from the same identity were incorrectly rejected as "duplicates".

Problem: When a peer disconnects, identity_to_address is NOT cleaned up immediately - it's only removed after a 2-second grace period in _process_pending_detaches(). However, _check_duplicate_identity() was not checking if the existing address is still connected before rejecting.

This caused:

  • Same identity trying to reconnect with a new MAC (Android MAC rotation) → REJECTED
  • Same identity reconnecting after disconnect → REJECTED during grace period

Solution: Before rejecting as duplicate, verify that the old connection is still alive:

  1. Check if pending_detach exists for this identity (meaning old connection is already gone)
  2. Check if existing address is still in connected_peers or peers dict

If either check fails, the old connection is dead → allow the new connection and clean up stale mappings.

Test plan

  • Added TDD-style tests in tests/test_ble_duplicate_identity_stale.py
  • Updated existing tests in test_mac_rotation_blacklist_bug.py to set up active connections
  • All 306 tests pass

Changes

  • src/ble_reticulum/BLEInterface.py: Added connection liveness checks to _check_duplicate_identity()
  • tests/test_ble_duplicate_identity_stale.py: New TDD test file demonstrating the bug and fix
  • tests/test_mac_rotation_blacklist_bug.py: Updated mock fixtures to include peers, _pending_detach, and connected_peers

🤖 Generated with Claude Code

…tity

When a peer disconnects, identity_to_address is NOT cleaned up immediately -
it's only removed after a 2-second grace period. However, _check_duplicate_identity
was not checking if the existing address is still connected before rejecting.

This caused legitimate reconnections from the same identity (after MAC rotation
or reconnection) to be incorrectly rejected as "duplicates" during the grace
period or when cleanup was delayed.

The fix adds two checks before rejecting:
1. If pending_detach exists for this identity (old connection already gone)
2. If existing address is not in connected_peers or peers dict

Also adds TDD tests that demonstrate the bug and verify the fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@codecov

codecov Bot commented Jan 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps

greptile-apps Bot commented Jan 18, 2026

Copy link
Copy Markdown

Greptile Summary

Fixed a critical bug where legitimate reconnections from the same identity were incorrectly rejected as duplicates. The issue occurred because identity_to_address cleanup happens after a 2-second grace period, but _check_duplicate_identity() wasn't verifying if the old connection was still alive.

Key Changes

  • Connection Liveness Checks: Added three-tier verification in _check_duplicate_identity() to check if existing connection is truly active before rejecting

    • Check 1: Pending detach exists (old connection already gone)
    • Check 2: Address not in connected_peers or peers dict (stale entry)
    • Check 3: Zombie detection (keepalives work but no real data for 30+ seconds)
  • Zombie Connection Detection: New feature to handle degraded BLE links where 1-byte keepalives succeed but larger data packets fail

    • Tracks _last_real_data timestamp for each identity
    • 30-second configurable timeout (_zombie_timeout)
    • Automatically disconnects zombie connections when replacement arrives
  • Duplicate Handshake Handling: Fixed race condition where Kotlin callback sets identity before 16-byte handshake data arrives through data channel, preventing reassembler errors

Issues Found

  • Logic bug in Check 2 (line 1067): Nested conditionals create a gap where addresses in driver.connected_peers but not in self.peers won't trigger cleanup. Should use AND logic instead of nested IF.

Confidence Score: 4/5

  • Safe to merge with one logic issue that should be addressed
  • The fix addresses a real bug with well-documented behavior and comprehensive test coverage (306 tests passing). However, Check 2's nested conditional logic has a gap that could allow stale connections in edge cases. The zombie detection feature is well-designed and tested. Overall implementation is solid but the logic issue should be fixed before merge.
  • Pay close attention to src/ble_reticulum/BLEInterface.py line 1067 - the nested conditional logic needs correction

Important Files Changed

Filename Overview
src/ble_reticulum/BLEInterface.py Added connection liveness checks to prevent false duplicate rejections. Implemented zombie connection detection to handle degraded BLE links. Check 2 has a logical issue with nested conditionals.
tests/test_ble_duplicate_identity_stale.py New TDD test file demonstrating the stale entry bug and verifying the fix with comprehensive test cases for all connection states.
tests/test_zombie_connection_detection.py New test file with thorough coverage of zombie connection detection feature, including edge cases and threshold testing.

Sequence Diagram

sequenceDiagram
    participant Peer as Peer Device
    participant Driver as BLE Driver
    participant IF as BLEInterface
    participant Maps as State Maps

    Note over Peer,Maps: Scenario 1: Stale Entry After Disconnect

    Peer->>Driver: Connect (MAC_OLD)
    Driver->>IF: _connected_callback(MAC_OLD, identity_A)
    IF->>Maps: identity_to_address[hash_A] = MAC_OLD
    IF->>Maps: address_to_identity[MAC_OLD] = identity_A

    Note over Peer,Maps: Connection established

    Peer->>Driver: Disconnect
    Driver->>IF: _disconnected_callback(MAC_OLD)
    IF->>Maps: _pending_detach[hash_A] = timestamp
    Note over Maps: identity_to_address[hash_A] still exists<br/>(2s grace period)

    Peer->>Driver: Reconnect (MAC_NEW, identity_A)
    Driver->>IF: _check_duplicate_identity(MAC_NEW, identity_A)
    
    alt Check 1: pending_detach exists
        IF->>Maps: Check _pending_detach[hash_A]
        Maps-->>IF: Found (connection gone)
        IF->>IF: _cleanup_stale_address(hash_A, MAC_OLD)
        IF-->>Driver: False (allow connection)
    else Check 2: not in connected_peers
        IF->>Maps: Check driver.connected_peers[MAC_OLD]
        Maps-->>IF: Not found
        IF->>Maps: Check peers[MAC_OLD]
        Maps-->>IF: Not found
        IF->>IF: _cleanup_stale_address(hash_A, MAC_OLD)
        IF-->>Driver: False (allow connection)
    end

    Driver->>IF: Connection proceeds with MAC_NEW
    
    Note over Peer,Maps: Scenario 2: Zombie Connection Detection

    Peer->>IF: Keepalive packets only (1 byte)
    Note over IF,Maps: No real data for 30+ seconds<br/>_last_real_data[hash_A] = old_timestamp

    Peer->>Driver: Reconnect attempt (MAC_NEW, identity_A)
    Driver->>IF: _check_duplicate_identity(MAC_NEW, identity_A)
    
    IF->>Maps: Check driver.connected_peers[MAC_OLD]
    Maps-->>IF: Still connected (keepalives work)
    
    alt Check 3: zombie connection
        IF->>Maps: Check _last_real_data[hash_A]
        Maps-->>IF: 60 seconds ago
        IF->>IF: time_since_data > _zombie_timeout (30s)
        IF->>IF: _cleanup_stale_address(hash_A, MAC_OLD)
        IF->>Driver: disconnect(MAC_OLD)
        IF-->>Driver: False (allow connection)
    end

    Driver->>IF: Connection proceeds with MAC_NEW
    
    Note over Peer,Maps: Scenario 3: Healthy Connection (Reject Duplicate)

    Peer->>IF: Regular data packets
    IF->>Maps: _last_real_data[hash_A] = current_time
    
    Peer->>Driver: Duplicate attempt (MAC_NEW, identity_A)
    Driver->>IF: _check_duplicate_identity(MAC_NEW, identity_A)
    
    IF->>Maps: Check _pending_detach[hash_A]
    Maps-->>IF: Not found
    IF->>Maps: Check driver.connected_peers[MAC_OLD]
    Maps-->>IF: Connected
    IF->>Maps: Check _last_real_data[hash_A]
    Maps-->>IF: Recent (< 30s)
    
    IF-->>Driver: True (reject duplicate)
    Driver->>Peer: Connection rejected
Loading

torlando-tech and others added 6 commits January 18, 2026 12:47
When BLE link degrades, 1-byte keepalives may still work while larger data
packets fail. Both sides think the connection is "alive" based on keepalives,
but data can't flow. This causes a deadlock where new connections are
rejected as "duplicates" even though the existing connection is non-functional.

This change adds zombie detection by tracking when real data (not keepalives)
was last received. If an existing connection has only exchanged keepalives
for > 30 seconds (configurable via _zombie_timeout), new connections from
the same identity are allowed and the zombie connection is disconnected.

Changes:
- Add _last_real_data dict to track last real data timestamp per identity
- Add _zombie_timeout (default 30s) for configurable zombie threshold
- Update _check_duplicate_identity with Check 3: zombie detection
- Update _handle_ble_data to track real data activity after keepalive filter
- Initialize tracking in _handle_identity_handshake and _spawn_peer_interface
- Clean up tracking in _process_pending_detaches
- Add comprehensive test suite for zombie detection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… via Kotlin callback

When Kotlin provides the identity via callback (from the identity characteristic read),
the address_to_identity mapping gets set BEFORE the 16-byte handshake data arrives
through _data_received_callback. Previously, _handle_identity_handshake would see the
identity already exists and return False, causing the 16-byte handshake data to be
passed to the reassembler where it fails with "Invalid fragment type 0xXX".

The fix checks if received 16-byte data matches the known identity and consumes it
silently if so. This prevents the handshake data from being misinterpreted as a
fragment.

Symptoms fixed:
- BLEReassembler: Invalid fragment type 0xc9 (first byte of peer identity)
- Messages not flowing even though connections appear established

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests verify that:
- Duplicate 16-byte handshake matching known identity is consumed
- Different 16-byte data is also consumed to prevent reassembler errors
- Non-16-byte data is not incorrectly consumed as handshake
- Normal handshake processing works when identity not yet known

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…_identity

Add tests covering previously uncovered code paths:
- Pending detach check (Check 1) allowing reconnection
- Not-connected check (Check 2) allowing reconnection
- Exception handling when zombie disconnect fails

Improves patch coverage for PR #38 from 48.57% to full coverage
of the _check_duplicate_identity changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add test for _pending_identity_connections cleanup during successful
identity handshake (lines 1272-1275), achieving 100% patch coverage
for PR #38 changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace Mock-based fixtures with real BLEInterface instances in
stale identity check tests. This ensures coverage.py properly
tracks execution of production code paths.

The Mock approach with method binding executed the production code
but coverage tracking was inconsistent. Using real instances
guarantees proper coverage attribution.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@torlando-tech

Copy link
Copy Markdown
Owner Author

@greptileai

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/ble_reticulum/BLEInterface.py
…tests

Add tests to test_zombie_connection_detection.py (which CI runs) to cover:
- _handle_identity_handshake: non-16-byte rejection, duplicate handling
- _pending_identity_connections cleanup after handshake
- _spawn_peer_interface zombie tracking initialization

These tests cover the same code paths as test_v2_2_identity_handshake.py
but are in a file that CI includes, achieving 100% patch coverage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@torlando-tech torlando-tech merged commit 07d9413 into main Jan 18, 2026
13 checks passed
@torlando-tech torlando-tech deleted the fix/stale-identity-to-address-rejection branch January 18, 2026 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant