Skip to content

fix: run TCP connect on its own task — instant UI / screen wake#31

Merged
torlando-tech merged 7 commits into
mainfrom
fix/tcp-interface-own-task
Jun 20, 2026
Merged

fix: run TCP connect on its own task — instant UI / screen wake#31
torlando-tech merged 7 commits into
mainfrom
fix/tcp-interface-own-task

Conversation

@torlando-tech

Copy link
Copy Markdown
Owner

What

The TCP interface's blocking connect() ran on the main loop, freezing the UI — the screen took ~10s to wake when the configured TCP host was unreachable. This moves the connect onto its own FreeRTOS task.

Root cause

WiFiClient.connect(host, port) (2-arg) blocks the caller for the lwIP default (~18.5s), including the DNS lookup, when the host can't be reached — and setTimeout() only bounds reads, not the connect. With a dead host the main loop spent ~18.5s per attempt in reticulum->loop() / the TCP loop, so input + redraw (and thus screen wake) were starved. Confirmed on-device via per-loop-step timing: steps 4/6 blocked 18512 ms.

Fix

  • Only the blocking connect() moves to a dedicated task. read/write/frame processing stay on the main loop exactly as before — the data path (and link/Resource timing) is unchanged.
  • An atomic _conn_state (DISCONNECTED → CONNECTING → CONNECTED) hands _client ownership between the task (while connecting) and the main loop (while connected), so the socket is never touched from two threads.
  • Bounded the connect with the 3-arg connect(host, port, timeout) and backed reconnect off to 15s.
  • tests/hardware/tdeck_harness.py: wait_for_tcp_link() matched "started", so it keyed on interface startup rather than the actual connect. With the async connect that let the harness drive the announce before the link was up (the announce was lost → first direct message failed). Now matches "connected to".

Why connect-only (not full socket-on-task)

An earlier attempt moved all socket I/O to the task via stream buffers; it regressed the bz2-probe because the extra buffer hop perturbed the first link handshake. Keeping read/write on the main loop avoids that entirely.

Testing (on a T-Deck)

  • Per-loop-step timing: zero main-loop stalls with the host unreachable (was 18512 ms); screen wake instant.
  • e2e smoke: 5/5 including bz2-on-receive — the data path (direct + opportunistic LXMF, compressed Resource RX) is intact.

Notes

…ake)

The blocking WiFiClient.connect() ran on the main loop and stalled it for the
lwIP default (~18.5s) when the host was unreachable -- including the DNS lookup --
freezing the UI, so the screen took ~10s to wake. The 2-arg connect() also
ignored CONNECT_TIMEOUT_MS (that only bounds reads).

Move only the blocking connect() to a dedicated FreeRTOS task. read/write/frame
stay on the main loop exactly as before (unchanged low-latency data path -- the
link/Resource timing is untouched). An atomic _conn_state hands _client ownership
between the task (while CONNECTING) and the main loop (while CONNECTED) so they
never touch the socket concurrently. Bound the connect via the 3-arg connect()
and back off retries to 15s.

tests/hardware: wait_for_tcp_link() matched "started", keying on interface
startup rather than the actual connect. With the async connect that let the
harness drive the announce before the link was up, so the device's announce was
lost and the first direct message (bz2-probe) failed. Match "connected to".

Verified on a T-Deck: screen wake instant (connect off the main loop); e2e smoke
5/5 including bz2-on-receive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
@greptile-apps

greptile-apps Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR moves the blocking WiFiClient.connect() call (which could stall the main loop for ~18.5s on an unreachable host) onto a dedicated FreeRTOS task, while keeping all read/write/frame processing on the main loop. An atomic _conn_state state machine (DISCONNECTED → CONNECTING → CONNECTED) transfers socket ownership between the task and the main loop so neither ever touches _client concurrently.

  • Core change (TCPClientInterface.cpp/.h): New tcp_task / task_loop() own _client only during the blocking connect() phase; on success they store CONNECTED (before _reconnected) and hand ownership to the main loop. All previously flagged data-race fields (_task_running, _task_done, _last_connect_attempt, _reconnected) are now std::atomic; the stop() shutdown sequence polls _task_done for up to 30 s instead of relying on a fixed sleep, closing the prior use-after-free window.
  • Test harness (tdeck_harness.py): wait_for_tcp_link() now matches "connected to" (the actual TCP handshake log line) rather than "started" (task startup), preventing the harness from driving an announce before the link is live.

Confidence Score: 5/5

Safe to merge — every cross-thread shared field is now properly atomic, socket ownership is cleanly transferred via the seq-cst _conn_state machine, and the task shutdown path closes within a bounded deadline before any destructor teardown.

Every concern from the previous review round has been addressed: _task_running/_task_done upgraded to std::atomic, _last_connect_attempt and _reconnected made atomic, the redundant _online write removed from the task, and CONNECTED stored before _reconnected to prevent a stale announce. The stop() join now polls _task_done rather than sleeping a fixed interval, eliminating the prior use-after-free window on destruction. No new concurrency issues were found in the revised implementation.

No files require special attention — the atomic sequencing in TCPClientInterface.cpp is consistent and correct throughout.

Important Files Changed

Filename Overview
src/TCPClientInterface.cpp Core change: blocking connect() moved to a dedicated FreeRTOS task with correct atomic state machine (_conn_state: DISCONNECTED→CONNECTING→CONNECTED). All previously-flagged data races (_task_running, _task_done, _last_connect_attempt, _online, _reconnected) are now resolved; task shutdown uses a 30s deadline poll on _task_done instead of a fixed sleep.
src/TCPClientInterface.h New ARDUINO-only members: ConnState enum, tcp_task/task_loop, TaskHandle_t, and std::atomic fields for _task_running, _task_done, _conn_state, _last_connect_attempt, _reconnected. check_reconnected() upgraded to atomic exchange. All cross-thread shared state is now properly atomic.
tests/hardware/tdeck_harness.py wait_for_tcp_link() predicate tightened to match "connected to" (actual TCP handshake log) rather than "started" (task startup). Prevents the harness from driving the announce before the link is up, which was causing the first direct message to be lost.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant ML as Main Loop (core 1)
    participant AT as _conn_state (atomic)
    participant TT as tcp_task (core 0)

    Note over ML,TT: start() seeds _last_connect_attempt, spawns tcp_task
    TT->>AT: load() → DISCONNECTED
    TT->>AT: store(CONNECTING)
    Note over TT: WiFiClient.connect() blocks here (off main loop)
    ML->>AT: load() → CONNECTING
    ML-->>ML: "_online=false, return early"

    alt connect() succeeds
        TT->>TT: "_frame_buffer.clear(), _last_data_received=millis()"
        TT->>AT: store(CONNECTED)
        TT->>TT: _reconnected.store(true)
        ML->>AT: load() → CONNECTED
        ML-->>ML: "_online=true, read/write/frame on main loop"
    else connect() fails
        TT->>AT: store(DISCONNECTED)
        Note over TT: retry after RECONNECT_WAIT_MS
    end

    Note over ML: On socket drop
    ML->>ML: handle_disconnect() → disconnect()
    ML->>AT: store(DISCONNECTED)

    Note over ML: On stop() / destructor
    ML->>TT: "_task_running=false (atomic)"
    ML-->>ML: poll _task_done up to 30s
    TT->>TT: "exit task_loop(), _task_done=true, vTaskDelete"
    ML->>AT: store(DISCONNECTED)
    ML->>ML: disconnect() / _client.stop()
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant ML as Main Loop (core 1)
    participant AT as _conn_state (atomic)
    participant TT as tcp_task (core 0)

    Note over ML,TT: start() seeds _last_connect_attempt, spawns tcp_task
    TT->>AT: load() → DISCONNECTED
    TT->>AT: store(CONNECTING)
    Note over TT: WiFiClient.connect() blocks here (off main loop)
    ML->>AT: load() → CONNECTING
    ML-->>ML: "_online=false, return early"

    alt connect() succeeds
        TT->>TT: "_frame_buffer.clear(), _last_data_received=millis()"
        TT->>AT: store(CONNECTED)
        TT->>TT: _reconnected.store(true)
        ML->>AT: load() → CONNECTED
        ML-->>ML: "_online=true, read/write/frame on main loop"
    else connect() fails
        TT->>AT: store(DISCONNECTED)
        Note over TT: retry after RECONNECT_WAIT_MS
    end

    Note over ML: On socket drop
    ML->>ML: handle_disconnect() → disconnect()
    ML->>AT: store(DISCONNECTED)

    Note over ML: On stop() / destructor
    ML->>TT: "_task_running=false (atomic)"
    ML-->>ML: poll _task_done up to 30s
    TT->>TT: "exit task_loop(), _task_done=true, vTaskDelete"
    ML->>AT: store(DISCONNECTED)
    ML->>ML: disconnect() / _client.stop()
Loading

Reviews (7): Last reviewed commit: "fix(tcp): close stop() teardown UAF wind..." | Re-trigger Greptile

Comment thread src/TCPClientInterface.cpp
Comment thread src/TCPClientInterface.cpp
torlando-agent Bot and others added 2 commits June 19, 2026 22:23
…eptile)

- stop() now waits on a _task_done flag the task sets right before exiting,
  instead of a fixed sleep. Closes a use-after-free window where an in-flight
  connect() overrunning CONNECT_TIMEOUT_MS (slow DNS) could touch `this` after
  ~TCPClientInterface() freed it.
- _last_connect_attempt is now std::atomic<uint32_t> — it's read/written by
  task_loop() (core 0) and handle_disconnect() (core 1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
…ayed (greptile)

With _last_connect_attempt == 0, task_loop()'s first
`now - _last_connect_attempt >= RECONNECT_WAIT_MS` check only passes once
millis() >= RECONNECT_WAIT_MS, delaying the very first connect up to 15s after
boot. Seed it to millis() - RECONNECT_WAIT_MS in start() so the first attempt
fires immediately (unsigned wraparound keeps it correct when millis() < the wait).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
Comment thread src/TCPClientInterface.h Outdated
Both are written by task_loop() (core 0) and read by stop() (core 1); volatile
gives no cross-core ordering. Use std::atomic<bool> to match the other shared
flags (_conn_state, _reconnected, _last_connect_attempt) and give stop()'s join
a well-defined happens-before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
Comment thread src/TCPClientInterface.cpp
…greptile)

task_loop() set `_online = true` during the CONNECTING window, which races with
loop()'s `_online = false` on the main loop (plain bool, no synchronizes-with).
It's redundant: the main loop sets `_online = true` when it observes CONNECTED.
Removing it eliminates the race with no behaviour change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
Comment thread src/TCPClientInterface.cpp Outdated
torlando-agent Bot and others added 2 commits June 19, 2026 23:02
… dropped (greptile)

Storing _reconnected before _conn_state=CONNECTED left a seq-cst window where the
main loop could observe _reconnected==true while still CONNECTING. check_reconnected()
would then clear the flag and announce on an offline interface (loop() returns
early), so no announce fired once actually connected. Store CONNECTED first; seq-cst
then guarantees _reconnected is only ever observed true on an online interface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
…dline (greptile)

stop()'s join had a fixed deadline (CONNECT_TIMEOUT_MS + 2s); a slow DNS could
keep the task inside connect() past it, so stop() would free the object while the
task still referenced `this`. Extend the deadline well beyond any connect()+DNS,
and if it still expires, vTaskDelete(_task_handle) the task so it can't touch
`this` after return. (The task's own self-delete path sets _task_done first, so
this branch only runs when it has not self-deleted — no double delete.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
@torlando-tech torlando-tech merged commit cdbdc4a into main Jun 20, 2026
4 checks passed
@torlando-tech torlando-tech deleted the fix/tcp-interface-own-task branch June 20, 2026 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant