fix: run TCP connect on its own task - instant UI / screen wake #31
…ake)

The blocking WiFiClient.connect() ran on the main loop and stalled it for the lwIP default (~18.5s) when the host was unreachable -- including the DNS lookup -- freezing the UI, so the screen took ~10s to wake. The 2-arg connect() also ignored CONNECT_TIMEOUT_MS (that only bounds reads).

Move only the blocking connect() to a dedicated FreeRTOS task. read/write/frame stay on the main loop exactly as before (unchanged low-latency data path -- the link/Resource timing is untouched). An atomic _conn_state hands _client ownership between the task (while CONNECTING) and the main loop (while CONNECTED) so they never touch the socket concurrently. Bound the connect via the 3-arg connect() and back off retries to 15s.

tests/hardware: wait_for_tcp_link() matched "started", keying on interface startup rather than the actual connect. With the async connect that let the harness drive the announce before the link was up, so the device's announce was lost and the first direct message (bz2-probe) failed. Match "connected to".

Verified on a T-Deck: screen wake instant (connect off the main loop); e2e smoke 5/5 including bz2-on-receive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
Greptile Summary
This PR moves the blocking
Confidence Score: 5/5
Safe to merge: every cross-thread shared field is now properly atomic, socket ownership is cleanly transferred via the seq-cst _conn_state machine, and the task shutdown path closes within a bounded deadline before any destructor teardown.
Every concern from the previous review round has been addressed: _task_running/_task_done upgraded to std::atomic, _last_connect_attempt and _reconnected made atomic, the redundant _online write removed from the task, and CONNECTED stored before _reconnected to prevent a stale announce. The stop() join now polls _task_done rather than sleeping a fixed interval, eliminating the prior use-after-free window on destruction. No new concurrency issues were found in the revised implementation.
No files require special attention; the atomic sequencing in TCPClientInterface.cpp is consistent and correct throughout.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant ML as Main Loop (core 1)
participant AT as _conn_state (atomic)
participant TT as tcp_task (core 0)
Note over ML,TT: start() seeds _last_connect_attempt, spawns tcp_task
TT->>AT: load() → DISCONNECTED
TT->>AT: store(CONNECTING)
Note over TT: WiFiClient.connect() blocks here (off main loop)
ML->>AT: load() → CONNECTING
ML-->>ML: "_online=false, return early"
alt connect() succeeds
TT->>TT: "_frame_buffer.clear(), _last_data_received=millis()"
TT->>AT: store(CONNECTED)
TT->>TT: _reconnected.store(true)
ML->>AT: load() → CONNECTED
ML-->>ML: "_online=true, read/write/frame on main loop"
else connect() fails
TT->>AT: store(DISCONNECTED)
Note over TT: retry after RECONNECT_WAIT_MS
end
Note over ML: On socket drop
ML->>ML: handle_disconnect() → disconnect()
ML->>AT: store(DISCONNECTED)
Note over ML: On stop() / destructor
ML->>TT: "_task_running=false (atomic)"
ML-->>ML: poll _task_done up to 30s
TT->>TT: "exit task_loop(), _task_done=true, vTaskDelete"
ML->>AT: store(DISCONNECTED)
ML->>ML: disconnect() / _client.stop()
Reviews (7): Last reviewed commit: "fix(tcp): close stop() teardown UAF wind..."
…eptile)

- stop() now waits on a _task_done flag the task sets right before exiting, instead of a fixed sleep. Closes a use-after-free window where an in-flight connect() overrunning CONNECT_TIMEOUT_MS (slow DNS) could touch `this` after ~TCPClientInterface() freed it.
- _last_connect_attempt is now std::atomic<uint32_t>: it is read and written by task_loop() (core 0) and handle_disconnect() (core 1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
…ayed (greptile)

With _last_connect_attempt == 0, task_loop()'s first `now - _last_connect_attempt >= RECONNECT_WAIT_MS` check only passes once millis() >= RECONNECT_WAIT_MS, delaying the very first connect up to 15s after boot. Seed it to millis() - RECONNECT_WAIT_MS in start() so the first attempt fires immediately (unsigned wraparound keeps it correct when millis() < the wait).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
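The wraparound argument in this commit can be checked on a host machine. The following is a minimal sketch, not the PR's code: `retry_due`, `seed_last_attempt`, and `check_boot_case` are hypothetical helper names, and the 15000 ms value is taken from the "back off retries to 15s" wording.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value: the PR text says retries back off to 15s.
constexpr uint32_t RECONNECT_WAIT_MS = 15000;

// True when at least RECONNECT_WAIT_MS have elapsed since last_attempt.
// Unsigned subtraction is modulo 2^32, so the result stays correct across
// a millis() rollover and for a seed that wrapped below zero.
constexpr bool retry_due(uint32_t now, uint32_t last_attempt) {
    return static_cast<uint32_t>(now - last_attempt) >= RECONNECT_WAIT_MS;
}

// Seed that makes the very first check pass immediately.
// When now < RECONNECT_WAIT_MS the subtraction wraps to a large value.
constexpr uint32_t seed_last_attempt(uint32_t now) {
    return now - RECONNECT_WAIT_MS;
}

// Boot case: millis() is still far below the wait interval.
inline void check_boot_case() {
    const uint32_t now = 250;
    assert(!retry_due(now, 0));                      // unseeded: first connect is delayed
    assert(retry_due(now, seed_last_attempt(now)));  // seeded: fires immediately
}
```

With the seed in place, the difference `now - last_attempt` equals exactly RECONNECT_WAIT_MS at start time, regardless of how small `now` is.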
Both are written by task_loop() (core 0) and read by stop() (core 1); volatile gives no cross-core ordering. Use std::atomic<bool> to match the other shared flags (_conn_state, _reconnected, _last_connect_attempt) and give stop()'s join a well-defined happens-before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
…greptile)

task_loop() set `_online = true` during the CONNECTING window, which races with loop()'s `_online = false` on the main loop (plain bool, no synchronizes-with). It's redundant: the main loop sets `_online = true` when it observes CONNECTED. Removing it eliminates the race with no behaviour change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
… dropped (greptile)

Storing _reconnected before _conn_state=CONNECTED left a seq-cst window where the main loop could observe _reconnected==true while still CONNECTING. check_reconnected() would then clear the flag and announce on an offline interface (loop() returns early), so no announce fired once actually connected. Store CONNECTED first; seq-cst then guarantees _reconnected is only ever observed true on an online interface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
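The store ordering this commit describes can be illustrated on a host with std::thread standing in for the FreeRTOS task. This is a sketch under assumptions: `Flags`, `publish_connected`, and `ordering_holds` are hypothetical names, and the observer thread plays the role of the main loop's reconnect check.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

enum class ConnState : int { DISCONNECTED, CONNECTING, CONNECTED };

// Hypothetical stand-in for the two shared fields named in the commit message.
struct Flags {
    std::atomic<ConnState> conn_state{ConnState::CONNECTING};
    std::atomic<bool> reconnected{false};
};

// Task side: publish CONNECTED first, then raise the reconnect flag.
// Both stores use the default memory_order_seq_cst.
inline void publish_connected(Flags& f) {
    f.conn_state.store(ConnState::CONNECTED);
    f.reconnected.store(true);
}

// Repeats the handoff `rounds` times. Returns false if the observer ever
// sees reconnected == true while conn_state is not yet CONNECTED.
inline bool ordering_holds(int rounds) {
    for (int i = 0; i < rounds; ++i) {
        Flags f;
        bool violated = false;
        std::thread observer([&] {
            while (!f.reconnected.load()) {
                std::this_thread::yield();
            }
            if (f.conn_state.load() != ConnState::CONNECTED) {
                violated = true;
            }
        });
        publish_connected(f);
        observer.join();  // join makes the write to `violated` visible here
        if (violated) {
            return false;
        }
    }
    return true;
}
```

Because the CONNECTED store is sequenced before the flag store, any thread that reads the flag as true is guaranteed to read CONNECTED afterwards. Swapping the two stores in `publish_connected` would reopen the window the commit closes.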
…dline (greptile)

stop()'s join had a fixed deadline (CONNECT_TIMEOUT_MS + 2s); a slow DNS could keep the task inside connect() past it, so stop() would free the object while the task still referenced `this`. Extend the deadline well beyond any connect()+DNS, and if it still expires, vTaskDelete(_task_handle) the task so it can't touch `this` after return. (The task's own self-delete path sets _task_done first, so this branch only runs when it has not self-deleted, so there is no double delete.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWZuYkHBRqNb6BZHV8sTG5
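The deadline-bounded join can be sketched on a host as follows. This is an illustration only: the `Worker` class and its members are hypothetical, std::thread replaces the FreeRTOS task, and since std::thread has no counterpart to vTaskDelete the sketch always joins after the deadline instead of force-deleting.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

class Worker {
public:
    // blocking_call stands in for the duration of one connect()+DNS attempt.
    void start(std::chrono::milliseconds blocking_call) {
        _running.store(true);
        _done.store(false);
        _thread = std::thread([this, blocking_call] {
            do {
                std::this_thread::sleep_for(blocking_call);
            } while (_running.load());
            _done.store(true);  // last touch of `this` before the task exits
        });
    }

    // Requests shutdown and polls the done flag until the deadline.
    // Returns true if the task acknowledged exit in time.
    bool stop(std::chrono::milliseconds deadline) {
        _running.store(false);
        const auto give_up = std::chrono::steady_clock::now() + deadline;
        while (!_done.load() && std::chrono::steady_clock::now() < give_up) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        const bool clean = _done.load();
        // The PR force-deletes the task at this point when !clean.
        // This sketch joins instead so the object is never freed early.
        if (_thread.joinable()) {
            _thread.join();
        }
        return clean;
    }

    ~Worker() {
        if (_thread.joinable()) {
            _running.store(false);
            _thread.join();
        }
    }

private:
    std::atomic<bool> _running{false};
    std::atomic<bool> _done{false};
    std::thread _thread;
};
```

A worker whose blocking call is short acknowledges well inside the deadline, while one stuck in a long call makes `stop()` report an unclean exit. That unclean exit is the case in which the commit falls back to vTaskDelete.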
What
The TCP interface's blocking connect() ran on the main loop, freezing the UI: the screen took ~10s to wake when the configured TCP host was unreachable. This moves the connect onto its own FreeRTOS task.
Root cause
WiFiClient.connect(host, port) (2-arg) blocks the caller for the lwIP default (~18.5s), including the DNS lookup, when the host can't be reached, and setTimeout() only bounds reads, not the connect. With a dead host the main loop spent ~18.5s per attempt in reticulum->loop() / the TCP loop, so input + redraw (and thus screen wake) were starved. Confirmed on-device via per-loop-step timing: steps 4/6 blocked 18512 ms.
Fix
- connect() moves to a dedicated task. read/write/frame processing stay on the main loop exactly as before; the data path (and link/Resource timing) is unchanged.
- _conn_state (DISCONNECTED → CONNECTING → CONNECTED) hands _client ownership between the task (while connecting) and the main loop (while connected), so the socket is never touched from two threads.
- The connect is bounded via the 3-arg connect(host, port, timeout), and reconnect is backed off to 15s.
- tests/hardware/tdeck_harness.py: wait_for_tcp_link() matched "started", so it keyed on interface startup rather than the actual connect. With the async connect that let the harness drive the announce before the link was up (the announce was lost → first direct message failed). Now matches "connected to".
Why connect-only (not full socket-on-task)
An earlier attempt moved all socket I/O to the task via stream buffers; it regressed the bz2-probe because the extra buffer hop perturbed the first link handshake. Keeping read/write on the main loop avoids that entirely.
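The connect-only split relies on the state variable to decide who may touch the socket. A minimal host-side sketch of that handoff follows. It is not the PR's implementation: `Link`, `FakeSocket`, `connect_once`, `loop_step`, and `spin_until_connected` are hypothetical names, std::thread stands in for the FreeRTOS task, and a short sleep stands in for the blocking connect().

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

enum class ConnState : int { DISCONNECTED, CONNECTING, CONNECTED };

struct FakeSocket {  // stands in for WiFiClient
    bool open = false;
};

class Link {
public:
    // Task side: owns the socket only while the state is CONNECTING.
    void connect_once(bool host_reachable) {
        ConnState expected = ConnState::DISCONNECTED;
        if (!_state.compare_exchange_strong(expected, ConnState::CONNECTING)) {
            return;  // someone else owns the socket right now
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(20));  // "blocking" connect
        _sock.open = host_reachable;
        _state.store(host_reachable ? ConnState::CONNECTED
                                    : ConnState::DISCONNECTED);
    }

    // Main-loop side: returns early unless CONNECTED, so it never blocks
    // and never touches the socket while the task owns it.
    bool loop_step() {
        if (_state.load() != ConnState::CONNECTED) {
            return false;
        }
        assert(_sock.open);  // CONNECTED implies the task opened the socket
        return true;         // read/write/frame processing would run here
    }

    ConnState state() const { return _state.load(); }

private:
    std::atomic<ConnState> _state{ConnState::DISCONNECTED};
    FakeSocket _sock;
};

// Runs one connect attempt on a worker thread while the caller keeps
// spinning its loop. Returns how many loop iterations ran without a link.
inline int spin_until_connected(Link& link, bool host_reachable) {
    std::thread worker([&] { link.connect_once(host_reachable); });
    int idle_iterations = 0;
    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::milliseconds(500);
    while (!link.loop_step() && std::chrono::steady_clock::now() < deadline) {
        ++idle_iterations;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    worker.join();
    return idle_iterations;
}
```

The main loop keeps iterating while the connect is in flight, which is the property that restores UI responsiveness. The socket field is written before the CONNECTED store and read only after a CONNECTED load, so the two threads never access it concurrently.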
Testing (on a T-Deck)
Notes
TCPClientInterface + the test harness.