Fix RAK4631 Ethernet gateway API connection loss after W5100S brownout#9754
Fix RAK4631 Ethernet gateway API connection loss after W5100S brownout#9754PhilipLykov wants to merge 6 commits intomeshtastic:developfrom
Conversation
PoE power instability can brownout the W5100S while the nRF52 MCU keeps running, causing all chip registers (MAC, IP, sockets) to revert to defaults. The firmware had no mechanism to detect or recover from this. Changes: - Detect W5100S chip reset by periodically verifying MAC address register in reconnectETH(); on mismatch, perform full hardware reset and re-initialize Ethernet interface and services - Add deInitApiServer() for clean API server teardown during recovery - Add ~APIServerPort destructor to prevent memory leaks - Switch nRF52 from EthernetServer::available() to accept() to prevent the same connected client from being repeatedly re-reported - Add proactive dead-connection cleanup in APIServerPort::runOnce() - Add 15-minute TCP idle timeout to close half-open connections that consume limited W5100S hardware sockets Fixes meshtastic#6970 Made-with: Cursor
|
@PhilipLykov You think this would resolve this issue? #8462 |
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical issue where RAK4631 Ethernet gateway devices lose API connectivity after a few minutes due to W5100S chip brownouts caused by PoE power instability. The fix implements hardware reset detection, proper recovery mechanisms, and several improvements to TCP connection management to prevent socket exhaustion.
Changes:
- Added W5100S chip reset detection by monitoring MAC address registers and full re-initialization on mismatch
- Implemented proper API server teardown (
deInitApiServer()) and destructor to prevent memory leaks during recovery - Fixed nRF52 connection handling to use
accept()instead ofavailable()and added proactive cleanup of dead connections - Added 15-minute TCP idle timeout to prevent half-open connections from exhausting the W5100S's 4 hardware sockets
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mesh/eth/ethClient.cpp | Implements W5100S reset detection via MAC verification and full hardware re-initialization |
| src/mesh/api/ethServerAPI.h | Adds deInitApiServer() declaration for clean API server teardown |
| src/mesh/api/ethServerAPI.cpp | Implements deInitApiServer() to properly destroy the API server |
| src/mesh/api/ServerAPI.h | Adds destructor to prevent memory leaks and isClientConnected() helper method |
| src/mesh/api/ServerAPI.cpp | Implements TCP idle timeout, proactive dead-connection cleanup, and switches nRF52 to accept() |
Address Copilot review comment: log millis() - lastContactMsec to show the real time since last client activity, rather than always logging the TCP_IDLE_TIMEOUT_MS constant. Made-with: Cursor
…com/PhilipLykov/Meshtastic_firmware into fix/rak4631-eth-api-connection-loss
|
|
Re: #8462 - This PR does not fix that issue. #8462 is a different bug on the radio side (RadioLib error -7 = RADIOLIB_ERR_CRC_MISMATCH), where the SX1262 LoRa radio stops decoding packets correctly after a few days. The two chips live on completely separate SPI buses on the RAK4631:
This PR fixes W5100S Ethernet chip brownout recovery (API/TCP connectivity loss). The radio CRC issue in #8462 would need its own detection and recovery - likely periodic verification of SX1262 configuration registers and re-initialization if corrupted. However, if both issues are triggered by the same PoE voltage transients, the root cause is the same (power instability), just affecting different chips independently. |
| { | ||
| if (client.connected()) { | ||
| if (lastContactMsec > 0 && !Throttle::isWithinTimespanMs(lastContactMsec, TCP_IDLE_TIMEOUT_MS)) { | ||
| LOG_WARN("TCP connection timeout, no data for %lu ms", (unsigned long)(millis() - lastContactMsec)); |
There was a problem hiding this comment.
should lastContactMsec also be updated in cases other than inbound API data?
looks like:
- For Ethernet TCP API,
lastContactMsecis not updated by outbound traffic, socket-level ACKs, or link-level activity. - It advances only on inbound API protobuf frames from client.
- So the new 15-minute timeout effectively enforces: client must send periodic API traffic (typically heartbeat) to stay connected.
|
@robekl Good observation — this is intentional and consistent with how the rest of the Meshtastic codebase handles connection timeouts.
Why not update on outbound traffic? The purpose of this timeout is to detect dead/abandoned connections (crashed clients leaving half-open TCP sockets). If we updated The Meshtastic protocol already accounts for this via the |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
RAK4631 Ethernet gateway devices (RAK13800-W5100S) lose API connectivity after a few minutes of operation while radio and ping continue to work. Only a power reset recovers connectivity.
Root cause: PoE power instability can brownout the W5100S Ethernet chip while the nRF52 MCU keeps running. This causes all W5100S registers (MAC address, IP configuration, socket states) to silently revert to defaults. The firmware had no mechanism to detect or recover from this hardware-level reset. Several pre-existing software issues compounded the problem by preventing graceful recovery even from normal TCP disconnects.
Changes
W5100S reset detection and recovery (
ethClient.cpp): Periodically verify the W5100S MAC address register inreconnectETH(). On mismatch (indicating a chip reset), perform a full hardware reset viaPIN_ETHERNET_RESET, re-initialize the Ethernet interface (MAC, IP via DHCP/static), and trigger re-initialization of all Ethernet services (API server, NTP, syslog) by resetting theethStartupCompleteflag.API server clean teardown (
ethServerAPI.cpp/.h): AdddeInitApiServer()function (matching the existing WiFi convention) to properly destroy the API server during Ethernet re-initialization, preventing stale socket/pointer issues.APIServerPort destructor (
ServerAPI.h): Add~APIServerPort()destructor to properly delete theopenAPIinstance, preventing memory leaks.Switch nRF52 to
accept()(ServerAPI.cpp): Change nRF52 fromEthernetServer::available()toaccept()(aligning with RP2040).available()repeatedly returns the same connected client, causing the server to attempt re-accepting an already-active connection.Proactive dead-connection cleanup (
ServerAPI.cpp): Check and clean up disconnectedopenAPIinstances at the beginning ofAPIServerPort::runOnce()before accepting new connections.TCP idle timeout (
ServerAPI.cpp): Add a 15-minute inactivity timeout to forcefully close half-open TCP connections. The W5100S has only 4 hardware sockets, and half-open connections from crashed clients permanently consume them.Hardware context
Fixes #6970
Test plan
rak4631_eth_gwtarget successfully