Fix RAK4631 Ethernet gateway API connection loss after W5100S brownout #9754

Open
PhilipLykov wants to merge 6 commits into meshtastic:develop from PhilipLykov:fix/rak4631-eth-api-connection-loss

@PhilipLykov PhilipLykov commented Feb 26, 2026

Summary

RAK4631 Ethernet gateway devices (RAK13800-W5100S) lose API connectivity after a few minutes of operation while radio and ping continue to work. Only a power reset recovers connectivity.

Root cause: PoE power instability can brown out the W5100S Ethernet chip while the nRF52 MCU keeps running. When that happens, all W5100S registers (MAC address, IP configuration, socket states) silently revert to their defaults. The firmware had no mechanism to detect or recover from this hardware-level reset. Several pre-existing software issues compounded the problem by preventing graceful recovery even from normal TCP disconnects.

Changes

  1. W5100S reset detection and recovery (ethClient.cpp): Periodically verify the W5100S MAC address register in reconnectETH(). On mismatch (indicating a chip reset), perform a full hardware reset via PIN_ETHERNET_RESET, re-initialize the Ethernet interface (MAC, IP via DHCP/static), and trigger re-initialization of all Ethernet services (API server, NTP, syslog) by resetting the ethStartupComplete flag.

  2. API server clean teardown (ethServerAPI.cpp/.h): Add deInitApiServer() function (matching the existing WiFi convention) to properly destroy the API server during Ethernet re-initialization, preventing stale socket/pointer issues.

  3. APIServerPort destructor (ServerAPI.h): Add ~APIServerPort() destructor to properly delete the openAPI instance, preventing memory leaks.

  4. Switch nRF52 to accept() (ServerAPI.cpp): Change nRF52 from EthernetServer::available() to accept() (aligning with RP2040). available() repeatedly returns the same connected client, causing the server to attempt re-accepting an already-active connection.

  5. Proactive dead-connection cleanup (ServerAPI.cpp): Check and clean up disconnected openAPI instances at the beginning of APIServerPort::runOnce() before accepting new connections.

  6. TCP idle timeout (ServerAPI.cpp): Add a 15-minute inactivity timeout to forcefully close half-open TCP connections. The W5100S has only 4 hardware sockets, and half-open connections from crashed clients permanently consume them.
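
The detection idea in item 1 can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from this PR: `FakeW5100S`, `RecoveryState`, and `checkAndRecover` are hypothetical names, the fake chip stands in for real SPI register reads, and the default MAC register value after a brownout is assumed to be all zeros.

```cpp
#include <array>
#include <cstdint>

using Mac = std::array<uint8_t, 6>;

// Hypothetical stand-in for the Ethernet chip. Its MAC register reverts to
// a default value (assumed all zeros here) when the chip browns out.
struct FakeW5100S {
    Mac macRegister{};
    void configure(const Mac &mac) { macRegister = mac; }
    void brownout() { macRegister = Mac{}; } // registers revert to defaults
};

// Hypothetical bookkeeping; servicesStarted plays the role of a
// "startup complete" flag that gates service initialization.
struct RecoveryState {
    bool servicesStarted = false;
    int resetCount = 0;
};

// Compare the chip's MAC register with the MAC we configured earlier.
// On mismatch, reconfigure the chip and mark services for re-initialization.
// Returns true if a recovery was performed.
bool checkAndRecover(FakeW5100S &chip, const Mac &expected, RecoveryState &state)
{
    if (chip.macRegister == expected)
        return false;           // chip still holds our configuration
    chip.configure(expected);   // real firmware: pulse the reset pin, full re-init
    state.servicesStarted = false;
    state.resetCount++;
    return true;
}
```

Calling `checkAndRecover` periodically is cheap when nothing is wrong, since the common path is a single register comparison.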

Hardware context

  • W5100S has a strict limit of 4 hardware sockets
  • W5100S lacks hardware TCP keepalive
  • PoE-powered devices are susceptible to voltage transients that can brown out the W5100S while the MCU continues running
  • Symptom observable on the network switch: multiple random MAC addresses appearing on the port where only one device is connected
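
To make the 4-socket limit concrete, here is a small sketch of a fixed-size socket pool. It is an illustration only: `SocketPool` and `kHardwareSockets` are hypothetical names, and the pool models only slot bookkeeping, not the chip's actual socket registers.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kHardwareSockets = 4; // limit stated in the list above

// Hypothetical model of a fixed pool of hardware sockets.
struct SocketPool {
    std::array<bool, kHardwareSockets> inUse{};

    // Returns the index of a free slot and marks it used, or -1 if the pool
    // is exhausted (for example, by half-open connections never released).
    int acquire()
    {
        for (std::size_t i = 0; i < inUse.size(); ++i) {
            if (!inUse[i]) {
                inUse[i] = true;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Frees a slot; out-of-range indices are ignored.
    void release(int slot)
    {
        if (slot >= 0 && static_cast<std::size_t>(slot) < inUse.size())
            inUse[static_cast<std::size_t>(slot)] = false;
    }
};
```

Once four connections are held, every further `acquire()` fails until something calls `release()`, which is why connections that never close on their own need an explicit timeout.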

Fixes #6970

Test plan

  • Build rak4631_eth_gw target successfully
  • Deploy to RAK4631 + RAK13800 device with PoE
  • Verify API connectivity persists beyond the previous failure window (typically a few minutes)
  • Verify radio functionality is unaffected
  • Monitor switch port for MAC address stability (should see only one consistent MAC)
  • Verify DHCP lease renewal works after recovery
  • Test with static IP configuration
  • Verify NTP re-synchronization after recovery
  • Confirm no regressions on ESP32 and RP2040 Ethernet targets

PoE power instability can brownout the W5100S while the nRF52 MCU keeps
running, causing all chip registers (MAC, IP, sockets) to revert to
defaults. The firmware had no mechanism to detect or recover from this.

Changes:
- Detect W5100S chip reset by periodically verifying MAC address register
  in reconnectETH(); on mismatch, perform full hardware reset and
  re-initialize Ethernet interface and services
- Add deInitApiServer() for clean API server teardown during recovery
- Add ~APIServerPort destructor to prevent memory leaks
- Switch nRF52 from EthernetServer::available() to accept() to prevent
  the same connected client from being repeatedly re-reported
- Add proactive dead-connection cleanup in APIServerPort::runOnce()
- Add 15-minute TCP idle timeout to close half-open connections that
  consume limited W5100S hardware sockets

Fixes meshtastic#6970

Made-with: Cursor
@github-actions github-actions bot added the bugfix Pull request that fixes bugs label Feb 26, 2026
@Xaositek

@PhilipLykov You think this would resolve this issue? #8462

@thebentern thebentern requested a review from Copilot February 26, 2026 13:39

Copilot AI left a comment

Pull request overview

This PR fixes a critical issue where RAK4631 Ethernet gateway devices lose API connectivity after a few minutes due to W5100S chip brownouts caused by PoE power instability. The fix implements hardware reset detection, proper recovery mechanisms, and several improvements to TCP connection management to prevent socket exhaustion.

Changes:

  • Added W5100S chip reset detection by monitoring MAC address registers and full re-initialization on mismatch
  • Implemented proper API server teardown (deInitApiServer()) and destructor to prevent memory leaks during recovery
  • Fixed nRF52 connection handling to use accept() instead of available() and added proactive cleanup of dead connections
  • Added 15-minute TCP idle timeout to prevent half-open connections from exhausting the W5100S's 4 hardware sockets
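
The difference between `available()` and `accept()` described in this PR can be illustrated with a deliberately simplified model. `FakeServer` is hypothetical and does not reproduce the Ethernet library's implementation; it only mirrors the behavior the PR describes, where `available()` keeps reporting an already-connected client and `accept()` hands each connection out once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified server model. Client ids are plain ints and
// -1 means "no client".
struct FakeServer {
    std::vector<int> tracked;       // connections the server knows about
    std::size_t nextUnaccepted = 0; // first connection not yet handed out

    void incoming(int id) { tracked.push_back(id); }

    // available()-style: reports a tracked client, so repeated calls can
    // return the same connection again.
    int available() const { return tracked.empty() ? -1 : tracked.front(); }

    // accept()-style: each connection is returned exactly once.
    int accept()
    {
        return nextUnaccepted < tracked.size() ? tracked[nextUnaccepted++] : -1;
    }
};
```

With `accept()`-style semantics, the caller owns each connection after the first hand-off, so the server loop cannot mistake an active client for a new one.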

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/mesh/eth/ethClient.cpp | Implements W5100S reset detection via MAC verification and full hardware re-initialization |
| src/mesh/api/ethServerAPI.h | Adds deInitApiServer() declaration for clean API server teardown |
| src/mesh/api/ethServerAPI.cpp | Implements deInitApiServer() to properly destroy the API server |
| src/mesh/api/ServerAPI.h | Adds destructor to prevent memory leaks and isClientConnected() helper method |
| src/mesh/api/ServerAPI.cpp | Implements TCP idle timeout, proactive dead-connection cleanup, and switches nRF52 to accept() |

Address Copilot review comment: log millis() - lastContactMsec to show
the real time since last client activity, rather than always logging the
TCP_IDLE_TIMEOUT_MS constant.

Made-with: Cursor
@PhilipLykov

> @PhilipLykov You think this would resolve this issue? #8462

That issue mixes the radio's -7 error with the Ethernet problem. I believe my PR will fix the Ethernet problem described there.

@PhilipLykov

Re: #8462 - This PR does not fix that issue. #8462 is a different bug on the radio side (RadioLib error -7 = RADIOLIB_ERR_CRC_MISMATCH), where the SX1262 LoRa radio stops decoding packets correctly after a few days.

The two chips live on completely separate SPI buses on the RAK4631:

  • SX1262 (LoRa radio): SPI main bus (pins 42-45)
  • W5100S (Ethernet): SPI1 secondary bus (pins 3, 29, 30, 26)

This PR fixes W5100S Ethernet chip brownout recovery (API/TCP connectivity loss). The radio CRC issue in #8462 would need its own detection and recovery - likely periodic verification of SX1262 configuration registers and re-initialization if corrupted.

However, if both issues are triggered by the same PoE voltage transients, the root cause is the same (power instability), just affecting different chips independently.

{
    if (client.connected()) {
        if (lastContactMsec > 0 && !Throttle::isWithinTimespanMs(lastContactMsec, TCP_IDLE_TIMEOUT_MS)) {
            LOG_WARN("TCP connection timeout, no data for %lu ms", (unsigned long)(millis() - lastContactMsec));

Should lastContactMsec also be updated in cases other than inbound API data?

It looks like:

  • For the Ethernet TCP API, lastContactMsec is not updated by outbound traffic, socket-level ACKs, or link-level activity.
  • It advances only on inbound API protobuf frames from the client.
  • So the new 15-minute timeout effectively enforces that the client must send periodic API traffic (typically a heartbeat) to stay connected.

@PhilipLykov

PhilipLykov commented Feb 26, 2026

@robekl Good observation — this is intentional and consistent with how the rest of the Meshtastic codebase handles connection timeouts.

lastContactMsec is only updated on inbound data by design. Both SerialConsole and SerialModule use the exact same pattern — a 15-minute timeout based solely on inbound lastContactMsec (see SERIAL_CONNECTION_TIMEOUT in SerialConsole.cpp:27 and SerialModule.cpp:61). Our TCP timeout mirrors this existing convention.

Why not update on outbound traffic? The purpose of this timeout is to detect dead/abandoned connections (crashed clients leaving half-open TCP sockets). If we updated lastContactMsec on outbound data, a node that keeps forwarding mesh packets to a dead client would keep the connection alive forever — exactly the socket leak we need to prevent. The W5100S has only 4 hardware sockets; we cannot afford that.
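
The inbound-only rule described above can be sketched as follows. This is a standalone illustration, not the PR's code: `Connection`, `onInbound`, `onOutbound`, and `kIdleTimeoutMs` are hypothetical names, and the caller passes the current millisecond tick in explicitly instead of calling `millis()`.

```cpp
#include <cstdint>

constexpr uint32_t kIdleTimeoutMs = 15UL * 60UL * 1000UL; // 15 minutes

// Hypothetical connection record. Only inbound data refreshes the timestamp,
// so a peer that has gone silent cannot be kept alive by our own sends.
struct Connection {
    uint32_t lastContactMsec = 0; // 0 means "no inbound data seen yet"

    void onInbound(uint32_t now) { lastContactMsec = now; }
    void onOutbound(uint32_t /*now*/) {} // deliberately does not refresh

    // Unsigned subtraction keeps the comparison correct across the
    // 32-bit millisecond counter wrapping around.
    bool timedOut(uint32_t now) const
    {
        return lastContactMsec > 0 &&
               static_cast<uint32_t>(now - lastContactMsec) > kIdleTimeoutMs;
    }
};
```

A client that sends a heartbeat more often than every 15 minutes keeps calling `onInbound` and never reaches the timeout, while a crashed client stops refreshing the timestamp no matter how much data is sent to it.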

The Meshtastic protocol already accounts for this via the heartbeat message (meshtastic_ToRadio_heartbeat_tag). Clients are expected to send periodic heartbeats, and each heartbeat flows through handleToRadio() which updates lastContactMsec. The official Meshtastic app sends these regularly, so compliant clients will never hit the 15-minute window.

thebentern and others added 2 commits February 27, 2026 05:34
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[Bug]: Can't connect to WisMesh Ethernet over network through App or official web client

5 participants