Fix RAK4631 Ethernet gateway API connection loss after W5100S brownout by PhilipLykov · Pull Request #9754 · meshtastic/firmware

PhilipLykov · 2026-02-26T12:54:41Z

Summary

RAK4631 Ethernet gateway devices (RAK13800-W5100S) lose API connectivity after a few minutes of operation while radio and ping continue to work. Only a power reset recovers connectivity.

Root cause: PoE power instability can brownout the W5100S Ethernet chip while the nRF52 MCU keeps running. This causes all W5100S registers (MAC address, IP configuration, socket states) to silently revert to defaults. The firmware had no mechanism to detect or recover from this hardware-level reset. Several pre-existing software issues compounded the problem by preventing graceful recovery even from normal TCP disconnects.

Changes

W5100S reset detection and recovery (ethClient.cpp): Periodically verify the W5100S MAC address register in reconnectETH(). On mismatch (indicating a chip reset), perform a full hardware reset via PIN_ETHERNET_RESET, re-initialize the Ethernet interface (MAC, IP via DHCP/static), and trigger re-initialization of all Ethernet services (API server, NTP, syslog) by resetting the ethStartupComplete flag.
API server clean teardown (ethServerAPI.cpp/.h): Add deInitApiServer() function (matching the existing WiFi convention) to properly destroy the API server during Ethernet re-initialization, preventing stale socket/pointer issues.
APIServerPort destructor (ServerAPI.h): Add ~APIServerPort() destructor to properly delete the openAPI instance, preventing memory leaks.
Switch nRF52 to accept() (ServerAPI.cpp): Change nRF52 from EthernetServer::available() to accept() (aligning with RP2040). available() repeatedly returns the same connected client, causing the server to attempt re-accepting an already-active connection.
Proactive dead-connection cleanup (ServerAPI.cpp): Check and clean up disconnected openAPI instances at the beginning of APIServerPort::runOnce() before accepting new connections.
TCP idle timeout (ServerAPI.cpp): Add a 15-minute inactivity timeout to forcefully close half-open TCP connections. The W5100S has only 4 hardware sockets, and half-open connections from crashed clients permanently consume them.
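
As a rough illustration of the reset detection described in the first item above, the following sketch compares the MAC address held by the chip against the one the firmware programmed. `FakeEthChip`, `chipWasReset`, and `checkAndRecover` are hypothetical names invented for this example, and the all-zero default MAC is an assumption of the mock; the real change lives in reconnectETH() and talks to the W5100S over SPI.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Mac = std::array<uint8_t, 6>;

// Hypothetical stand-in for the Ethernet chip: only the MAC register matters here.
struct FakeEthChip {
    Mac macRegister{};                       // assumed default: all zeros
    void brownout() { macRegister = Mac{}; } // registers silently revert to defaults
    void init(const Mac &mac) { macRegister = mac; }
};

// True when the chip no longer holds the MAC we programmed, which is taken
// as the signal that it was reset while the MCU kept running.
bool chipWasReset(const FakeEthChip &chip, const Mac &expected)
{
    return chip.macRegister != expected;
}

// Sketch of the recovery path: re-program the chip and clear the
// "services started" flag so the API server, NTP and syslog start again.
bool checkAndRecover(FakeEthChip &chip, const Mac &expected, bool &ethStartupComplete)
{
    if (!chipWasReset(chip, expected))
        return false;
    chip.init(expected);        // stands in for hardware reset + Ethernet re-init
    ethStartupComplete = false; // forces service re-initialization
    return true;
}
```

Calling `checkAndRecover` periodically is cheap when nothing is wrong, since the healthy path is a single register comparison.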

Hardware context

W5100S has a strict limit of 4 hardware sockets
W5100S lacks hardware TCP keepalive
PoE-powered devices are susceptible to voltage transients that can brownout the W5100S while the MCU continues running
Symptom observable on the network switch: multiple random MAC addresses appearing on the port where only one device is connected
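
To make the four-socket limit concrete, here is a minimal model of a fixed socket pool. `SocketPool` and `kHardwareSockets` are hypothetical names for this sketch and do not correspond to the Ethernet library's internals; the point is only that a slot held by a dead connection stays unavailable until something releases it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of a chip with a fixed number of hardware sockets.
constexpr std::size_t kHardwareSockets = 4; // W5100S limit cited above

struct SocketPool {
    std::array<bool, kHardwareSockets> inUse{};

    // Returns the index of a free socket, or -1 when all are taken.
    int acquire()
    {
        for (std::size_t i = 0; i < inUse.size(); ++i) {
            if (!inUse[i]) {
                inUse[i] = true;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Frees a slot, as a timeout or explicit close would.
    void release(int index) { inUse[static_cast<std::size_t>(index)] = false; }
};
```

Once four half-open connections hold every slot, `acquire()` keeps returning -1 until one of them is released.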

Fixes #6970

Test plan

Build rak4631_eth_gw target successfully
Deploy to RAK4631 + RAK13800 device with PoE
Verify API connectivity persists beyond the previous failure window (typically a few minutes)
Verify radio functionality is unaffected
Monitor switch port for MAC address stability (should see only one consistent MAC)
Verify DHCP lease renewal works after recovery
Test with static IP configuration
Verify NTP re-synchronization after recovery
Confirm no regressions on ESP32 and RP2040 Ethernet targets

PoE power instability can brownout the W5100S while the nRF52 MCU keeps running, causing all chip registers (MAC, IP, sockets) to revert to defaults. The firmware had no mechanism to detect or recover from this.

Changes:
- Detect W5100S chip reset by periodically verifying MAC address register in reconnectETH(); on mismatch, perform full hardware reset and re-initialize Ethernet interface and services
- Add deInitApiServer() for clean API server teardown during recovery
- Add ~APIServerPort destructor to prevent memory leaks
- Switch nRF52 from EthernetServer::available() to accept() to prevent the same connected client from being repeatedly re-reported
- Add proactive dead-connection cleanup in APIServerPort::runOnce()
- Add 15-minute TCP idle timeout to close half-open connections that consume limited W5100S hardware sockets

Fixes meshtastic#6970

Made-with: Cursor

Xaositek · 2026-02-26T13:36:09Z

@PhilipLykov You think this would resolve this issue? #8462

Copilot

Pull request overview

This PR fixes a critical issue where RAK4631 Ethernet gateway devices lose API connectivity after a few minutes due to W5100S chip brownouts caused by PoE power instability. The fix implements hardware reset detection, proper recovery mechanisms, and several improvements to TCP connection management to prevent socket exhaustion.

Changes:

Added W5100S chip reset detection by monitoring MAC address registers and full re-initialization on mismatch
Implemented proper API server teardown (deInitApiServer()) and destructor to prevent memory leaks during recovery
Fixed nRF52 connection handling to use accept() instead of available() and added proactive cleanup of dead connections
Added 15-minute TCP idle timeout to prevent half-open connections from exhausting the W5100S's 4 hardware sockets
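
The difference between the two polling styles mentioned above can be sketched with a toy server. `FakeServer`, `availableStyle`, and `acceptStyle` are hypothetical and only model the behavior described in the PR; they are not the Arduino Ethernet API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a listening server; not the Arduino Ethernet API.
struct FakeServer {
    std::vector<int> connected; // ids of clients the server still tracks

    // available()-style: reports a tracked client without giving it up,
    // so the same id comes back on every poll. 0 means "none".
    int availableStyle() const { return connected.empty() ? 0 : connected.front(); }

    // accept()-style: hands each new client to the caller exactly once.
    int acceptStyle()
    {
        if (connected.empty())
            return 0;
        int id = connected.front();
        connected.erase(connected.begin());
        return id;
    }
};
```

With the accept()-style call, the caller owns the client after the first poll and the server loop no longer sees it as a new connection.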

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

File	Description
src/mesh/eth/ethClient.cpp	Implements W5100S reset detection via MAC verification and full hardware re-initialization
src/mesh/api/ethServerAPI.h	Adds `deInitApiServer()` declaration for clean API server teardown
src/mesh/api/ethServerAPI.cpp	Implements `deInitApiServer()` to properly destroy the API server
src/mesh/api/ServerAPI.h	Adds destructor to prevent memory leaks and `isClientConnected()` helper method
src/mesh/api/ServerAPI.cpp	Implements TCP idle timeout, proactive dead-connection cleanup, and switches nRF52 to `accept()`
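
The destructor change listed for ServerAPI.h follows a common ownership pattern, sketched below. `FakeApi` and `FakeServerPort` are hypothetical stand-ins; the member name `openAPI` is taken from the PR description, but the real class layout is not shown in this thread and may differ.

```cpp
#include <cassert>

// Hypothetical API object that counts how many instances are alive.
struct FakeApi {
    static int liveInstances;
    FakeApi() { ++liveInstances; }
    ~FakeApi() { --liveInstances; }
};
int FakeApi::liveInstances = 0;

// Hypothetical server port that owns at most one API instance.
struct FakeServerPort {
    FakeApi *openAPI = nullptr;

    FakeServerPort() = default;
    // Copying would lead to a double delete, so it is disabled.
    FakeServerPort(const FakeServerPort &) = delete;
    FakeServerPort &operator=(const FakeServerPort &) = delete;

    // Replaces any previous instance with a fresh one.
    void onClientConnected()
    {
        delete openAPI;
        openAPI = new FakeApi();
    }

    // Without this destructor, tearing down the port would leak openAPI.
    ~FakeServerPort() { delete openAPI; }
};
```

Deleting a null pointer is a no-op in C++, so the destructor is safe even if no client ever connected.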

PhilipLykov · 2026-02-26T13:51:31Z

> @PhilipLykov You think this would resolve this issue? #8462
That issue mixes two problems: the -7 error in the radio and the Ethernet failure. I believe my PR will fix the Ethernet problem described there.

PhilipLykov · 2026-02-26T13:51:45Z

Re: #8462 - This PR does not fix that issue. #8462 is a different bug on the radio side (RadioLib error -7 = RADIOLIB_ERR_CRC_MISMATCH), where the SX1262 LoRa radio stops decoding packets correctly after a few days.

The two chips live on completely separate SPI buses on the RAK4631:

SX1262 (LoRa radio): SPI main bus (pins 42-45)
W5100S (Ethernet): SPI1 secondary bus (pins 3, 29, 30, 26)

This PR fixes W5100S Ethernet chip brownout recovery (API/TCP connectivity loss). The radio CRC issue in #8462 would need its own detection and recovery - likely periodic verification of SX1262 configuration registers and re-initialization if corrupted.

However, if both issues are triggered by the same PoE voltage transients, the root cause is the same (power instability), just affecting different chips independently.

robekl · 2026-02-26T17:09:54Z

src/mesh/api/ServerAPI.cpp

 {
    if (client.connected()) {
+        if (lastContactMsec > 0 && !Throttle::isWithinTimespanMs(lastContactMsec, TCP_IDLE_TIMEOUT_MS)) {
+            LOG_WARN("TCP connection timeout, no data for %lu ms", (unsigned long)(millis() - lastContactMsec));


Should lastContactMsec also be updated in cases other than inbound API data?

From what I can see:

For Ethernet TCP API, lastContactMsec is not updated by outbound traffic, socket-level ACKs, or link-level activity.

It advances only on inbound API protobuf frames from client.

So the new 15-minute timeout effectively enforces: client must send periodic API traffic (typically heartbeat) to stay connected.
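
The behavior described in this comment can be modeled in a few lines. This is a sketch under assumptions: `ConnectionState` and its methods are hypothetical, and the plain subtraction stands in for the firmware's `Throttle::isWithinTimespanMs` helper, whose exact implementation is not shown here. Only inbound frames move `lastContactMsec`.

```cpp
#include <cassert>
#include <cstdint>

// 15 minutes, per the PR description.
constexpr uint32_t TCP_IDLE_TIMEOUT_MS = 15UL * 60UL * 1000UL;

// Hypothetical per-connection bookkeeping.
struct ConnectionState {
    uint32_t lastContactMsec = 0; // 0 means "no inbound frame seen yet"

    // Inbound API frames refresh the timestamp.
    void onInboundFrame(uint32_t now) { lastContactMsec = now; }

    // Outbound traffic deliberately leaves the timestamp untouched.
    void onOutboundFrame(uint32_t /*now*/) {}

    // True once the client has been silent for the full timeout.
    bool isIdleTimedOut(uint32_t now) const
    {
        return lastContactMsec > 0 && (now - lastContactMsec) >= TCP_IDLE_TIMEOUT_MS;
    }
};
```

Because the check uses unsigned subtraction, it also behaves sensibly when a millisecond counter wraps around.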

PhilipLykov · 2026-02-26T18:45:25Z

@robekl Good observation — this is intentional and consistent with how the rest of the Meshtastic codebase handles connection timeouts.

lastContactMsec is only updated on inbound data by design. Both SerialConsole and SerialModule use the exact same pattern — a 15-minute timeout based solely on inbound lastContactMsec (see SERIAL_CONNECTION_TIMEOUT in SerialConsole.cpp:27 and SerialModule.cpp:61). Our TCP timeout mirrors this existing convention.

Why not update on outbound traffic? The purpose of this timeout is to detect dead/abandoned connections (crashed clients leaving half-open TCP sockets). If we updated lastContactMsec on outbound data, a node that keeps forwarding mesh packets to a dead client would keep the connection alive forever — exactly the socket leak we need to prevent. The W5100S has only 4 hardware sockets; we cannot afford that.

The Meshtastic protocol already accounts for this via the heartbeat message (meshtastic_ToRadio_heartbeat_tag). Clients are expected to send periodic heartbeats, and each heartbeat flows through handleToRadio() which updates lastContactMsec. The official Meshtastic app sends these regularly, so compliant clients will never hit the 15-minute window.
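
From the client side, staying connected only requires that the gap between heartbeats stays below the server timeout. The sketch below assumes a hypothetical `HeartbeatTimer` and a made-up 5-minute interval in the usage note; the actual interval used by the official apps is not stated in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical client-side helper: decides when the next heartbeat is due.
struct HeartbeatTimer {
    uint32_t intervalMs;
    uint32_t lastSentMs = 0;

    bool due(uint32_t now) const { return (now - lastSentMs) >= intervalMs; }
    void markSent(uint32_t now) { lastSentMs = now; }
};

// Longest silence the server would observe if the client polls due()
// every stepMs for totalMs and sends a heartbeat whenever one is due.
uint32_t longestGapMs(HeartbeatTimer timer, uint32_t stepMs, uint32_t totalMs)
{
    uint32_t longest = 0;
    uint32_t lastHeard = 0;
    for (uint32_t now = 0; now <= totalMs; now += stepMs) {
        if (timer.due(now)) {
            timer.markSent(now);
            if (now - lastHeard > longest)
                longest = now - lastHeard;
            lastHeard = now;
        }
    }
    return longest;
}
```

For example, `longestGapMs(HeartbeatTimer{300000}, 1000, 3600000)` simulates an hour of 5-minute heartbeats, and the longest gap stays well under the 15-minute (900000 ms) window.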

github-actions bot added the bugfix Pull request that fixes bugs label Feb 26, 2026

Xaositek assigned PhilipLykov Feb 26, 2026

Merge branch 'develop' into fix/rak4631-eth-api-connection-loss

38b3117

thebentern requested a review from Copilot February 26, 2026 13:39

Copilot AI reviewed Feb 26, 2026

src/mesh/api/ServerAPI.h (outdated)

src/mesh/api/ServerAPI.cpp (outdated)

Copilot started reviewing on behalf of thebentern February 26, 2026 13:45 View session

PhilipLykov added 2 commits February 26, 2026 15:49

Log actual elapsed idle time instead of constant timeout value

477276c

Address Copilot review comment: log millis() - lastContactMsec to show the real time since last client activity, rather than always logging the TCP_IDLE_TIMEOUT_MS constant. Made-with: Cursor

Merge branch 'fix/rak4631-eth-api-connection-loss' of https://github.…

ceeed72

…com/PhilipLykov/Meshtastic_firmware into fix/rak4631-eth-api-connection-loss

robekl reviewed Feb 26, 2026

thebentern and others added 2 commits February 27, 2026 05:34

Update src/mesh/api/ServerAPI.h

1a7439d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'develop' into fix/rak4631-eth-api-connection-loss

f751adc

PhilipLykov wants to merge 6 commits into meshtastic:develop from PhilipLykov:fix/rak4631-eth-api-connection-loss

5 participants
