🤔[question] Can not connect to master node #9201
Comments
hello, sorry to hear this.
Did they work fine after the restart? We generally suggest configuring an automated restart (e.g. through systemd units), because there are some networking errors the agent process cannot recover from by reconnecting; in those cases it has to quit and restart.
Restarting the agents didn't work, but after restarting the master node, the connection is now stable. By the way, if I use systemd, how can it detect the state of the nodes (both master and agent) to trigger a restart?
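On the systemd question: as far as I know, systemd's `Restart=` setting (for example `Restart=on-failure`) only reacts to the service process exiting; it does not inspect the node's connection state. An agent that quits after passing the reconnect period would be restarted that way, but a process that stays alive while disconnected would need an external probe. Below is a minimal sketch of such a probe in Python. The function names `is_reachable` and `probe_exit_code` are hypothetical, and the host and port to check are assumptions you would replace with your own master address (the logs in this issue show the master side as 172.19.0.3:8080).

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_exit_code(host: str, port: int, timeout: float = 5.0) -> int:
    """Map reachability to a process exit code: 0 = reachable, 1 = unreachable."""
    return 0 if is_reachable(host, port, timeout) else 1
```

A wrapper script could call `sys.exit(probe_exit_code(host, port))` on a timer and restart the service when the exit code is non-zero. Note that this only checks that the TCP port accepts connections, not that the WebSocket session between agent and master is healthy.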
We will try to reproduce this and investigate internally.
We can't reproduce it. Can you please provide more master logs, covering what happened between the connectivity issues and when you manually restarted it?
I forgot to save the logs before restarting. Here is the log after the restart:
Unfortunately, the logs from before the restart are specifically what we need to see. If this issue happens again, please retain these logs and share them with us.
Issue Description:
Following a system instability around 20:29 on April 18th, 2024, several agents experienced WebSocket failures that led to multiple unsuccessful reconnection attempts and eventual crashes. This issue persisted across multiple agents (IDs ranging from server2 to server18), with each displaying similar patterns of i/o timeout errors and write: broken pipe messages.
Key Points:
At 20:29, a large number of WebSocket reconnection attempts failed almost simultaneously across various agents, showing errors such as read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.X:XXXXX: i/o timeout.
Each failed attempt was followed by the system attempting to drain and then eventually remove the agent, while also noting that the agent is "past reconnect period, it must restart".
By 02:49 on April 19th, the logs still showed errors such as websocket handler error: error while reading initial startup message: websocket: close 1001 (going away), indicating ongoing connectivity or configuration issues several hours after the initial incident.
The agents seem unable to maintain stable connections post-crash, with repeated failures to upgrade connections and repeated logs of agents needing to restart due to passing the reconnect period.
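To make it easier to retain and summarize these patterns the next time this happens, here is a rough Python sketch that counts the error strings quoted above in a saved log. The function name `count_error_patterns` and the pattern labels are hypothetical; the matched substrings are copied from the messages in this issue, and the exact log format is an assumption.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the messages quoted in this issue; the labels are made up.
PATTERNS = {
    "io_timeout": "i/o timeout",
    "broken_pipe": "write: broken pipe",
    "past_reconnect": "past reconnect period, it must restart",
    "close_1001": "websocket: close 1001 (going away)",
}


def count_error_patterns(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known error substring."""
    counts: Counter = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts
```

For example, passing an open log file (`count_error_patterns(open(path))`, where `path` is wherever you saved the master log) gives a per-pattern count that can be compared across the 20:29 and 02:49 windows.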
master log:
one of the agent logs:
Manual agent restart (master reaction):
Manual agent restart (agent reaction):
Checklist