[Chassis] reboot-cause is not as expected when reboot through sup (normal & abnormal) #118

Open
Javier-Tan opened this issue Jan 22, 2025 · 7 comments

Comments

@Javier-Tan

Javier-Tan commented Jan 22, 2025

Hi team,

When we reboot from the supervisor (both normally and abnormally), we aren't getting the expected reboot cause in show reboot-cause / show reboot-cause history.

Normal reboot through sup:

Example seen on 7800 LC:
2025_01_22_00_19_24  reboot                                              Wed Jan 22 12:15:30 AM UTC 2025                N/A
Example of expected on LC:
2025_01_22_00_46_01  reboot from Supervisor                   Wed 22 Jan 2025 12:43:35 AM UTC  admin    N/A

Abnormal reboot through sup: (memory exhaustion from nohup bash -c "sleep 5 && tail /dev/zero" &)

Example seen on 7800 LC:
2025_01_22_00_33_07  Hardware - Other (gpi-2, description: gpi 2 detailed fault, time: 2025-01-22 00:31:44)                  N/A                              N/A     Unknown
Example of expected on LC:
2025_01_22_00_38_46  Heartbeat with the Supervisor card lost  Wed 22 Jan 2025 12:36:31 AM UTC  Unknown  User issued 'Heartbeat with the Supervisor card lost' command [User: Unknown, Time: Wed 22 Jan 2025 12:36:31 AM UTC]
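
To compare rows like the ones above in a script, a minimal sketch follows that pulls the name and cause columns out of a history row. It assumes columns are separated by runs of two or more spaces, which holds for the rows pasted here but is not a documented format. `parse_history_row` is a hypothetical helper, not part of SONiC.

```python
import re


def parse_history_row(row):
    """Return (name, cause) from one 'show reboot-cause history' row.

    Assumption: columns are separated by two or more spaces, and the
    cause text itself never contains a run of two spaces.
    """
    fields = re.split(r"\s{2,}", row.strip(), maxsplit=2)
    if len(fields) < 2:
        raise ValueError("row has fewer than two columns: %r" % row)
    return fields[0], fields[1]


# Rows copied from this issue (normal reboot through the supervisor).
SEEN = "2025_01_22_00_19_24  reboot                                              Wed Jan 22 12:15:30 AM UTC 2025                N/A"
EXPECTED = "2025_01_22_00_46_01  reboot from Supervisor                   Wed 22 Jan 2025 12:43:35 AM UTC  admin    N/A"

seen_cause = parse_history_row(SEEN)[1]
expected_cause = parse_history_row(EXPECTED)[1]
```

Only the first two columns are extracted because the later columns (time, user, comment) can be blank in the rows shown, which makes whitespace splitting unreliable for them.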
@Javier-Tan
Author

Javier-Tan commented Jan 22, 2025

@arista-nwolfe, @kenneth-arista, @arlakshm for vis

@arista-nwolfe

@Javier-Tan is there a sonic-mgmt test which tests this that is failing due to this difference in reboot reason?

@Javier-Tan
Author

@arista-nwolfe These are some of the tests we're seeing fail:

bgp/test_startup_tsa_tsb_service.py::test_tsa_tsb_service_with_supervisor_cold_reboot Failed: Reboot cause cold did not match the trigger Reboot from Supervisor
bgp/test_startup_tsa_tsb_service.py::test_tsa_tsb_service_with_supervisor_abnormal_reboot Failed: Reboot cause Unknown did not match the trigger Heartbeat with the Supervisor card lost
bgp/test_startup_tsa_tsb_service.py::test_user_init_tsb_on_sup_while_service_run_on_dut Failed: Reboot cause cold did not match the trigger Reboot from Supervisor
bgp/test_startup_tsa_tsb_service.py::test_tsa_tsb_service_with_tsa_on_sup Failed: Reboot cause cold did not match the trigger Reboot from Supervisor
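
The failures above boil down to the observed cause text not matching what the test expects for its trigger. A rough sketch of that kind of check is below. The pattern table and `cause_matches` are hypothetical and written only for illustration; they are not copied from sonic-mgmt, whose actual matching logic may differ.

```python
import re

# Hypothetical trigger -> pattern table, for illustration only.
EXPECTED_CAUSE = {
    "Reboot from Supervisor": re.compile(
        r"reboot from supervisor", re.IGNORECASE),
    "Heartbeat with the Supervisor card lost": re.compile(
        r"heartbeat with the supervisor card lost", re.IGNORECASE),
}


def cause_matches(trigger, observed_cause):
    """Return True if the observed cause text matches the trigger's pattern."""
    return bool(EXPECTED_CAUSE[trigger].search(observed_cause))


# Causes reported on the 7800 LC in this issue.
normal_seen = "reboot"
abnormal_seen = ("Hardware - Other (gpi-2, description: gpi 2 detailed fault, "
                 "time: 2025-01-22 00:31:44)")

normal_ok = cause_matches("Reboot from Supervisor", normal_seen)
abnormal_ok = cause_matches("Heartbeat with the Supervisor card lost", abnormal_seen)
```

Under these assumed patterns, both causes seen on the 7800 LC fail to match, which is consistent with the four test failures listed.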

@arista-nwolfe

Looking at the driver code, the likely reason the reboot cause shows up as reboot is that our Arista-specific reboot script first reboots the LCs (sonic/arista/utils/sonic_reboot.py, line 22 in 526fcf6):

def powerOffCards(platform):

Under the hood, this appears to end up running the reboot command on the LCs:
https://github.com/aristanetworks/sonic/blob/master/arista/utils/rpc/api.py#L91

If we hadn't implemented our own platform-specific reboot handler, the code would have defaulted to running /sbin/reboot, and you would likely have seen the heartbeat timeout instead.
https://github.com/sonic-net/sonic-utilities/blob/be870a6e70376a4dd37f37d6c89c8d2f78ece079/scripts/reboot#L290

Tagging @Staphylo in case he knows the history on why we try to be more graceful about the supervisor reboot instead of using the sonic default reboot.
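
The fallback described above can be modeled roughly as follows. This is only a sketch of the decision, assuming the wrapper checks for an executable platform handler and otherwise runs /sbin/reboot. The real sonic-utilities script is a shell script, and `choose_reboot_command` and the handler path in the usage lines are hypothetical.

```python
import os

DEFAULT_REBOOT = "/sbin/reboot"


def choose_reboot_command(platform_handler, is_executable=os.access):
    """Pick the command a reboot wrapper would run.

    platform_handler: path to a platform-specific reboot handler, or None.
    is_executable: injectable check so the logic can be exercised without
    touching the filesystem (defaults to os.access).
    """
    if platform_handler and is_executable(platform_handler, os.X_OK):
        # Platform handler present: on Arista chassis this path reboots
        # the LCs gracefully first, so they record a plain "reboot".
        return [platform_handler]
    # No handler: plain reboot of the supervisor; the LCs would then
    # likely notice the lost heartbeat and record that as the cause.
    return [DEFAULT_REBOOT]


# Hypothetical handler path, used only to exercise both branches.
with_handler = choose_reboot_command("/hypothetical/platform_reboot",
                                     is_executable=lambda path, mode: True)
without_handler = choose_reboot_command(None)
```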

@Javier-Tan
Author

> Looking at the driver code, the likely reason the reboot cause shows up as reboot is that our Arista-specific reboot script first reboots the LCs:
>
> sonic/arista/utils/sonic_reboot.py, line 22 in 526fcf6
>
> def powerOffCards(platform):
>
> Under the hood, this appears to end up running the reboot command on the LCs:
> https://github.com/aristanetworks/sonic/blob/master/arista/utils/rpc/api.py#L91
>
> If we hadn't implemented our own platform-specific reboot handler, the code would have defaulted to running /sbin/reboot, and you would likely have seen the heartbeat timeout instead.
> https://github.com/sonic-net/sonic-utilities/blob/be870a6e70376a4dd37f37d6c89c8d2f78ece079/scripts/reboot#L290
>
> Tagging @Staphylo in case he knows the history on why we try to be more graceful about the supervisor reboot instead of using the sonic default reboot.

Thanks, this makes sense. It would be good if we could align the reboot cause with typical SONiC behaviour for testing and standardization processes, if possible.

@Javier-Tan
Author

@arista-nwolfe does Arista have a workaround for the reboot check failures?

@patrickmacarthur
Contributor

> Tagging @Staphylo in case he knows the history on why we try to be more graceful about the supervisor reboot instead of using the sonic default reboot.

We do a graceful reboot on the linecards because we were having issues with ext4 filesystem corruption on the linecards, which we believe was caused by ungraceful reboots.

Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants