File: docs/configuration/syslog-health-monitor.md (2 additions, 2 deletions)
@@ -2,7 +2,7 @@
 ## Overview

-The Syslog Health Monitor module watches system logs for GPU errors (XID/SXID) and GPU-fallen-off events by reading journald logs. This document covers all Helm configuration options for system administrators.
+The Syslog Health Monitor module watches system logs for GPU errors (XID/SXID), GPU-fallen-off, and GPU reset events by reading journald logs. This document covers all Helm configuration options for system administrators.

 ## Configuration Reference
@@ -55,7 +55,7 @@ syslog-health-monitor:
 ### Check Types

 #### SysLogsXIDError
-Monitors for XID (GPU error) messages in system logs. XIDs are NVIDIA GPU error codes that indicate hardware or software issues.
+Monitors for XID (GPU error) and GPU reset messages in system logs. XIDs are NVIDIA GPU error codes that indicate hardware or software issues.

 #### SysLogsSXIDError
 Monitors for SXID messages specific to NVSwitch errors in multi-GPU configurations.
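
To make the two check types above more concrete, here is a minimal Python sketch of how XID and SXID lines might be told apart in kernel log text. It is not the monitor's actual implementation: the regular expressions, the `classify` helper, and the sample log lines are all assumptions for illustration, modelled on the general shape of NVIDIA driver messages (`NVRM: Xid (...)` and `SXid (...)`).

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for illustration only; not taken from the monitor's source.
# The "PCI:" prefix is treated as optional because message formats vary.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")
SXID_RE = re.compile(r"SXid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")


def classify(line: str) -> Optional[Tuple[str, str, int]]:
    """Return (kind, pci_address, code) for an XID/SXID line, else None."""
    for kind, pattern in (("XID", XID_RE), ("SXID", SXID_RE)):
        match = pattern.search(line)
        if match:
            return kind, match.group(1), int(match.group(2))
    return None


# Illustrative sample lines (assumed, not captured from a real system).
xid_line = "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
sxid_line = "nvidia-nvswitch0: SXid (PCI:0000:c1:00.0): 20034, Fatal, Link 30 LTSSM Fault Up"

print(classify(xid_line))
print(classify(sxid_line))
print(classify("usb 1-1: new high-speed USB device"))
```

In practice the input lines would come from journald (for example, kernel messages read with `journalctl -k`), and a real monitor would also need to track its read position so that entries are not processed twice.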
File: docs/syslog-health-monitor.md (10 additions, 6 deletions)
@@ -2,26 +2,27 @@
 ## Overview

-The Syslog Health Monitor watches system logs for GPU-related errors that may not be caught by DCGM. It monitors journald/syslog for XID errors, SXID errors (NVSwitch/NVLink errors), and GPU fallen-off-bus events - critical failures that indicate serious GPU, NVSwitch, or driver problems.
+The Syslog Health Monitor watches system logs for GPU-related errors that may not be caught by DCGM. It monitors journald/syslog for XID errors, SXID errors (NVSwitch/NVLink errors), and GPU fallen-off-bus events - critical failures that indicate serious GPU, NVSwitch, or driver problems. In addition to failures, it monitors system logs for other GPU-related events such as GPU resets to indicate that a required remediation action has completed.

 Think of it as a log analyzer that reads between the lines - catching GPU and NVSwitch problems recorded in system logs that other monitoring might miss.

 ### Why Do You Need This?

-Some GPU and NVSwitch failures manifest in system logs before DCGM can detect them:
+Some GPU and NVSwitch failures or events manifest in system logs before DCGM can detect them:

 - **XID errors**: GPU hardware errors logged by the NVIDIA driver
 - **SXID errors**: NVSwitch errors related to NVSwitch and NVLink interconnects
 - **GPU fallen off the bus**: GPU became inaccessible to the system
+- **GPU Reset**: A GPU reset was executed by nvidia-smi

-These errors often appear in system logs first and can indicate imminent GPU or fabric failure, making early detection critical for preventing workload disruptions.
+These errors or events often appear in system logs first and can indicate imminent GPU or fabric failure, making early detection critical for preventing workload disruptions or returning GPUs to service.

 ## How It Works

 The Syslog Health Monitor runs as a DaemonSet on GPU nodes: