Power supply status doesn't properly detect an uninstalled power supply, and plugin doesn't see them in SEL either #66

Open
thegushi opened this issue Dec 9, 2024 · 1 comment

thegushi commented Dec 9, 2024

We have a Supermicro system with two power supplies, and we see two distinct issues, one of which can possibly be blamed on Supermicro. To test this plugin, we had our datacenter staff run two tests on one of our servers: physically pulling a PSU out of the chassis, and leaving the PSU inserted but removing its AC plug.

  1. When a power supply is removed (pulled out of the system), it is shown as a not-available sensor, and the Supermicro web UI doesn't treat this as an error. There's no option in this plugin, however, to say "two power supplies must read as ok".

Here's what I see in the output for that:

  output of FreeIPMI:
ID   | Name            | Type              | State    | Reading    | Units | Lower NR   | Lower C    | Lower NC   | Upper NC   | Upper C    | Upper NR   | Event
4    | CPU1 Temp       | Temperature       | Nominal  | 40.00      | C     | 0.00       | 0.00       | 0.00       | 79.00      | 82.00      | 84.00      | 'OK'
71   | CPU2 Temp       | Temperature       | Nominal  | 40.00      | C     | 0.00       | 0.00       | 0.00       | 79.00      | 82.00      | 84.00      | 'OK'
138  | System Temp     | Temperature       | Nominal  | 34.00      | C     | -9.00      | -7.00      | -5.00      | 80.00      | 85.00      | 90.00      | 'OK'
205  | Peripheral Temp | Temperature       | Nominal  | 41.00      | C     | -9.00      | -7.00      | -5.00      | 80.00      | 85.00      | 90.00      | 'OK'
272  | PCH Temp        | Temperature       | Nominal  | 48.00      | C     | -11.00     | -8.00      | -5.00      | 90.00      | 95.00      | 100.00     | 'OK'
339  | FAN1            | Fan               | Nominal  | 3800.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
406  | FAN2            | Fan               | Nominal  | 3600.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
473  | FAN3            | Fan               | Nominal  | 3800.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
607  | FAN5            | Fan               | Nominal  | 3800.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
674  | FAN6            | Fan               | Nominal  | 3600.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
875  | VTT             | Voltage           | Nominal  | 0.99       | V     | 0.82       | 0.86       | 0.91       | 1.34       | 1.39       | 1.44       | 'OK'
942  | +1.1 V          | Voltage           | Nominal  | 1.09       | V     | 0.88       | 0.93       | 0.98       | 1.22       | 1.26       | 1.31       | 'OK'
1009 | +1.5 V          | Voltage           | Nominal  | 1.49       | V     | 1.25       | 1.30       | 1.34       | 1.65       | 1.70       | 1.74       | 'OK'
1076 | CPU1 Vcore      | Voltage           | Nominal  | 0.80       | V     | 0.48       | 0.51       | 0.54       | 1.49       | 1.52       | 1.55       | 'OK'
1143 | CPU2 Vcore      | Voltage           | Nominal  | 0.80       | V     | 0.48       | 0.51       | 0.54       | 1.49       | 1.52       | 1.55       | 'OK'
1210 | VDIMM AB        | Voltage           | Nominal  | 1.49       | V     | 1.10       | 1.15       | 1.20       | 1.65       | 1.70       | 1.74       | 'OK'
1277 | VDIMM CD        | Voltage           | Nominal  | 1.50       | V     | 1.10       | 1.15       | 1.20       | 1.65       | 1.70       | 1.74       | 'OK'
1344 | VDIMM EF        | Voltage           | Nominal  | 1.50       | V     | 1.10       | 1.15       | 1.20       | 1.65       | 1.70       | 1.74       | 'OK'
1411 | VDIMM GH        | Voltage           | Nominal  | 1.50       | V     | 1.10       | 1.15       | 1.20       | 1.65       | 1.70       | 1.74       | 'OK'
1478 | 3.3V            | Voltage           | Nominal  | 3.36       | V     | 2.64       | 2.78       | 2.93       | 3.65       | 3.79       | 3.94       | 'OK'
1545 | +3.3VSB         | Voltage           | Nominal  | 3.36       | V     | 2.64       | 2.78       | 2.93       | 3.65       | 3.79       | 3.94       | 'OK'
1612 | 5V              | Voltage           | Nominal  | 5.06       | V     | 4.10       | 4.29       | 4.48       | 5.50       | 5.70       | 6.91       | 'OK'
1679 | +5VSB           | Voltage           | Nominal  | 4.99       | V     | 4.10       | 4.29       | 4.48       | 5.50       | 5.70       | 6.91       | 'OK'
1746 | 12V             | Voltage           | Nominal  | 12.08      | V     | 10.18      | 10.49      | 10.81      | 13.25      | 13.57      | 13.89      | 'OK'
1813 | VBAT            | Voltage           | Nominal  | 3.22       | V     | 2.40       | 2.54       | 2.69       | 3.31       | 3.46       | 3.60       | 'OK'
1880 | Chassis Intru   | Physical Security | Nominal  | N/A        | N/A   | N/A        | N/A        | N/A        | N/A        | N/A        | N/A        | 'General Chassis Intrusion'
1947 | PS1 Status      | Power Supply      | Nominal  | N/A        | N/A   | N/A        | N/A        | N/A        | N/A        | N/A        | N/A        | 'Presence detected'
2483 | PS2 Status      | Power Supply      | Nominal  | N/A        | N/A   | N/A        | N/A        | N/A        | N/A        | N/A        | N/A        | 'OK'

--------------------- end of debug output ---------------------

However, when the power supply is present but has no input voltage, it shows like this:

  output of FreeIPMI:
ID   | Name            | Type              | State    | Reading    | Units | Lower NR   | Lower C    | Lower NC   | Upper NC   | Upper C    | Upper NR   | Event
4    | CPU1 Temp       | Temperature       | Nominal  | 41.00      | C     | 0.00       | 0.00       | 0.00       | 79.00      | 82.00      | 84.00      | 'OK'
71   | CPU2 Temp       | Temperature       | Nominal  | 39.00      | C     | 0.00       | 0.00       | 0.00       | 79.00      | 82.00      | 84.00      | 'OK'
138  | System Temp     | Temperature       | Nominal  | 35.00      | C     | -9.00      | -7.00      | -5.00      | 80.00      | 85.00      | 90.00      | 'OK'
205  | Peripheral Temp | Temperature       | Nominal  | 41.00      | C     | -9.00      | -7.00      | -5.00      | 80.00      | 85.00      | 90.00      | 'OK'
272  | PCH Temp        | Temperature       | Nominal  | 49.00      | C     | -11.00     | -8.00      | -5.00      | 90.00      | 95.00      | 100.00     | 'OK'
339  | FAN1            | Fan               | Nominal  | 3800.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
406  | FAN2            | Fan               | Nominal  | 3600.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
473  | FAN3            | Fan               | Nominal  | 3800.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
607  | FAN5            | Fan               | Nominal  | 3800.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
674  | FAN6            | Fan               | Nominal  | 3600.00    | RPM   | 300.00     | 500.00     | 700.00     | 25500.00   | 25500.00   | 25500.00   | 'OK'
875  | VTT             | Voltage           | Nominal  | 0.99       | V     | 0.82       | 0.86       | 0.91       | 1.34       | 1.39       | 1.44       | 'OK'
942  | +1.1 V          | Voltage           | Nominal  | 1.09       | V     | 0.88       | 0.93       | 0.98       | 1.22       | 1.26       | 1.31       | 'OK'
1009 | +1.5 V          | Voltage           | Nominal  | 1.49       | V     | 1.25       | 1.30       | 1.34       | 1.65       | 1.70       | 1.74       | 'OK'
1076 | CPU1 Vcore      | Voltage           | Nominal  | 0.80       | V     | 0.48       | 0.51       | 0.54       | 1.49       | 1.52       | 1.55       | 'OK'
1143 | CPU2 Vcore      | Voltage           | Nominal  | 0.80       | V     | 0.48       | 0.51       | 0.54       | 1.49       | 1.52       | 1.55       | 'OK'
1210 | VDIMM AB        | Voltage           | Nominal  | 1.49       | V     | 1.10       | 1.15       | 1.20       | 1.65       | 1.70       | 1.74       | 'OK'
1277 | VDIMM CD        | Voltage           | Nominal  | 1.50       | V     | 1.10       | 1.15       | 1.20       | 1.65       | 1.70       | 1.74       | 'OK'
1344 | VDIMM EF        | Voltage           | Nominal  | 1.50       | V     | 1.10       | 1.15       | 1.20       | 1.65       | 1.70       | 1.74       | 'OK'
1411 | VDIMM GH        | Voltage           | Nominal  | 1.50       | V     | 1.10       | 1.15       | 1.20       | 1.65       | 1.70       | 1.74       | 'OK'
1478 | 3.3V            | Voltage           | Nominal  | 3.36       | V     | 2.64       | 2.78       | 2.93       | 3.65       | 3.79       | 3.94       | 'OK'
1545 | +3.3VSB         | Voltage           | Nominal  | 3.36       | V     | 2.64       | 2.78       | 2.93       | 3.65       | 3.79       | 3.94       | 'OK'
1612 | 5V              | Voltage           | Nominal  | 5.06       | V     | 4.10       | 4.29       | 4.48       | 5.50       | 5.70       | 6.91       | 'OK'
1679 | +5VSB           | Voltage           | Nominal  | 4.99       | V     | 4.10       | 4.29       | 4.48       | 5.50       | 5.70       | 6.91       | 'OK'
1746 | 12V             | Voltage           | Nominal  | 12.08      | V     | 10.18      | 10.49      | 10.81      | 13.25      | 13.57      | 13.89      | 'OK'
1813 | VBAT            | Voltage           | Nominal  | 3.22       | V     | 2.40       | 2.54       | 2.69       | 3.31       | 3.46       | 3.60       | 'OK'
1880 | Chassis Intru   | Physical Security | Nominal  | N/A        | N/A   | N/A        | N/A        | N/A        | N/A        | N/A        | N/A        | 'General Chassis Intrusion'
1947 | PS1 Status      | Power Supply      | Nominal  | N/A        | N/A   | N/A        | N/A        | N/A        | N/A        | N/A        | N/A        | 'Presence detected'
2483 | PS2 Status      | Power Supply      | Nominal  | N/A        | N/A   | N/A        | N/A        | N/A        | N/A        | N/A        | N/A        | 'Presence detected' 'Power Supply Failure detected' 'Power Supply input lost (AC/DC)'

--------------------- end of debug output ---------------------

This is absolutely not "nominal". The machine is beeping in the data center, the web UI reports a critical error, and I believe the red LED on the front is illuminated. This could be helped by allowing me to specify that "ok" is not a valid state for a given class of device. (In this case, "presence detected" is the good state, as "ok" can mean "not there but that's fine".)
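To make the request concrete, here is a rough sketch of the kind of check I mean, written in Python against the pipe-delimited debug output above. It is not the plugin's code: the function names (`parse_rows`, `unexpected_events`) and the idea of a per-type "required event" are my own illustration, and it only assumes the column layout shown in the two outputs.

```python
import re


def parse_rows(text):
    """Turn pipe-delimited sensor output into a list of dicts keyed by column name."""
    rows, header = [], None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip banner lines such as "output of FreeIPMI:"
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the column header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def unexpected_events(rows, sensor_type, required_event):
    """Return (name, events) for sensors of sensor_type whose event list
    is anything other than exactly [required_event]."""
    problems = []
    for row in rows:
        if row["Type"] != sensor_type:
            continue
        # The Event column holds one or more single-quoted strings.
        events = re.findall(r"'([^']*)'", row["Event"])
        if events != [required_event]:
            problems.append((row["Name"], events))
    return problems
```

With "Presence detected" as the required event for the "Power Supply" type, this would flag PS2 in both outputs above: once for reporting only 'OK' (PSU removed) and once for reporting the failure and input-lost events alongside 'Presence detected' (AC unplugged).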

I suspect that it is reported as "nominal" because there are no thresholds reported by the BMC for this value.

The power-supply problem does show up in the SEL:
ID | Date | Time | Name | Type | State | Event
1 | Nov-25-2024 | 21:47:47 | System Chassis Chassis Intru | Physical Security | Critical | General Chassis Intrusion
2 | Dec-09-2024 | 20:45:38 | System Chassis Chassis Intru | Physical Security | Critical | General Chassis Intrusion
3 | Dec-09-2024 | 21:12:29 | Power Supply 2 PS2 Status | Power Supply | Critical | Power Supply Failure detected

However, the plugin only calls the SEL with --sensor-types=Memory,Processor (hardcoded, not override-able on the command line).

The easiest fix for this second failure mode would be to not hardcode the sensor types, but that would not solve the first case, where a power supply was simply removed (that doesn't even show up in the SEL).
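As a sketch of what "not hardcoded" could look like, assuming the plugin builds an `ipmi-sel` argument list somewhere: the Python below keeps Memory,Processor as the default but accepts an override, and filters parsed SEL rows by type. The function names are hypothetical, the real plugin may pass further options that I omit here, and the exact spelling of the power-supply type name on the command line should be checked against FreeIPMI's own list of sensor types.

```python
DEFAULT_SEL_TYPES = ("Memory", "Processor")  # the currently hardcoded value


def build_sel_args(sensor_types=None):
    """Build an ipmi-sel argument list, falling back to the current default."""
    types = tuple(sensor_types) if sensor_types else DEFAULT_SEL_TYPES
    return ["ipmi-sel", "--sensor-types=" + ",".join(types)]


def parse_sel(text):
    """Turn pipe-delimited SEL output into a list of dicts keyed by column name."""
    rows, header = [], None
    for line in text.splitlines():
        if "|" not in line:
            continue
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the column header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def critical_events(rows, types):
    """Return SEL rows in state Critical whose Type column is in `types`.
    `types` uses the spelling from the SEL output (e.g. "Power Supply")."""
    wanted = set(types)
    return [r for r in rows if r["State"] == "Critical" and r["Type"] in wanted]
```

Applied to the SEL listing above, filtering on "Power Supply" would return entry 3 (Power Supply Failure detected), which the current Memory,Processor selection never sees.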


thegushi commented Dec 9, 2024

As a data point, ipmitool does show raw, discrete values for these, but I have no idea how to make ipmi-sensors use them.
