Alert only on recent SEL entries #28

Open
aieri opened this issue Jun 21, 2019 · 8 comments

Comments

@aieri

aieri commented Jun 21, 2019

I like the current support for alerting on SEL entries, but I find it pushes us towards a suboptimal pattern: since we value such alerts, we are forced to keep the SEL empty. On the other hand, it can be convenient to retain a local history of failures for when you inevitably end up in one of those situations where you're stuck wondering whether you should look into a replacement because "isn't this the server that keeps crashing?".
I'd propose a new config option to specify an age limit beyond which events are ignored. For example, I could say check_ipmi_sensors --selcutoff 24 to indicate that any SEL entry that is older than 1 day will be ignored.

@Tejeev

Tejeev commented Jun 21, 2019

I've been struggling with this same issue and tried documenting in comments in Thruk (but those quickly become unmanageable and sometimes get cleaned up without any other tracking), and cases (which become lost in the fog of caselog).

I think the age limit makes sense for us. Investigate the alert, downtime it if it is innocuous, and we keep the log while the alert becomes active again tomorrow.
That said, I foresee lots of these being downtimed for 24 hours, so we might miss subsequent failures during that time. Is it possible to have a check command that sets the allowed number of SEL entries to the current count, so that only new ones alert? Then we could use that command to effectively acknowledge the alerts once we've checked them, and still retain the history.

@aieri
Author

aieri commented Jun 21, 2019

Though saying "alert if SEL has more than n entries" would generally work, it wouldn't really scale: every server would need a different value and would result in a configuration management nightmare.
If you want to dream big, you could imagine alerting only for events that have not been deasserted. The problem there is that although some events do come in pairs (e.g. voltage too high / ok it's fine now), others do not (e.g. CPU n threw some error), so you'd need to build a lot of knowledge into this check. Doable, but quite a bit of work.

@aieri
Author

aieri commented Jun 22, 2019

Ok, actually... there would be a third way to solve this: imagine a command like check_ipmi_sensors acksel that persisted the latest SEL entry on disk. Subsequent calls to check_ipmi_sensors would alert only if there are entries newer than the cached value.
The downside is that this would make check_ipmi_sensors stateful, whereas nagios plugins are generally stateless. I don't know if the upstream devs would be ok with going down this route.
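
A minimal sketch of what such a stateful acknowledgement could look like as a wrapper, not as part of the plugin itself. Everything here is an assumption for illustration: the function names, the state file, and the input format (SEL lines on stdin with a numeric record ID in the first pipe-separated column).

```shell
# Hypothetical sketch of the "acksel" idea: remember the newest acknowledged
# SEL record ID on disk and report only entries with a higher ID.
# Assumption: SEL lines arrive on stdin as "ID | Date | Time | ..." with a
# numeric record ID in the first pipe-separated column.

# Print the highest numeric record ID seen on stdin (0 if there are none).
latest_sel_id() {
    awk -F'|' '
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ { id = $1 + 0; if (id > max) max = id }
        END { print max + 0 }'
}

# Acknowledge everything currently in the log: store the newest ID in the state file.
ack_sel() {
    local state_file="$1"
    latest_sel_id > "$state_file"
}

# Print only the entries newer than the acknowledged ID.
# Returns 0 if nothing new was found and 2 (Nagios CRITICAL) otherwise.
new_sel_entries() {
    local state_file="$1" acked=0
    [ -r "$state_file" ] && acked=$(cat "$state_file")
    awk -F'|' -v acked="$acked" '
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ && ($1 + 0) > (acked + 0) { print; found = 1 }
        END { exit found ? 2 : 0 }'
}
```

The state file would have to be kept per monitored host, and it would need resetting whenever the SEL itself is cleared, which is exactly the kind of bookkeeping a stateless plugin avoids.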

@tniedermeier
Member

Hello @aieri and @Tejeev,

I'm so sorry for my hugely delayed reply!
Maybe you could give this option from ipmi-sel a try, just add the following parameter to your plugin call:
--seloptions '--date-range=09/01/2019-now'

Using this parameter, the check_ipmi_sensor plugin only displays SEL entries, and only alerts on events, that occurred in the time range from the first of September until today.

If there were no SEL entries in that specified time range, the plugin returns OK.
More options for ipmi-sel: http://manpages.ubuntu.com/manpages/trusty/man8/ipmi-sel.8.html#ipmi-sel%20options

Example:

$ ./check_ipmi_sensor -H -U -P -L admin --seloptions '--date-range=09/10/2019-now'
IPMI Status: OK | 'CPU Temp'=46.00;0.00:95.00;0.00:100.00 [...]

$ ./check_ipmi_sensor -H -U -P -L admin --seloptions '--date-range=09/01/2019-now'
IPMI Status: Critical [1 system event log (SEL) entry present] | 'CPU Temp'=47.00;0.00:95.00;0.00:100.00 [...]
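
The start date in the examples above is fixed, so it would have to be edited by hand as time passes. A minimal sketch of computing a rolling window instead, assuming GNU date (for the relative "N days ago" expression); the helper name sel_date_range is hypothetical, and the MM/DD/YYYY format follows the examples above.

```shell
# Hypothetical helper: build an ipmi-sel --date-range option covering the last N days.
# Assumes GNU date, which understands relative expressions such as "7 days ago".
sel_date_range() {
    local days="${1:-1}"
    local start
    start=$(date -d "${days} days ago" +%m/%d/%Y) || return 1
    printf '%s\n' "--date-range=${start}-now"
}

# Usage (host and credentials omitted here):
#   ./check_ipmi_sensor ... --seloptions "$(sel_date_range 7)"
```

Since the range is expressed in whole dates, the window is only accurate to the day, not to the hour.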

I hope this helps.

Best regards,
Thomas

@afreiberger

I'm wondering if having an "acknowledged sel entries" file that gets passed to --selexclude might be the best way to keep entries in SEL but ignore them.

Obviously, anything wrapped around this check will need a way to populate and clear that file when issues have been mitigated.
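
A rough sketch of the populate and clear side of that idea. The helper names are hypothetical, and the entry text is treated as opaque: whatever line format --selexclude expects is defined by the plugin and is not reproduced here.

```shell
# Hypothetical helpers for maintaining an "acknowledged SEL entries" file.
# The entry is passed through untouched; its format is up to the plugin.

# Append ENTRY to FILE unless an identical line is already present.
ack_entry() {
    local file="$1" entry="$2"
    touch "$file" || return 1
    grep -qxF -- "$entry" "$file" || printf '%s\n' "$entry" >> "$file"
}

# Empty FILE once the underlying issue has been mitigated.
clear_acks() {
    : > "$1"
}
```

Keeping the append idempotent means the same acknowledgement can be replayed by automation without growing the file.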

@Tejeev

Tejeev commented Oct 16, 2020

@tniedermeier I'm afraid I've moved on and no longer use this tooling. I do remember this being a major pain for me when working in operational response so while I can't offer any constructive communication on this, I would like to note that I still strongly +1 the investigation for a resolution to save all those that still use it. The workflows that grew up around this definitely hurt operations and colored my view of this alerting solution.

@BrixSat

BrixSat commented Jan 11, 2023

Hello,

The feature to ignore acknowledged events in the SEL is a major thing. It should be possible to show either all events or only new ones. It's imperative to keep the SEL with all events, so that in the future we can know what was wrong with the machine.

Hope this helps.

@graham-collinson

graham-collinson commented Mar 24, 2023

As a quick solution for now I'm using a wrapper script to only look at log entries from yesterday onwards:

#!/bin/bash
# ipmi-sel expects dates as MM/DD/YYYY
yest=$(date -d yesterday +%m/%d/%Y)
# Pass all arguments through; exec replaces this shell, so the plugin's exit code is returned as-is
exec /usr/local/nagios/libexec/check_ipmi_sensor "$@" --seloptions "--date-range=${yest}-now"

There's a chance that something will hit the log just before midnight and not get picked up.

6 participants