Alert only on recent SEL entries #28
Comments
I've been struggling with this same issue. I tried documenting entries in comments in Thruk (but those quickly become unmanageable and sometimes get cleaned up without any other tracking) and in cases (which become lost in the fog of the caselog). I think the age limit makes sense for us: we investigate the alert and schedule downtime if it is innocuous, and we keep the log while the alert becomes active again tomorrow.
Though saying "alert if SEL has more than n entries" would generally work, it wouldn't really scale: every server would need a different value, which would result in a configuration management nightmare.
Ok, actually... there would be a third way to solve this: imagine a command like
I'm so sorry for my hugely delayed reply! Using this parameter, the check_ipmi_sensor plugin only displays SEL entries, and only alerts on events, that occurred in the time range from today back to the first of September. If there are no SEL entries in the specified time range, the plugin returns OK. Examples:
$ ./check_ipmi_sensor -H -U -P -L admin --seloptions '--date-range=09/10/2019-now'
$ ./check_ipmi_sensor -H -U -P -L admin --seloptions '--date-range=09/01/2019-now'
I hope this helps. Best regards,
I'm wondering if having an "acknowledged SEL entries" file that gets passed to --selexclude might be the best way to keep entries in the SEL but ignore them. Obviously, anything wrapped around this check will need a way to populate and clear that file when issues have been mitigated.
@tniedermeier I'm afraid I've moved on and no longer use this tooling. I do remember this being a major pain for me when working in operational response, so while I can't offer any constructive input on this, I would like to note that I still strongly +1 the investigation of a resolution, to save all those who still use it. The workflows that grew up around this definitely hurt operations and colored my view of this alerting solution.
Hello, the feature to ignore acknowledged events in the SEL is a major thing. The plugin should be able to show either all events or only new ones. It's imperative to keep the SEL with all events, so that in the future we can know what was wrong with the machine. Hope this helps.
As a quick solution for now, I'm using a wrapper script to only look at logs for today:
#!/bin/bash
There's a chance that something will hit the log just before midnight and not get picked up.
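One way to narrow that midnight gap might be to start the range at yesterday's date instead of today's. The following is only a sketch, assuming GNU `date` and the `--seloptions '--date-range=...'` form quoted elsewhere in this thread. The helper name `sel_range_since` and the placeholder variables in the commented call are hypothetical and not part of the plugin.

```shell
#!/bin/bash
# Sketch only: build a date-range option that starts N days back, so entries
# logged shortly before midnight are still covered on the following day.
# Assumes GNU date (for -d "N days ago"); sel_range_since is a hypothetical helper.
sel_range_since() {
    local days="${1:-1}"
    local start
    start="$(date -d "${days} days ago" +%m/%d/%Y)" || return 1
    printf '%s\n' "--date-range=${start}-now"
}

# Example call (host and credentials are placeholders, not values from this thread):
# ./check_ipmi_sensor -H "$IPMI_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" -L admin \
#     --seloptions "$(sel_range_since 1)"
```

The trade-off is that an entry stays visible for up to two calendar days rather than one, which may or may not be acceptable depending on how alerts are acknowledged.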
I like the current support for alerting on SEL entries, but I find it pushes us towards a suboptimal pattern: since we value such alerts, we are forced to keep the SEL empty. On the other hand, it can be convenient to retain a local history of failures for when you inevitably end up in one of those situations where you're stuck wondering whether you should look into a replacement because "isn't this the server that keeps crashing?".
I'd propose a new config option to specify an age limit beyond which events are ignored. For example, I could say
check_ipmi_sensor --selcutoff 24
to indicate that any SEL entry older than 1 day will be ignored.
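To make the intended semantics concrete, here is a rough sketch of such a cutoff written as a standalone filter rather than as plugin code. The `--selcutoff` option is only a proposal, and everything in this sketch is an assumption: the function name `sel_cutoff_filter`, the input format (Unix epoch seconds, a tab, then the entry text), and the sample entries, which are invented for illustration.

```shell
#!/bin/bash
# Sketch of the proposed cutoff semantics, not plugin code: keep only entries
# whose timestamp is at most HOURS old. The input format (Unix epoch seconds,
# a tab, then the entry text) is a hypothetical stand-in for real SEL output.
sel_cutoff_filter() {
    local hours="$1"
    local now="${2:-$(date +%s)}"   # optional fixed "now" makes the filter testable
    awk -F '\t' -v now="$now" -v max_age="$((hours * 3600))" \
        '(now - $1) <= max_age { print }'
}

# Example: with a 24-hour cutoff, only the one-hour-old entry is printed.
now=1700000000
printf '%s\t%s\n' \
    "$((now - 3600))"   "Memory: correctable ECC error" \
    "$((now - 172800))" "Power Supply: failure detected" |
    sel_cutoff_filter 24 "$now"
```

Older entries are merely hidden from the check, not deleted, so the SEL itself keeps the full history.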