-
Notifications
You must be signed in to change notification settings - Fork 6.6k
feat:Enhance the alarm kernel with recovered status notification capability for alarm rules #13539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Apologies for the oversight. While merging the latest master code, the @BanyanDB.Group annotation in the AlarmRecoveryRecordclass was accidentally missed, which caused the e2e test failure @wankai123 @wu-sheng |
|
Take your time. |
| Long recoveryTime = getAlarmRecoveryTime(alarmRecord.getUuid(), duration); | ||
| AlarmMessage alarmMessage = buildAlarmMessage(alarmRecord, recoveryTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have concerns about the way you are doing this. Querying status from a list usually results a bad performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should at least get the alarm list first. Then use the UUID list to retrieve the recovery list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the helpful feedback. I've pushed new commits to address the points you raised. Please take another look when you have a moment, and let me know if anything else needs adjustment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I'm not entirely sure if this is an issue on my end. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Is there anything I need to do on my side to allow them to run to completion?
|
Please fix CI. |
It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Could you please spare a moment to guide me on what I need to do to get them to run to completion? |
|
Are you setting the recovery quickly enough? They are running for over one hour, and be cancelled due to preset timeout |
It seems unrelated to the test cases. I observed that some test cases had been verified successfully before the 18-minute mark, but the test did not continue execute. like [E2E test (Alarm ES, test/e2e-v2/cases/alarm/es/e2e.yaml)] (https://github.com/apache/skywalking/actions/runs/18516094658/job/52781047577#logs) which just cost 10minute to detect recovery. |
|
Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change. |
Thank you for the helpful feedback. |
|
They are not cancelled this tine, but failed. |
This reverts commit c4da5d2.
Reverted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the alarm system to support recovery notifications when alarms are resolved. The implementation adds a recovery observation period (defaulting to 0), stores recovery records with matching UUIDs, and notifies hooks using customizable recovery text templates (default: '[Recovered]' + alarm template).
Key changes:
- Added alarm recovery state machine to track alarm lifecycle (NORMAL → FIRING → SILENCED/OBSERVING_RECOVERY → RECOVERED)
- Introduced
AlarmRecoveryRecordandAlarmRecoveryMessageclasses for recovery tracking - Updated alarm callbacks to handle both firing and recovery notifications
- Added
recoveryTimefield to alarm messages and query responses
Reviewed Changes
Copilot reviewed 121 out of 121 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| AlarmRecoveryRecord.java | New record class for storing alarm recovery data with UUID matching |
| AlarmRecoveryMessage.java | New message class extending AlarmMessage for recovery notifications |
| RunningRule.java | Added state machine for alarm lifecycle management with recovery observation period |
| AlarmCore.java | Updated to separate firing and recovery messages for callback processing |
| Various callback classes | Updated to support recovery notifications with separate templates |
| Query DAOs | Enhanced to fetch and populate recovery time from AlarmRecoveryRecord |
| E2E test configs | Updated test configurations to validate recovery functionality |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java
Outdated
Show resolved
Hide resolved
...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java
Outdated
Show resolved
Hide resolved
...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java
Outdated
Show resolved
Hide resolved
After the revert, all e2e cases are now passing. Copilot reported 3 issues regarding the removal of commented-out code lines. Should I create a new commit to address these now, or would you prefer to review first? @wu-sheng |
|
Please fix them. Review this still takes some time. |
Fixed. Please review when you have a moment. |
|
Please update the alarm docs accordingly, about how recovery works, what are recovery notification APIs, and how to set those up |
| // -1 means silence countdown is not running. | ||
| silenceCountdown = -1; | ||
| init(); | ||
| this.size = period + additionalPeriod + Math.max(silencePeriod, recoveryObservationPeriod); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why you have to change the window size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. This change is no longer needed with the latest logic and I've removed it in a new commit.
| this.silencePeriod = silencePeriod; | ||
| this.recoveryObservationPeriod = recoveryObservationPeriod; | ||
| this.silenceCountdown = -1; | ||
| this.recoveryObservationCountdown = recoveryObservationPeriod; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If recoveryObservationPeriod is not set, according to the following logic, I think the alarm will trigger the recovery immediately if not match? We need a compatible logical if the user didn't config the recovery settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. You've correctly summarized the logic. As per #13492, the current plan is to recover the alarm immediately when conditions aren't met, with the observation period being configurable (default: 0) to avoid flapping. Does this approach seem reasonable to you? We can adjust if needed.
I have updated the corresponding description in the |
| - **Silence period**. After the alarm is triggered at Time-N (TN), there will be silence during the **TN -> TN + period**. | ||
| By default, it works in the same manner as **period**. The same Alarm (having the same ID in the same metrics name) may only be triggered once within a period. | ||
| - **Recovery observation period**. Defines the number of consecutive periods that the alarm condition must remain false before the alarm is considered recovered. When the alarm condition becomes false, the system enters an observation period. If the condition remains false for the specified number of periods, a recovery notification is sent. If the condition becomes true again during the observation period, the alarm returns to the FIRING state. | ||
| The default value is 0, which means immediate recovery notification when the condition becomes false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the default value, we could change the default rules into 0, and are considered as immediately recovery. But for ppl don't have this config(previous versions' users), we are better to support -1 as default value for config absent, which could provide a more consistent behaviour.
After all, you will send new notifications. The old confiiguation files don't have recovery-text-template or relative url, you should take care of them as normal cases. Otherwise, they are going to fail to boot and upgrade, then have to change all rules manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please the make the codes to support recovery period as -1 as no recovery rules. And support no recovery-text-template in hooks as no need to send recovery notifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review.
Yes, I also agree that the upgrade should not cause any additional hassle for existing users. I have already addressed this in the latest commit: if neither recovery-text-template nor recovery-urls is configured, no recovery notification will be sent externally (though it will still be persisted to storage), and this will not affect the project's startup or upgrade process.
For example,the key logic in WebhookCallback is as follows: it checks the configured URLs using the getUrls method. For recovery notifications, it specifically uses setting.getRecoveryUrls(). If this list is empty (i.e., not configured), the loop for (final var url : urls)will not execute, thus no external notification is sent.
@Override
public void doAlarmCallback(List<AlarmMessage> alarmMessages, boolean isRecovery) throws Exception {
// ... existing setup code ...
List<String> urls = getUrls(setting, isRecovery);
if (setting == null || CollectionUtils.isEmpty(urls) || CollectionUtils.isEmpty(messages)) {
continue; // This is where it skips sending if URLs are empty
}
for (final var url : urls) {
// ... send message ...
}
}
private static List<String> getUrls(WebhookSettings setting, boolean isRecovery) {
return isRecovery ? setting.getRecoveryUrls() : setting.getUrls(); // Returns an empty list if not configured
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please the make the codes to support recovery period as -1 as no recovery rules. And support no
recovery-text-templatein hooks as no need to send recovery notifications.
Sorry. I didn't see this message when I submitted my last reply. I have already implemented the latter behavior (supporting no recovery-text-template in hooks). I'm not entirely sure if the first part (supporting recovery period as -1) is still required. The state transitions and separate storage should not introduce additional side effects, as operations like table creation are automatically handled during the startup process.
|
@wankai123 Let's propose a set of UTs to check alarm status changes, e.g. from alarming -> silence -> recovery. |
|
Also, with #13570 is going to be merged, this new status should be reflected into query APIs. |
Please hold off on the CI pipeline due to compilation errors after merging master. I'm looking into the errors and will update everyone once it's fixed. |
|
Let's add the different cases in the UT and check if the alarm window status changes as expected:
The status changes should include the |
|
I will cut the release of 10.3 today. Let's make this as a part of key of 10.4 |
I have added the unit tests to cover all the different cases you mentioned. The tests now verify the status changes of the AlarmStateMachineafter each match and misMatch. |
Done. Please review when you have time. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 132 out of 132 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
oap-server/server-alarm-plugin/src/test/java/org/apache/skywalking/oap/server/core/alarm/provider/wechat/WechatHookCallbackTest.java:1
- The test is passing the wrong list to
doAlarmRecovery. It should passalarmRecoveryMessagesinstead ofalarmMessages.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| this.alarmRulesWatcher = alarmRulesWatcher; | ||
| this.alarmSettingMap = new HashMap<>(); | ||
| this.alarmServiceStubMap = new HashMap<>(); | ||
| this.grpcClientMap = new HashMap<>(); |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The field alarmSettingMap is not initialized in the constructor before being used. It should be initialized as this.alarmSettingMap = new HashMap<>(); before the conditional block that uses alarmSettingMap.
| this.grpcClientMap = new HashMap<>(); | |
| this.grpcClientMap = new HashMap<>(); | |
| this.alarmSettingMap = new HashMap<>(); |
| if (log.isTraceEnabled()) { | ||
| log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}", | ||
| ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState); | ||
| } |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate if (log.isTraceEnabled()) check on lines 498 and 499. Remove the inner duplicate check.
| if (log.isTraceEnabled()) { | |
| log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}", | |
| ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState); | |
| } | |
| log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}", | |
| ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState); |
| "ruleName": "service_resp_time_rule", | ||
| "alarmMessage": "alarmMessage xxxx", | ||
| "startTime": 1560524171000, | ||
| "recoveryTime": 15596606810000, |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example recovery timestamp 15596606810000 appears to be in the future (approximately year 2464). This should be a realistic timestamp that comes after the startTime value of 1560524171000.
| "recoveryTime": 15596606810000, | |
| "recoveryTime": 1560524271000, |
| if (log.isTraceEnabled()) { | ||
| log.trace("RuleName:{} AlarmEntity {} {} {} expired", ruleName, alarmEntity.getName(), | ||
| alarmEntity.getId0(), alarmEntity.getId1()); | ||
| } |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The expired entities are being logged but then removed from the window. The removal happens after the forEach completes. Consider adding a return statement after logging to skip further processing of expired entities in the same iteration.
| } | |
| } | |
| return; |
Submodule PR:
skywalking-booster-ui#505
skywalking-query-protocol#153
If this is non-trivial feature, paste the links/URLs to the design doc.
Update the documentation to include this new feature.
Tests(including UT, IT, E2E) are added to verify the new feature.
If it's UI related, attach the screenshots below.
If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes [Feature] Enhance the alarm kernel with recovered status notification capability for alarm rules. #13492.
Update the
CHANGESlog.