feat:Enhance the alarm kernel with recovered status notification capability for alarm rules #13539

youjie23 · 2025-10-11T10:03:50Z

Add alarm recovery detection with a recovery-observation-period (default 0).
Store the alarm recovery record with the same UUID as the related alarm record.
Notify hooks using a recovery-text-template or recovery-urls; the notification includes the recoveryTime.

Submodule PR:

skywalking-booster-ui#505
skywalking-query-protocol#153
If this is non-trivial feature, paste the links/URLs to the design doc.
Update the documentation to include this new feature.
Tests(including UT, IT, E2E) are added to verify the new feature.
If it's UI related, attach the screenshots below.
If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes [Feature] Enhance the alarm kernel with recovered status notification capability for alarm rules. #13492.
Update the CHANGES log.

youjie23 · 2025-10-11T10:43:09Z

Apologies for the oversight. While merging the latest master code, the @BanyanDB.Group annotation in the AlarmRecoveryRecord class was accidentally dropped, which caused the e2e test failure. @wankai123 @wu-sheng
I will fix it immediately and re-run the tests.

wu-sheng · 2025-10-11T10:44:01Z

Take your time.

wu-sheng · 2025-10-11T13:50:46Z

...a/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBAlarmQueryDAO.java

+            Long recoveryTime = getAlarmRecoveryTime(alarmRecord.getUuid(), duration);
+            AlarmMessage alarmMessage = buildAlarmMessage(alarmRecord, recoveryTime);


I have concerns about the way you are doing this. Querying the status for each item in a list usually results in bad performance.

You should at least get the alarm list first. Then use the UUID list to retrieve the recovery list.
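A minimal sketch of the batched lookup suggested above, assuming simplified stand-in types. `Alarm`, `Recovery`, `AlarmView`, and `RecoveryStore.findByUuids` are hypothetical names for illustration and are not the DAO API used in this PR. The point is only that the recovery records are fetched once for the whole UUID list and then joined in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RecoveryBatchLookupSketch {
    // Hypothetical, simplified stand-in for a stored alarm record.
    static final class Alarm {
        final String uuid;
        final long startTime;
        Alarm(String uuid, long startTime) { this.uuid = uuid; this.startTime = startTime; }
    }

    // Hypothetical stand-in for a stored recovery record sharing the alarm's UUID.
    static final class Recovery {
        final String uuid;
        final long recoveryTime;
        Recovery(String uuid, long recoveryTime) { this.uuid = uuid; this.recoveryTime = recoveryTime; }
    }

    // Query result; recoveryTime stays null while the alarm has not recovered.
    static final class AlarmView {
        final String uuid;
        final long startTime;
        final Long recoveryTime;
        AlarmView(String uuid, long startTime, Long recoveryTime) {
            this.uuid = uuid; this.startTime = startTime; this.recoveryTime = recoveryTime;
        }
    }

    // Hypothetical storage access: one query for all requested UUIDs.
    interface RecoveryStore {
        List<Recovery> findByUuids(List<String> uuids);
    }

    static List<AlarmView> attachRecoveryTimes(List<Alarm> alarms, RecoveryStore store) {
        // Step 1: collect the UUIDs of the alarm list that was already fetched.
        List<String> uuids = new ArrayList<>();
        for (Alarm alarm : alarms) {
            uuids.add(alarm.uuid);
        }
        // Step 2: a single batched query instead of one query per alarm.
        Map<String, Long> recoveryByUuid = new HashMap<>();
        for (Recovery recovery : store.findByUuids(uuids)) {
            recoveryByUuid.put(recovery.uuid, recovery.recoveryTime);
        }
        // Step 3: join in memory.
        List<AlarmView> views = new ArrayList<>();
        for (Alarm alarm : alarms) {
            views.add(new AlarmView(alarm.uuid, alarm.startTime, recoveryByUuid.get(alarm.uuid)));
        }
        return views;
    }
}
```

With this shape the number of storage round trips stays constant regardless of how many alarms the page contains.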

Thank you for the helpful feedback. I've pushed new commits to address the points you raised. Please take another look when you have a moment, and let me know if anything else needs adjustment.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I'm not entirely sure if this is an issue on my end. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Is there anything I need to do on my side to allow them to run to completion?

wu-sheng · 2025-10-15T06:57:57Z

Please fix CI.

youjie23 · 2025-10-15T07:21:39Z

Please fix CI.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Could you please spare a moment to guide me on what I need to do to get them to run to completion?

wu-sheng · 2025-10-15T10:42:08Z

Are you setting the recovery quickly enough? They have been running for over one hour and were cancelled due to the preset timeout.

youjie23 · 2025-10-15T10:52:57Z

Are you setting the recovery quickly enough? They have been running for over one hour and were cancelled due to the preset timeout.

It seems unrelated to the test cases. I observed that some test cases had been verified successfully before the 18-minute mark, but the test did not continue to execute, like [E2E test (Alarm ES, test/e2e-v2/cases/alarm/es/e2e.yaml)](https://github.com/apache/skywalking/actions/runs/18516094658/job/52781047577#logs), which took just 10 minutes to detect recovery.
And it’s not just the alarm case that gets stuck. Other verified cases also did not continue to execute, like E2E test (Log FluentBit ES 8.8.1, test/e2e-v2/cases/log/fluent-bit/e2e.yaml, ES_VERSION=8.8.1).

wu-sheng · 2025-10-15T13:19:02Z

Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

youjie23 · 2025-10-18T16:37:59Z

Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

Thank you for the helpful feedback.
Fixed in the skywalking-infra-e2e #133

wu-sheng · 2025-10-19T05:54:03Z

They are not cancelled this time, but failed.
Please take a look.

This reverts commit c4da5d2.

youjie23 · 2025-11-06T09:29:31Z

Let's revert them and rerun.

Reverted.

Copilot

Pull Request Overview

This PR enhances the alarm system to support recovery notifications when alarms are resolved. The implementation adds a recovery observation period (defaulting to 0), stores recovery records with matching UUIDs, and notifies hooks using customizable recovery text templates (default: '[Recovered]' + alarm template).

Key changes:

- Added alarm recovery state machine to track alarm lifecycle (NORMAL → FIRING → SILENCED/OBSERVING_RECOVERY → RECOVERED)
- Introduced AlarmRecoveryRecord and AlarmRecoveryMessage classes for recovery tracking
- Updated alarm callbacks to handle both firing and recovery notifications
- Added recoveryTime field to alarm messages and query responses

Reviewed Changes

Copilot reviewed 121 out of 121 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| AlarmRecoveryRecord.java | New record class for storing alarm recovery data with UUID matching |
| AlarmRecoveryMessage.java | New message class extending AlarmMessage for recovery notifications |
| RunningRule.java | Added state machine for alarm lifecycle management with recovery observation period |
| AlarmCore.java | Updated to separate firing and recovery messages for callback processing |
| Various callback classes | Updated to support recovery notifications with separate templates |
| Query DAOs | Enhanced to fetch and populate recovery time from AlarmRecoveryRecord |
| E2E test configs | Updated test configurations to validate recovery functionality |

...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java

youjie23 · 2025-11-06T10:43:00Z

Let's revert them and rerun.

After the revert, all e2e cases are now passing. Copilot reported 3 issues regarding the removal of commented-out code lines. Should I create a new commit to address these now, or would you prefer to review first? @wu-sheng

wu-sheng · 2025-11-06T11:01:16Z

Please fix them. Review this still takes some time.

youjie23 · 2025-11-06T11:11:29Z

Please fix them. Review this still takes some time.

Fixed. Please review when you have a moment.

wu-sheng · 2025-11-06T12:38:21Z

Please update the alarm docs accordingly, about how recovery works, what are recovery notification APIs, and how to set those up

wankai123 · 2025-11-07T02:27:57Z

...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java

-            // -1 means silence countdown is not running.
-            silenceCountdown = -1;
-            init();
+            this.size = period + additionalPeriod + Math.max(silencePeriod, recoveryObservationPeriod);


Could you explain why you have to change the window size?

Thanks for the review. This change is no longer needed with the latest logic and I've removed it in a new commit.

wankai123 · 2025-11-07T02:49:36Z

...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java

+                this.silencePeriod = silencePeriod;
+                this.recoveryObservationPeriod = recoveryObservationPeriod;
+                this.silenceCountdown = -1;
+                this.recoveryObservationCountdown = recoveryObservationPeriod;


If recoveryObservationPeriod is not set, then according to the following logic, I think the alarm will trigger recovery immediately when the condition does not match? We need compatible logic for when the user hasn't configured the recovery settings.

Thank you. You've correctly summarized the logic. As per #13492, the current plan is to recover the alarm immediately when conditions aren't met, with the observation period being configurable (default: 0) to avoid flapping. Does this approach seem reasonable to you? We can adjust if needed.

youjie23 · 2025-11-09T16:36:02Z

Please update the alarm docs accordingly, about how recovery works, what are recovery notification APIs, and how to set those up

I have updated the corresponding description in the backend-alarm.md document. Please review it at your convenience.

wu-sheng · 2025-11-09T23:51:13Z

docs/en/setup/backend/backend-alarm.md

 - **Silence period**. After the alarm is triggered at Time-N (TN), there will be silence during the **TN -> TN + period**.
 By default, it works in the same manner as **period**. The same Alarm (having the same ID in the same metrics name) may only be triggered once within a period. 
+- **Recovery observation period**. Defines the number of consecutive periods that the alarm condition must remain false before the alarm is considered recovered. When the alarm condition becomes false, the system enters an observation period. If the condition remains false for the specified number of periods, a recovery notification is sent. If the condition becomes true again during the observation period, the alarm returns to the FIRING state. 
+The default value is 0, which means immediate recovery notification when the condition becomes false.
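The behaviour described in this doc change can be sketched as a small state machine. This is an illustration under stated assumptions, not the `RunningRule` implementation: the class and method names are hypothetical, the SILENCED state and silence period are left out, and the counting rule (the first mismatching check starts the observation, and recovery is reported after that many further consecutive mismatching checks) is one possible reading of the text.

```java
final class RecoveryObservationSketch {
    // State names follow the lifecycle listed in the review overview; SILENCED is omitted here.
    enum State { NORMAL, FIRING, OBSERVING_RECOVERY, RECOVERED }

    private final int recoveryObservationPeriod;
    private State state = State.NORMAL;
    private int mismatchesObserved;

    RecoveryObservationSketch(int recoveryObservationPeriod) {
        // Negative values (such as the -1 discussed in this thread) are not modelled in this sketch.
        if (recoveryObservationPeriod < 0) {
            throw new IllegalArgumentException("recoveryObservationPeriod must be >= 0 in this sketch");
        }
        this.recoveryObservationPeriod = recoveryObservationPeriod;
    }

    State state() {
        return state;
    }

    /**
     * Feeds one period check into the state machine.
     *
     * @return true when a recovery notification should be sent for this check
     */
    boolean onPeriod(boolean conditionMatched) {
        if (conditionMatched) {
            // A match during observation cancels it and returns to FIRING.
            state = State.FIRING;
            mismatchesObserved = 0;
            return false;
        }
        if (state == State.NORMAL || state == State.RECOVERED) {
            // Nothing is firing, so there is nothing to recover.
            return false;
        }
        if (state == State.FIRING) {
            // Assumption: the first mismatch only opens the observation window.
            state = State.OBSERVING_RECOVERY;
            mismatchesObserved = 0;
        } else {
            mismatchesObserved++;
        }
        if (mismatchesObserved >= recoveryObservationPeriod) {
            state = State.RECOVERED;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RecoveryObservationSketch rule = new RecoveryObservationSketch(1);
        boolean[] checks = {true, false, true, false, false};
        for (boolean matched : checks) {
            boolean notify = rule.onPeriod(matched);
            System.out.println(matched + " -> " + rule.state() + (notify ? " (send recovery)" : ""));
        }
    }
}
```

With a period of 0 the first mismatching check after FIRING reports recovery straight away, which matches the documented default.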


About the default value, we could change the default rules to 0, which is treated as immediate recovery. But for people who don't have this config (users of previous versions), we had better support -1 as the default value when the config is absent, which provides more consistent behaviour.
After all, you will send new notifications. The old configuration files don't have recovery-text-template or the related URLs; you should treat them as normal cases. Otherwise, they are going to fail to boot and upgrade, and users would then have to change all rules manually.

Please make the code support a recovery period of -1 as meaning no recovery rule, and support hooks without a recovery-text-template as meaning no recovery notifications need to be sent.

Thanks for the review.
Yes, I also agree that the upgrade should not cause any additional hassle for existing users. I have already addressed this in the latest commit: if neither recovery-text-template nor recovery-urls is configured, no recovery notification will be sent externally (though it will still be persisted to storage), and this will not affect the project's startup or upgrade process.

For example, the key logic in WebhookCallback is as follows: it checks the configured URLs using the getUrls method. For recovery notifications, it specifically uses setting.getRecoveryUrls(). If this list is empty (i.e., not configured), the loop for (final var url : urls) will not execute, so no external notification is sent.

    @Override
    public void doAlarmCallback(List<AlarmMessage> alarmMessages, boolean isRecovery) throws Exception {
        // ... existing setup code ...
        List<String> urls = getUrls(setting, isRecovery);
        if (setting == null || CollectionUtils.isEmpty(urls) || CollectionUtils.isEmpty(messages)) {
            continue; // This is where it skips sending if URLs are empty
        }
        for (final var url : urls) {
            // ... send message ...
        }
    }

    private static List<String> getUrls(WebhookSettings setting, boolean isRecovery) {
        return isRecovery ? setting.getRecoveryUrls() : setting.getUrls(); // Returns an empty list if not configured
    }

Please make the code support a recovery period of -1 as meaning no recovery rule, and support hooks without a recovery-text-template as meaning no recovery notifications need to be sent.

Sorry. I didn't see this message when I submitted my last reply. I have already implemented the latter behavior (supporting no recovery-text-template in hooks). I'm not entirely sure if the first part (supporting recovery period as -1) is still required. The state transitions and separate storage should not introduce additional side effects, as operations like table creation are automatically handled during the startup process.

wu-sheng · 2025-11-10T04:10:56Z

@wankai123 Let's propose a set of UTs to check alarm status changes, e.g. from alarming -> silence -> recovery.

wu-sheng · 2025-11-10T04:11:55Z

Also, with #13570 going to be merged, this new status should be reflected in the query APIs.

youjie23 · 2025-11-10T04:49:51Z

Also, with #13570 going to be merged, this new status should be reflected in the query APIs.

Please hold off on the CI pipeline due to compilation errors after merging master. I'm looking into the errors and will update everyone once it's fixed.

wankai123 · 2025-11-10T06:18:31Z

Let's add the different cases in the UT and check if the alarm window status changes as expected:

- silencePeriod and recoveryObservationPeriod are not set.
- Only set silencePeriod.
- Only set recoveryObservationPeriod.
- silencePeriod > recoveryObservationPeriod.
- recoveryObservationPeriod > silencePeriod.

The status changes should include the AlarmStateMachine current status after each match or misMatch

wu-sheng · 2025-11-10T06:29:51Z

I will cut the 10.3 release today. Let's make this a key part of 10.4.

youjie23 · 2025-11-12T10:16:05Z

Let's add the different cases in the UT and check if the alarm window status changes as expected:

- silencePeriod and recoveryObservationPeriod are not set.
- Only set silencePeriod.
- Only set recoveryObservationPeriod.
- silencePeriod > recoveryObservationPeriod.
- recoveryObservationPeriod > silencePeriod.

The status changes should include the AlarmStateMachine current status after each match or misMatch

I have added unit tests to cover all the cases you mentioned. The tests now verify the status changes of the AlarmStateMachine after each match and misMatch.
The changes are in RunningRuleTest. Please review when you have time. Thanks.

youjie23 · 2025-11-12T10:17:24Z

Also, with #13570 going to be merged, this new status should be reflected in the query APIs.

Done. Please review when you have time. Thanks.

Copilot

Pull Request Overview

Copilot reviewed 132 out of 132 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

oap-server/server-alarm-plugin/src/test/java/org/apache/skywalking/oap/server/core/alarm/provider/wechat/WechatHookCallbackTest.java:1

The test is passing the wrong list to doAlarmRecovery. It should pass alarmRecoveryMessages instead of alarmMessages.

Copilot · 2025-11-12T13:09:08Z

...in/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/grpc/GRPCCallback.java

        this.alarmRulesWatcher = alarmRulesWatcher;
-        this.alarmSettingMap = new HashMap<>();
        this.alarmServiceStubMap = new HashMap<>();
        this.grpcClientMap = new HashMap<>();


The field alarmSettingMap is not initialized in the constructor before being used. It should be initialized as this.alarmSettingMap = new HashMap<>(); before the conditional block that uses alarmSettingMap.

Suggested change

-        this.grpcClientMap = new HashMap<>();
+        this.grpcClientMap = new HashMap<>();
+        this.alarmSettingMap = new HashMap<>();

Copilot · 2025-11-12T13:09:09Z

...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java

+                    if (log.isTraceEnabled()) {
+                        log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
+                                ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);
+                    }


Duplicate if (log.isTraceEnabled()) check on lines 498 and 499. Remove the inner duplicate check.

Suggested change

-                    if (log.isTraceEnabled()) {
-                        log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
-                                ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);
-                    }
+                    log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
+                            ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);

Copilot · 2025-11-12T13:09:09Z

docs/en/setup/backend/backend-alarm.md

+  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage xxxx",
  "startTime": 1560524171000,
+  "recoveryTime": 15596606810000, 


The example recovery timestamp 15596606810000 appears to be in the future (approximately year 2464). This should be a realistic timestamp that comes after the startTime value of 1560524171000.
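The magnitude of these epoch-millisecond values can be checked with `java.time`. The sketch below is only a sanity check; the class and method names are made up for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;

final class EpochMillisCheck {
    // Returns the UTC calendar year that an epoch-millisecond timestamp falls in.
    static int utcYear(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        long startTime = 1560524171000L;          // startTime from the doc example
        long originalRecovery = 15596606810000L;  // value flagged in this comment (one digit too many)
        long suggestedRecovery = 1560524271000L;  // suggested replacement

        System.out.println("startTime year:          " + utcYear(startTime));
        System.out.println("original recovery year:  " + utcYear(originalRecovery));
        System.out.println("suggested recovery year: " + utcYear(suggestedRecovery));
        System.out.println("suggested recovery is " + (suggestedRecovery - startTime) / 1000
            + " seconds after startTime");
    }
}
```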

Suggested change

-  "recoveryTime": 15596606810000,
+  "recoveryTime": 1560524271000,

Copilot · 2025-11-12T13:09:10Z

...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java

+                if (log.isTraceEnabled()) {
+                    log.trace("RuleName:{} AlarmEntity {} {} {} expired", ruleName, alarmEntity.getName(),
+                            alarmEntity.getId0(), alarmEntity.getId1());
+                }


[nitpick] The expired entities are being logged but then removed from the window. The removal happens after the forEach completes. Consider adding a return statement after logging to skip further processing of expired entities in the same iteration.

Suggested change

-                }
+                }
+                return;

youjie23 added 6 commits October 11, 2025 11:06

enhance the alarm kernel with recovered status notification capability …

7481ca8

…apache#13492

enhance the alarm kernel with recovered status notification capability …

4b54c18

…apache#13492

enhance the alarm kernel with recovered status notification capability …

638668f

…apache#13492

enhance the alarm kernel with recovered status notification capability …

0acfbe5

…apache#13492

enhance the alarm kernel with recovered status notification capability …

92cfeed

…apache#13492

enhance the alarm kernel with recovered status notification capability …

edc2722

…apache#13492

wu-sheng requested review from wankai123 and wu-sheng October 11, 2025 10:10

wu-sheng added backend OAP backend related. feature New feature labels Oct 11, 2025

youjie23 added 2 commits October 11, 2025 20:10

enhance the alarm kernel with recovered status notification capability …

a7edf5c

…apache#13492

enhance the alarm kernel with recovered status notification capability …

f140f6e

…apache#13492

wu-sheng reviewed Oct 11, 2025

youjie23 added 2 commits October 15, 2025 10:39

enhance the alarm kernel with recovered status notification capability …

cf0570b

…apache#13492

Merge branch 'master' into master

a53f9c2

wu-sheng and others added 4 commits October 15, 2025 21:19

Merge branch 'master' into master

d4ad7c0

enhance the alarm kernel with recovered status notification capability …

5829a48

…apache#13492

Merge branch 'master' of github.com:youjie23/skywalking

9b10401

Merge branch 'master' into master

602262d

youjie23 mentioned this pull request Oct 23, 2025

feat: allow times to be <= 0 to simulate endless trigger apache/skywalking-infra-e2e#134

Merged

youjie23 closed this Oct 25, 2025

Revert "chore:add logs for troubleshooting"

9c8651c

This reverts commit c4da5d2.

wu-sheng requested a review from Copilot November 6, 2025 09:30

Copilot AI reviewed Nov 6, 2025

chore: remove the commented-out code

7c2b0f5

wankai123 reviewed Nov 7, 2025

youjie23 added 2 commits November 9, 2025 23:50

enhance the alarm kernel with recovered status notification capability …

4dcff48

…apache#13492

enhance the alarm kernel with recovered status notification capability …

5307baf

…apache#13492

wu-sheng reviewed Nov 9, 2025

Merge branch 'master' into master

6ff7817

wu-sheng added this to the 10.4.0 milestone Nov 10, 2025

youjie23 added 2 commits November 12, 2025 18:06

enhance the alarm kernel with recovered status notification capability …

ca113a5

…apache#13492

Merge branch 'master' into master

f65414b

wu-sheng requested review from Copilot and wankai123 November 12, 2025 13:02

Copilot AI reviewed Nov 12, 2025

fix Copilot review and CI fail

4c1e2c6
