Skip to content

Conversation

@youjie23
Copy link

@youjie23 youjie23 commented Oct 11, 2025

  • ​​Add​​ alarm recovery detection with a recovery-observation-period(default 0).
  • Store​​ the alarm recovery record with the same UUID as the related alarm record.
  • Notify​​ hooks using a recovery-text-template or recovery-urls. , and the notification includes the recoveryTime.

Submodule PR:

@wu-sheng wu-sheng added backend OAP backend related. feature New feature labels Oct 11, 2025
@youjie23
Copy link
Author

youjie23 commented Oct 11, 2025

Apologies for the oversight. While merging the latest master code, the @BanyanDB.Group annotation in the AlarmRecoveryRecordclass was accidentally missed, which caused the e2e test failure @wankai123 @wu-sheng
​​I will fix it immediately and re-run the tests.​

@wu-sheng
Copy link
Member

Take your time.

Comment on lines 96 to 97
Long recoveryTime = getAlarmRecoveryTime(alarmRecord.getUuid(), duration);
AlarmMessage alarmMessage = buildAlarmMessage(alarmRecord, recoveryTime);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concerns about the way you are doing this. Querying status from a list usually results a bad performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should at least get the alarm list first. Then use the UUID list to retrieve the recovery list.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the helpful feedback. I've pushed new commits to address the points you raised. Please take another look when you have a moment, and let me know if anything else needs adjustment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I'm not entirely sure if this is an issue on my end. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Is there anything I need to do on my side to allow them to run to completion?

@wu-sheng
Copy link
Member

Please fix CI.

@youjie23
Copy link
Author

Please fix CI.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Could you please spare a moment to guide me on what I need to do to get them to run to completion?

@wu-sheng
Copy link
Member

Are you setting the recovery quickly enough? They are running for over one hour, and be cancelled due to preset timeout

@youjie23
Copy link
Author

youjie23 commented Oct 15, 2025

Are you setting the recovery quickly enough? They are running for over one hour, and be cancelled due to preset timeout

It seems unrelated to the test cases. I observed that some test cases had been verified successfully before the 18-minute mark, but the test did not continue execute. like [E2E test (Alarm ES, test/e2e-v2/cases/alarm/es/e2e.yaml)] (https://github.com/apache/skywalking/actions/runs/18516094658/job/52781047577#logs) which just cost 10minute to detect recovery.
And it’s not just the alarm case that gets stuck. Other verified cases also did not continue to execute. like E2E test (Log FluentBit ES 8.8.1, test/e2e-v2/cases/log/fluent-bit/e2e.yaml, ES_VERSION=8.8.1)

@wu-sheng
Copy link
Member

Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

@youjie23
Copy link
Author

Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

Thank you for the helpful feedback.
Fixed in the skywalking-infra-e2e #133

@wu-sheng
Copy link
Member

They are not cancelled this tine, but failed.
please take a look.

@youjie23
Copy link
Author

youjie23 commented Nov 6, 2025

Let's revert them and rerun.

Reverted.

@wu-sheng wu-sheng requested a review from Copilot November 6, 2025 09:30
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the alarm system to support recovery notifications when alarms are resolved. The implementation adds a recovery observation period (defaulting to 0), stores recovery records with matching UUIDs, and notifies hooks using customizable recovery text templates (default: '[Recovered]' + alarm template).

Key changes:

  • Added alarm recovery state machine to track alarm lifecycle (NORMAL → FIRING → SILENCED/OBSERVING_RECOVERY → RECOVERED)
  • Introduced AlarmRecoveryRecord and AlarmRecoveryMessage classes for recovery tracking
  • Updated alarm callbacks to handle both firing and recovery notifications
  • Added recoveryTime field to alarm messages and query responses

Reviewed Changes

Copilot reviewed 121 out of 121 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
AlarmRecoveryRecord.java New record class for storing alarm recovery data with UUID matching
AlarmRecoveryMessage.java New message class extending AlarmMessage for recovery notifications
RunningRule.java Added state machine for alarm lifecycle management with recovery observation period
AlarmCore.java Updated to separate firing and recovery messages for callback processing
Various callback classes Updated to support recovery notifications with separate templates
Query DAOs Enhanced to fetch and populate recovery time from AlarmRecoveryRecord
E2E test configs Updated test configurations to validate recovery functionality

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@youjie23
Copy link
Author

youjie23 commented Nov 6, 2025

Let's revert them and rerun.

After the revert, all e2e cases are now passing. Copilot reported 3 issues regarding the removal of commented-out code lines. Should I create a new commit to address these now, or would you prefer to review first? @wu-sheng

@wu-sheng
Copy link
Member

wu-sheng commented Nov 6, 2025

Please fix them. Review this still takes some time.

@youjie23
Copy link
Author

youjie23 commented Nov 6, 2025

Please fix them. Review this still takes some time.

Fixed. Please review when you have a moment.

@wu-sheng
Copy link
Member

wu-sheng commented Nov 6, 2025

Please update the alarm docs accordingly, about how recovery works, what are recovery notification APIs, and how to set those up

// -1 means silence countdown is not running.
silenceCountdown = -1;
init();
this.size = period + additionalPeriod + Math.max(silencePeriod, recoveryObservationPeriod);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why you have to change the window size?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. This change is no longer needed with the latest logic and I've removed it in a new commit.

this.silencePeriod = silencePeriod;
this.recoveryObservationPeriod = recoveryObservationPeriod;
this.silenceCountdown = -1;
this.recoveryObservationCountdown = recoveryObservationPeriod;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If recoveryObservationPeriod is not set, according to the following logic, I think the alarm will trigger the recovery immediately if not match? We need a compatible logical if the user didn't config the recovery settings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. You've correctly summarized the logic. As per #13492, the current plan is to recover the alarm immediately when conditions aren't met, with the observation period being configurable (default: 0) to avoid flapping. Does this approach seem reasonable to you?​ We can adjust if needed.

@youjie23
Copy link
Author

youjie23 commented Nov 9, 2025

Please update the alarm docs accordingly, about how recovery works, what are recovery notification APIs, and how to set those up

I have updated the corresponding description in the backend-alarm.md document. Please review it at your convenience.

- **Silence period**. After the alarm is triggered at Time-N (TN), there will be silence during the **TN -> TN + period**.
By default, it works in the same manner as **period**. The same Alarm (having the same ID in the same metrics name) may only be triggered once within a period.
- **Recovery observation period**. Defines the number of consecutive periods that the alarm condition must remain false before the alarm is considered recovered. When the alarm condition becomes false, the system enters an observation period. If the condition remains false for the specified number of periods, a recovery notification is sent. If the condition becomes true again during the observation period, the alarm returns to the FIRING state.
The default value is 0, which means immediate recovery notification when the condition becomes false.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the default value, we could change the default rules into 0, and are considered as immediately recovery. But for ppl don't have this config(previous versions' users), we are better to support -1 as default value for config absent, which could provide a more consistent behaviour.
After all, you will send new notifications. The old confiiguation files don't have recovery-text-template or relative url, you should take care of them as normal cases. Otherwise, they are going to fail to boot and upgrade, then have to change all rules manually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please the make the codes to support recovery period as -1 as no recovery rules. And support no recovery-text-template in hooks as no need to send recovery notifications.

Copy link
Author

@youjie23 youjie23 Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review.
Yes, I also agree that the upgrade should not cause any additional hassle for existing users. I have already addressed this in the latest commit: if neither recovery-text-template nor recovery-urls is configured, no recovery notification will be sent externally (though it will still be persisted to storage), and this will not affect the project's startup or upgrade process.

For example,the key logic in WebhookCallback is as follows: it checks the configured URLs using the getUrls method. For recovery notifications, it specifically uses setting.getRecoveryUrls(). If this list is empty (i.e., not configured), the loop for (final var url : urls)will not execute, thus no external notification is sent.

@Override
public void doAlarmCallback(List<AlarmMessage> alarmMessages, boolean isRecovery) throws Exception {
    // ... existing setup code ...
    List<String> urls = getUrls(setting, isRecovery);
    if (setting == null || CollectionUtils.isEmpty(urls) || CollectionUtils.isEmpty(messages)) {
        continue; // This is where it skips sending if URLs are empty
    }
    for (final var url : urls) {
        // ... send message ...
    }
}

private static List<String> getUrls(WebhookSettings setting, boolean isRecovery) {
    return isRecovery ? setting.getRecoveryUrls() : setting.getUrls(); // Returns an empty list if not configured
}

Copy link
Author

@youjie23 youjie23 Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please the make the codes to support recovery period as -1 as no recovery rules. And support no recovery-text-template in hooks as no need to send recovery notifications.

Sorry. I didn't see this message when I submitted my last reply. I have already implemented the latter behavior (supporting no recovery-text-template in hooks). I'm not entirely sure if the first part (supporting recovery period as -1) is still required. The state transitions and separate storage should not introduce additional side effects, as operations like table creation are automatically handled during the startup process.

@wu-sheng
Copy link
Member

@wankai123 Let's propose a set of UTs to check alarm status changes, e.g. from alarming -> silence -> recovery.

@wu-sheng
Copy link
Member

Also, with #13570 is going to be merged, this new status should be reflected into query APIs.

@youjie23
Copy link
Author

Also, with #13570 is going to be merged, this new status should be reflected into query APIs.

Please hold off on the CI pipeline due to compilation errors after merging master. I'm looking into the errors and will update everyone once it's fixed.

@wankai123
Copy link
Member

wankai123 commented Nov 10, 2025

Let's add the different cases in the UT and check if the alarm window status changes as expected:

  1. silencePeriod and recoveryObservationPeriod are not set.
  2. Only set silencePeriod.
  3. Only set recoveryObservationPeriod.
  4. silencePeriod > recoveryObservationPeriod.
  5. recoveryObservationPeriod > silencePeriod.

The status changes should include the AlarmStateMachine current status after each match or misMatch

@wu-sheng
Copy link
Member

I will cut the release of 10.3 today. Let's make this as a part of key of 10.4

@wu-sheng wu-sheng added this to the 10.4.0 milestone Nov 10, 2025
@youjie23
Copy link
Author

youjie23 commented Nov 12, 2025

Let's add the different cases in the UT and check if the alarm window status changes as expected:

  1. silencePeriod and recoveryObservationPeriod are not set.
  2. Only set silencePeriod.
  3. Only set recoveryObservationPeriod.
  4. silencePeriod > recoveryObservationPeriod.
  5. recoveryObservationPeriod > silencePeriod.

The status changes should include the AlarmStateMachine current status after each match or misMatch

I have added the unit tests to cover all the different cases you mentioned. The tests now verify the status changes of the AlarmStateMachineafter each match and misMatch.
The changes are in RunningRuleTest. Please review when you have time. Thanks.

@youjie23
Copy link
Author

Also, with #13570 is going to be merged, this new status should be reflected into query APIs.

Done. Please review when you have time. Thanks.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 132 out of 132 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

oap-server/server-alarm-plugin/src/test/java/org/apache/skywalking/oap/server/core/alarm/provider/wechat/WechatHookCallbackTest.java:1

  • The test is passing the wrong list to doAlarmRecovery. It should pass alarmRecoveryMessages instead of alarmMessages.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

this.alarmRulesWatcher = alarmRulesWatcher;
this.alarmSettingMap = new HashMap<>();
this.alarmServiceStubMap = new HashMap<>();
this.grpcClientMap = new HashMap<>();
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field alarmSettingMap is not initialized in the constructor before being used. It should be initialized as this.alarmSettingMap = new HashMap<>(); before the conditional block that uses alarmSettingMap.

Suggested change
this.grpcClientMap = new HashMap<>();
this.grpcClientMap = new HashMap<>();
this.alarmSettingMap = new HashMap<>();

Copilot uses AI. Check for mistakes.
Comment on lines 499 to 502
if (log.isTraceEnabled()) {
log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);
}
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate if (log.isTraceEnabled()) check on lines 498 and 499. Remove the inner duplicate check.

Suggested change
if (log.isTraceEnabled()) {
log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);
}
log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);

Copilot uses AI. Check for mistakes.
"ruleName": "service_resp_time_rule",
"alarmMessage": "alarmMessage xxxx",
"startTime": 1560524171000,
"recoveryTime": 15596606810000,
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example recovery timestamp 15596606810000 appears to be in the future (approximately year 2464). This should be a realistic timestamp that comes after the startTime value of 1560524171000.

Suggested change
"recoveryTime": 15596606810000,
"recoveryTime": 1560524271000,

Copilot uses AI. Check for mistakes.
if (log.isTraceEnabled()) {
log.trace("RuleName:{} AlarmEntity {} {} {} expired", ruleName, alarmEntity.getName(),
alarmEntity.getId0(), alarmEntity.getId1());
}
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The expired entities are being logged but then removed from the window. The removal happens after the forEach completes. Consider adding a return statement after logging to skip further processing of expired entities in the same iteration.

Suggested change
}
}
return;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend OAP backend related. feature New feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Enhance the alarm kernel with recovered status notification capability for alarm rules.

3 participants