-
Notifications
You must be signed in to change notification settings - Fork 6.6k
feat:Enhance the alarm kernel with recovered status notification capability for alarm rules #13539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
youjie23
wants to merge
35
commits into
apache:master
Choose a base branch
from
youjie23:master
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+2,016
−408
Open
Changes from 29 commits
Commits
Show all changes
35 commits
Select commit
Hold shift + click to select a range
7481ca8
enhance the alarm kernel with recovered status notification capabilit…
youjie23 4b54c18
enhance the alarm kernel with recovered status notification capabilit…
youjie23 638668f
enhance the alarm kernel with recovered status notification capabilit…
youjie23 0acfbe5
enhance the alarm kernel with recovered status notification capabilit…
youjie23 92cfeed
enhance the alarm kernel with recovered status notification capabilit…
youjie23 edc2722
enhance the alarm kernel with recovered status notification capabilit…
youjie23 a7edf5c
enhance the alarm kernel with recovered status notification capabilit…
youjie23 f140f6e
enhance the alarm kernel with recovered status notification capabilit…
youjie23 cf0570b
enhance the alarm kernel with recovered status notification capabilit…
youjie23 a53f9c2
Merge branch 'master' into master
youjie23 d4ad7c0
Merge branch 'master' into master
wu-sheng 5829a48
enhance the alarm kernel with recovered status notification capabilit…
youjie23 9b10401
Merge branch 'master' of github.com:youjie23/skywalking
youjie23 602262d
Merge branch 'master' into master
youjie23 4688cf7
merge master
youjie23 f97ad0c
enhance the alarm kernel with recovered status notification capabilit…
youjie23 239439c
Merge branch 'master' into master
youjie23 587b2aa
chore(e2e): set allowed times to <=0 for endless trigger simulation
youjie23 e8b6200
Merge branch 'master' of github.com:youjie23/skywalking
youjie23 6b1f926
Merge branch 'master' into master
wu-sheng 88d2c85
Merge branch 'master' into master
wu-sheng 783ac8b
Merge branch 'master' into master
wu-sheng c4da5d2
chore:add logs for troubleshooting
youjie23 c080b31
Merge branch 'master' of github.com:youjie23/skywalking
youjie23 c6a8d83
chore:add logs for troubleshooting
youjie23 9c8651c
Revert "chore:add logs for troubleshooting"
youjie23 7c2b0f5
chore: remove the commented-out code
youjie23 4dcff48
enhance the alarm kernel with recovered status notification capabilit…
youjie23 5307baf
enhance the alarm kernel with recovered status notification capabilit…
youjie23 6ff7817
Merge branch 'master' into master
youjie23 ca113a5
enhance the alarm kernel with recovered status notification capabilit…
youjie23 f65414b
Merge branch 'master' into master
youjie23 4c1e2c6
fix Copilot review and CI fail
youjie23 06a96e8
Merge branch 'master' into master
youjie23 37cc68a
Merge branch 'master' into master
youjie23 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -39,6 +39,9 @@ The metrics names in the expression could be found in the [list of all potential | |||||
| If the hook name is not specified, the global hook will be used. | ||||||
| - **Silence period**. After the alarm is triggered at Time-N (TN), there will be silence during the **TN -> TN + period**. | ||||||
| By default, it works in the same manner as **period**. The same Alarm (having the same ID in the same metrics name) may only be triggered once within a period. | ||||||
| - **Recovery observation period**. Defines the number of consecutive periods that the alarm condition must remain false before the alarm is considered recovered. When the alarm condition becomes false, the system enters an observation period. If the condition remains false for the specified number of periods, a recovery notification is sent. If the condition becomes true again during the observation period, the alarm returns to the FIRING state. | ||||||
| The default value is 0, which means immediate recovery notification when the condition becomes false. | ||||||
|
|
||||||
|
|
||||||
| Such as for a metric, there is a shifting window as following at T7. | ||||||
|
|
||||||
|
|
@@ -52,6 +55,7 @@ Such as for a metric, there is a shifting window as following at T7. | |||||
| For example, expression `avg(service_resp_time) > 1000`, if the value are `1001, 1001, 1001, 1001, 1001, 1001, 1001`, | ||||||
| the calculation is `((1001 + 10001 + ... + 1001) / 7) > 1000` and the result would be `1`(true). Then the alarm would be triggered. | ||||||
| * In every minute, the window would shift automatically. At T8, Value8 would be cached, and T1/Value1 would be removed from the window. | ||||||
| * If Value8 is 890, the expression will be calculated based on the metric values from T2 to T8, which are `1001, 1001, 1001, 1001, 1001, 1001, 990`. The calculation becomes `((1001 + 1001 + ... + 890) / 7) < 1000`, and the result would be `0`(false). Consequently, the alarm enters an observation period for recovery. If the `Recovery observation period`is not set or is set to `0`, the alarm is considered recovered immediately, and a recovery notification is sent. Otherwise, the system will wait and observe the condition over the specified number of subsequent periods before declaring recovery. | ||||||
|
|
||||||
| **NOTE**: | ||||||
| * If the expression include labeled metrics and result has multiple labeled value(e.g. `sum(service_percentile{p='50,75'} > 1000) >= 3`), the alarm will be triggered if any of the labeled value result matches 3 times of the condition(P50 > 1000 or P75 > 1000). | ||||||
|
|
@@ -69,6 +73,8 @@ rules: | |||||
| period: 10 | ||||||
| # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period. | ||||||
| silence-period: 10 | ||||||
| # Number of periods to wait before considering the alarm recovered,default as 0. | ||||||
| recovery-observation-period: 2 | ||||||
| message: Successful rate of endpoint {name} is lower than 75% | ||||||
| tags: | ||||||
| level: WARNING | ||||||
|
|
@@ -163,6 +169,14 @@ hooks: | |||||
| "text": ":alarm_clock: *Apache Skywalking Alarm* \n **%s**." | ||||||
| } | ||||||
| } | ||||||
| recovery-text-template: |- | ||||||
| { | ||||||
| "type": "section", | ||||||
| "text": { | ||||||
| "type": "mrkdwn", | ||||||
| "text": ":green_heart: *Apache SkyWalking Alarm Recovered* \n **%s**." | ||||||
| } | ||||||
| } | ||||||
| webhooks: | ||||||
| - https://hooks.slack.com/services/x/y/zssss | ||||||
| custom1: | ||||||
|
|
@@ -192,12 +206,16 @@ webhook: | |||||
| custom1: | ||||||
| urls: | ||||||
| - http://127.0.0.1/custom1 | ||||||
| recovery-urls: | ||||||
| - http://127.0.0.1/custom1 | ||||||
| # headers config is provided to add custom configurations or authentications that are required from the server side. | ||||||
| headers: | ||||||
| Authorization: Bearer bearer_token | ||||||
| custom2: | ||||||
| urls: | ||||||
| - http://127.0.0.1/custom2 | ||||||
| recovery-urls: | ||||||
| - http://127.0.0.1/custom2 | ||||||
| # headers config is provided to add custom configurations or authentications that are required from the server | ||||||
| headers: | ||||||
| Authorization: Basic basic_token | ||||||
|
|
@@ -213,11 +231,13 @@ webhook: | |||||
| The JSON format is based on `List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage>` with the following key information: | ||||||
| - **scopeId**, **scope**. All scopes are defined in `org.apache.skywalking.oap.server.core.source.DefaultScopeDefine`. | ||||||
| - **name**. Target scope entity name. Please follow the [entity name definitions](#entity-name). | ||||||
| - **uuid** : The unique identifier (UUID) of the alarm, which is consistent between the trigger and recovery messages. | ||||||
| - **id0**. The ID of the scope entity that matches with the name. When using the relation scope, it is the source entity ID. | ||||||
| - **id1**. When using the relation scope, it is the destination entity ID. Otherwise, it is empty. | ||||||
| - **ruleName**. The rule name configured in `alarm-settings.yml`. | ||||||
| - **alarmMessage**. The alarm text message. | ||||||
| - **startTime**. The alarm time measured in milliseconds, which occurs between the current time and the midnight of January 1, 1970 UTC. | ||||||
| - **startTime**. The time, in milliseconds since the Unix epoch (January 1, 1970 UTC), when the alarm was triggered. | ||||||
| - **recoveryTime**. The time, in milliseconds since the Unix epoch (January 1, 1970 UTC), when the alarm was recovered. This value is `null` if the alarm has not been recovered. | ||||||
| - **tags**. The tags configured in `alarm-settings.yml`. | ||||||
|
|
||||||
| See the following example: | ||||||
|
|
@@ -226,11 +246,13 @@ See the following example: | |||||
| "scopeId": 1, | ||||||
| "scope": "SERVICE", | ||||||
| "name": "serviceA", | ||||||
| "uuid": "uuid1", | ||||||
| "id0": "12", | ||||||
| "id1": "", | ||||||
| "ruleName": "service_resp_time_rule", | ||||||
| "ruleName": "service_resp_time_rule", | ||||||
| "alarmMessage": "alarmMessage xxxx", | ||||||
| "startTime": 1560524171000, | ||||||
| "recoveryTime": 15596606810000, | ||||||
|
||||||
| "recoveryTime": 15596606810000, | |
| "recoveryTime": 1560524271000, |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the default value, we could change the default rules into 0, and are considered as immediately recovery. But for ppl don't have this config(previous versions' users), we are better to support
-1as default value for config absent, which could provide a more consistent behaviour.After all, you will send new notifications. The old confiiguation files don't have
recovery-text-templateor relative url, you should take care of them as normal cases. Otherwise, they are going to fail to boot and upgrade, then have to change all rules manually.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please the make the codes to support recovery period as -1 as no recovery rules. And support no
recovery-text-templatein hooks as no need to send recovery notifications.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review.
Yes, I also agree that the upgrade should not cause any additional hassle for existing users. I have already addressed this in the latest commit: if neither recovery-text-template nor recovery-urls is configured, no recovery notification will be sent externally (though it will still be persisted to storage), and this will not affect the project's startup or upgrade process.
For example,the key logic in WebhookCallback is as follows: it checks the configured URLs using the getUrls method. For recovery notifications, it specifically uses setting.getRecoveryUrls(). If this list is empty (i.e., not configured), the loop for (final var url : urls)will not execute, thus no external notification is sent.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. I didn't see this message when I submitted my last reply. I have already implemented the latter behavior (supporting no recovery-text-template in hooks). I'm not entirely sure if the first part (supporting recovery period as -1) is still required. The state transitions and separate storage should not introduce additional side effects, as operations like table creation are automatically handled during the startup process.