Commit

Merge pull request #269 from lidofinance/feature/critical-csm-alerts-3
feat: critical alerts by modules - 4
AlexanderLukin authored Jan 27, 2025
2 parents 10f074d + 627cbc9 commit 28964f6
Showing 24 changed files with 571 additions and 152 deletions.
3 changes: 0 additions & 3 deletions .env.example.compose
@@ -27,6 +27,3 @@ VALIDATOR_REGISTRY_SOURCE=lido
# Critical alerts (optional).
# CRITICAL_ALERTS_ALERTMANAGER_URL=http://alertmanager:9093
# CRITICAL_ALERTS_MIN_VAL_COUNT=1

# Discord web-hook (optional).
# DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/...
3 changes: 0 additions & 3 deletions .env.example.local
@@ -32,6 +32,3 @@ VALIDATOR_REGISTRY_SOURCE=lido
# Critical alerts (optional).
# CRITICAL_ALERTS_ALERTMANAGER_URL=http://alertmanager:9093
# CRITICAL_ALERTS_MIN_VAL_COUNT=1

# Discord web-hook (optional).
# DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/...
184 changes: 168 additions & 16 deletions README.md
@@ -262,7 +262,8 @@ Holesky) this value should be omitted.
* **Default:** ./docker/validators/lido_mainnet.db
* **Note:** it makes sense to change default value if `VALIDATOR_REGISTRY_SOURCE` is set to "lido"
---
`VALIDATOR_REGISTRY_KEYSAPI_SOURCE_URLS` - Comma-separated list of URLs to [Lido Keys API service](https://github.com/lidofinance/lido-keys-api).
`VALIDATOR_REGISTRY_KEYSAPI_SOURCE_URLS` - Comma-separated list of URLs to
[Lido Keys API service](https://github.com/lidofinance/lido-keys-api).
* **Required:** false
* **Note:** will be used only if `VALIDATOR_REGISTRY_SOURCE` is set to "keysapi"
---
@@ -278,55 +279,206 @@ Holesky) this value should be omitted.
* **Required:** false
* **Default:** 2
---
`VALIDATOR_USE_STUCK_KEYS_FILE` - Use a file with list of validators that are stuck and should be excluded from the monitoring metrics.
`VALIDATOR_USE_STUCK_KEYS_FILE` - Use a file with a list of validators that are stuck and should be excluded from the
monitoring metrics.
* **Required:** false
* **Values:** true / false
* **Default:** false
---
`VALIDATOR_STUCK_KEYS_FILE_PATH` - Path to file with list of validators that are stuck and should be excluded from the monitoring metrics.
`VALIDATOR_STUCK_KEYS_FILE_PATH` - Path to a file with a list of validators that are stuck and should be excluded from
the monitoring metrics.
* **Required:** false
* **Default:** ./docker/validators/stuck_keys.yaml
* **Note:** will be used only if `VALIDATOR_USE_STUCK_KEYS_FILE` is true
---
`SYNC_PARTICIPATION_DISTANCE_DOWN_FROM_CHAIN_AVG` - Distance (down) from Blockchain Sync Participation average after which we think that our sync participation is bad.
`SYNC_PARTICIPATION_DISTANCE_DOWN_FROM_CHAIN_AVG` - Distance (down) from Blockchain Sync Participation average after
which we think that our sync participation is bad.
* **Required:** false
* **Default:** 0
---
`SYNC_PARTICIPATION_EPOCHS_LESS_THAN_CHAIN_AVG` - Number epochs after which we think that our sync participation is bad and alert about that.
`SYNC_PARTICIPATION_EPOCHS_LESS_THAN_CHAIN_AVG` - Number of epochs after which we think that our sync participation is
bad and alert about that.
* **Required:** false
* **Default:** 3
---
`BAD_ATTESTATION_EPOCHS` - Number of epochs after which we think that our attestation is bad and alert about that.
* **Required:** false
* **Default:** 3
---
`CRITICAL_ALERTS_ALERTMANAGER_URL` - If passed, application sends additional critical alerts about validators performance to Alertmanager.
`CRITICAL_ALERTS_ALERTMANAGER_URL` - If set, the application sends additional critical alerts about validator
performance to Alertmanager.
* **Required:** false
---
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with validators count greater this value.
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with a validator count greater than
or equal to this value.
* **Required:** false
* **Default:** 100
---
`CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` - Sets the minimum conditions for triggering critical alerts based on the number
of active validators for node operators in a specific module.

The value must be in JSON format. Example:
`{ "0": { "minActiveCount": 100, "affectedShare": 0.33, "minAffectedCount": 1000 } }`.

The numeric key represents the module ID. Settings under the `0` key apply to all modules unless overridden by settings
for specific module IDs. Settings for specific module IDs take precedence over the `0` key.

A critical alert is sent if:

* The number of active validators for a node operator is greater than or equal to `minActiveCount`.
* The number of affected validators:
  * Is at least `affectedShare` of the total validators for the node operator, OR
  * Is greater than or equal to `minAffectedCount`.
* The value in `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` for the specific module is not overridden by
  `CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`.

If no settings are provided for a specific module or the `0` key, default values are used:
`{ "minActiveCount": CRITICAL_ALERTS_MIN_VAL_COUNT, "affectedShare": 0.33, "minAffectedCount": 1000 }`.
* **Required:** false
* **Default:** {}
---
`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT` - Defines the minimum number of affected validators for a node operator in a
specific module for which a critical alert should be sent.

The value must be in JSON format, for example: `{ "0": 100, "3": 50 }`. The numeric key represents the module ID. The
value for the key `0` applies to all modules. Values for non-zero keys apply only to the specified module and take
precedence over the `0` key.

This variable takes priority over `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` and `CRITICAL_ALERTS_MIN_VAL_COUNT`. If no
value is set for a specific module or the `0` key, the rules from the other two variables will apply instead.
* **Required:** false
* **Default:** {}
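
The lookup order inside `CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT` described above can be sketched as follows. This is a minimal illustration, not the application's actual parsing code: `parseMinAffectedCounts` and `minAffectedCountFor` are hypothetical names, and the validation rules (non-negative integers, numeric keys) are assumptions.

```typescript
// Hypothetical helper: parses a value such as '{ "0": 100, "3": 50 }'.
// The validation rules below are assumptions, not taken from the application.
function parseMinAffectedCounts(raw: string | undefined): Record<string, number> {
  if (raw === undefined || raw.trim() === "") return {}; // documented default is {}
  const parsed: unknown = JSON.parse(raw); // throws SyntaxError on invalid JSON
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("expected a JSON object keyed by module ID");
  }
  const result: Record<string, number> = {};
  for (const [key, value] of Object.entries(parsed)) {
    if (!/^\d+$/.test(key)) throw new Error(`invalid module ID: ${key}`);
    if (typeof value !== "number" || !Number.isInteger(value) || value < 0) {
      throw new Error(`invalid threshold for module ${key}`);
    }
    result[key] = value;
  }
  return result;
}

// The per-module value wins; otherwise the "0" key applies.
// undefined means this variable sets nothing and the other two variables decide.
function minAffectedCountFor(counts: Record<string, number>, moduleId: number): number | undefined {
  return counts[String(moduleId)] ?? counts["0"];
}

const counts = parseMinAffectedCounts('{ "0": 100, "3": 50 }');
console.log(minAffectedCountFor(counts, 3)); // 50
console.log(minAffectedCountFor(counts, 1)); // 100
```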
---
`CRITICAL_ALERTS_ALERTMANAGER_LABELS` - Additional labels for critical alerts.
Must be in JSON string format. Example - '{"a":"valueA","b":"valueB"}'.
Must be in JSON string format. Example: `{ "a": "valueA", "b": "valueB" }`.
* **Required:** false
* **Default:** {}
---

## Application critical alerts (via Alertmanager)

In addition to alerts based on Prometheus metrics you can receive special critical alerts based on beaconchain aggregates from app.
In addition to alerts based on Prometheus metrics, you can receive special critical alerts based on Beacon Chain
aggregates from the app.

You should pass env var `CRITICAL_ALERTS_ALERTMANAGER_URL=http://<alertmanager_host>:<alertmanager_port>`.

And if `ethereum_validators_monitoring_data_actuality < 1h` it allows you to receive alerts from table bellow
Critical alerts for modules are controlled by three environment variables, listed here with their priority (from lowest
to highest):
```
CRITICAL_ALERTS_MIN_VAL_COUNT: number;
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT: {
  <moduleIndex>: {
    minActiveCount: number,
    affectedShare: number,
    minAffectedCount: number,
  }
};
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT: {
  <moduleIndex>: number
};
```

The following rules are applied. They are listed in order of increasing priority: each rule overrides the previous one.

| Alert name | Description | If fired repeat | If value increased repeat |
|----------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|
| CriticalSlashing | At least one validator was slashed | instant | - |
| CriticalMissedProposes | More than 1/3 blocks from Node Operator duties was missed in the last 12 hours | every 6h | - |
| CriticalNegativeDelta | More than 1/3 or more than 1000 Node Operator validators with negative balance delta (between current and 6 epochs ago) | every 6h | every 1h |
| CriticalMissedAttestations | More than 1/3 or more than 1000 Node Operator validators with missed attestations in the last {{ BAD_ATTESTATION_EPOCHS }} epochs | every 6h | every 1h |
1. **Global Fallback** (`CRITICAL_ALERTS_MIN_VAL_COUNT`). If this variable is set, it acts as a default for modules by
creating an implicit rule:
```
{
  "0": {
    "minActiveCount": CRITICAL_ALERTS_MIN_VAL_COUNT,
    "affectedShare": 0.33,
    "minAffectedCount": 1000
  }
}
```

2. **Global Rules for Active Validators** (`CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT`). Default rules apply to all modules
(key `0`) unless overridden.
```
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT = {
  "0": {
    "minActiveCount": <integer>,
    "affectedShare": <0.xx>,
    "minAffectedCount": <integer>
  }
}
```
A critical alert is triggered for a module if **both** conditions are met:
* The number of active validators is greater than or equal to `minActiveCount`.
* The number of affected validators is greater than or equal to either `minAffectedCount` or `affectedShare` of the
  total active validators.

3. **Global Rules for Affected Validators** (`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`). Default rules apply to all
modules (key `0`) unless overridden.
```
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT = {
"0": <integer>
}
```
A critical alert is triggered if the number of affected validators is greater than or equal to this value.

4. **Per-Module Rules for Active Validators** (`CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT`). If specific module keys are
defined, those values override the global rules for `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` and
`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`.
```
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT = {
  "n": {
    "minActiveCount": <integer>,
    "affectedShare": <0.xx>,
    "minAffectedCount": <integer>
  }
}
```
A critical alert is triggered for those modules if **both** conditions are met:

* The number of active validators is greater than or equal to `minActiveCount`.
* The number of affected validators is greater than or equal to either `minAffectedCount` or `affectedShare` of the
  total validators.

For modules that don't have keys in `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT`, the rules defined in the previous steps
are applied.

5. **Per-Module Rules for Affected Validators** (`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`). If specific module keys are
defined, those values override all other rules for the module.
```
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT = {
"n": <integer>
}
```
A critical alert is triggered if the number of affected validators is greater than or equal to the specified value.

To illustrate these rules, consider the following sample config:
```
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT = {
  "0": {
    "minActiveCount": 100,
    "affectedShare": 0.3,
    "minAffectedCount": 1000
  },
  "3": {
    "minActiveCount": 10,
    "affectedShare": 0.5,
    "minAffectedCount": 200
  }
};
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT = {
  "2": 30
};
```
In this case, critical alerts for all modules except 2 and 3 are triggered only for operators with at least 100 active
validators, and only if at least 1000 validators or 30% of active validators are affected (whichever threshold is
lower). For operators from module 3, different thresholds apply: a critical alert is triggered for operators with at
least 10 active validators, and only if at least 200 validators or 50% of validators are affected.

These rules are not applied to module 2. For this module, critical alerts are triggered for all operators with at
least 30 affected validators (no matter how many active validators they have).
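
The sample config above can be checked with a small sketch of the five rules. It is an illustration under stated assumptions, not the application's implementation: the type and function names are hypothetical, the priority order follows the numbered list (rule 5 first, rule 1 last), and `affectedShare` is assumed to be measured against the operator's active validators.

```typescript
// Hypothetical types and function names; only the config keys come from the README.
interface ActiveRule {
  minActiveCount: number;
  affectedShare: number;
  minAffectedCount: number;
}
type ActiveRules = Record<string, ActiveRule | undefined>;
type AffectedRules = Record<string, number | undefined>;

// Both conditions of an "active validators" rule must hold.
// Assumption: the share is taken relative to the operator's active validators.
function matchesActiveRule(rule: ActiveRule, active: number, affected: number): boolean {
  const enoughActive = active >= rule.minActiveCount;
  const enoughAffected =
    affected >= rule.minAffectedCount || affected >= rule.affectedShare * active;
  return enoughActive && enoughAffected;
}

function shouldSendCriticalAlert(
  moduleId: number,
  active: number,
  affected: number,
  activeRules: ActiveRules,
  affectedRules: AffectedRules,
  minValCount = 100, // documented default of CRITICAL_ALERTS_MIN_VAL_COUNT
): boolean {
  const key = String(moduleId);
  const moduleAffected = affectedRules[key];
  if (moduleAffected !== undefined) return affected >= moduleAffected; // rule 5
  const moduleActive = activeRules[key];
  if (moduleActive !== undefined) return matchesActiveRule(moduleActive, active, affected); // rule 4
  const globalAffected = affectedRules["0"];
  if (globalAffected !== undefined) return affected >= globalAffected; // rule 3
  const fallback: ActiveRule = { minActiveCount: minValCount, affectedShare: 0.33, minAffectedCount: 1000 };
  return matchesActiveRule(activeRules["0"] ?? fallback, active, affected); // rules 2 and 1
}

const activeRules: ActiveRules = {
  "0": { minActiveCount: 100, affectedShare: 0.3, minAffectedCount: 1000 },
  "3": { minActiveCount: 10, affectedShare: 0.5, minAffectedCount: 200 },
};
const affectedRules: AffectedRules = { "2": 30 };

console.log(shouldSendCriticalAlert(1, 200, 70, activeRules, affectedRules)); // true: 70 >= 30% of 200
console.log(shouldSendCriticalAlert(1, 50, 50, activeRules, affectedRules)); // false: fewer than 100 active
console.log(shouldSendCriticalAlert(3, 20, 12, activeRules, affectedRules)); // true: 12 >= 50% of 20
console.log(shouldSendCriticalAlert(2, 5, 30, activeRules, affectedRules)); // true: 30 affected, active count ignored
```

Checking the per-module keys before the `0` keys keeps rule 4 ahead of rule 3, which is the one non-obvious step in the ordering.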

If `ethereum_validators_monitoring_data_actuality < 1h`, the alerts from the table below are sent.

| Alert name | Description | If fired repeat | If value increased repeat |
|----------------------------|---------------------------------------------------------------------------------------------------------|-----------------|---------------------------|
| CriticalSlashing | At least one validator was slashed | instant | - |
| CriticalMissedProposes | More than 1/3 blocks from Node Operator duties was missed in the last 12 hours | every 6h | - |
| CriticalNegativeDelta | A certain number of validators with negative balance delta (between current and 6 epochs ago) | every 6h | every 1h |
| CriticalMissedAttestations | A certain number of validators with missed attestations in the last `{{BAD_ATTESTATION_EPOCHS}}` epochs | every 6h | every 1h |


## Application metrics
4 changes: 2 additions & 2 deletions docker-compose.yml
@@ -54,7 +54,7 @@ services:
deploy:
resources:
limits:
memory: 256m
memory: 512m
volumes:
- ./.volumes/prometheus/:/prometheus
- ./docker/prometheus/:/etc/prometheus/
@@ -75,7 +75,7 @@ services:
- '8083:8080'

alertmanager:
image: prom/alertmanager:latest
image: prom/alertmanager:v0.26.0
container_name: alertmanager
restart: unless-stopped
deploy:
24 changes: 12 additions & 12 deletions docker/prometheus/alerts_rules.yml
@@ -9,24 +9,24 @@ groups:
annotations:
emoji: 🔪
summary: "Operators have slashed validators"
description: 'Number of slashed validators per operator'
description: 'Number of slashed validators per operator.'
field_name: '{{ $labels.nos_name }}'
field_value: '[{{ $value | printf "%.0f" }}](http://127.0.0.1:8082/d/3wimU2H7h/nodeoperators/?var-nos_name_var={{ urlquery $labels.nos_name }}&from={{ with query "(time() - 1200) * 1000" }}{{ . | first | value | printf "%f" }}{{ end }}&to={{ with query "time() * 1000" }}{{ . | first | value | printf "%f" }}{{ end }})'
url: "http://127.0.0.1:8082/d/HRgPmpNnz/validators"
footer_text: 'Epoch • {{ with query "ethereum_validators_monitoring_epoch_number" }}{{ . | first | value | printf "%.0f" }}{{ end }}'
footer_icon_url: "https://cryptologos.cc/logos/steth-steth-logo.png"

- alert: DataActuality
expr: absent(ethereum_validators_monitoring_data_actuality) OR (ethereum_validators_monitoring_data_actuality / 1000 > 3600)
expr: ethereum_validators_monitoring_data_actuality > 3600000 OR absent(ethereum_validators_monitoring_data_actuality)
for: 1m
labels:
severity: critical
annotations:
emoji:
summary: "Data actuality greater than 1 hour"
resolved_summary: "Data actuality is back to normal and now less then 1 hour"
description: "({{ humanizeDuration $value }}) It's not OK. Please, check app health"
resolved_description: "It's OK"
resolved_summary: "Data actuality is back to normal and now less than 1 hour."
description: "({{ humanizeDuration $value }}) It's not OK. Please, check app health."
resolved_description: "It's OK."
url: "http://127.0.0.1:8082/d/HRgPmpNnz/validators"
footer_text: 'Epoch • {{ with query "ethereum_validators_monitoring_epoch_number" }}{{ . | first | value | printf "%.0f" }}{{ end }}'
footer_icon_url: "https://cryptologos.cc/logos/steth-steth-logo.png"
@@ -38,7 +38,7 @@ groups:
annotations:
emoji: 💸
summary: 'Operators have a negative balance delta'
resolved_summary: 'Operators have a positive balance delta'
resolved_summary: 'Operators have a positive balance delta.'
description: 'Number of validators per operator who have a negative balance delta.'
resolved_description: 'Number of validators per operator who recovered.'
field_name: '{{ $labels.nos_name }}'
@@ -54,7 +54,7 @@ groups:
annotations:
emoji: 📝❌
summary: 'Operators have missed attestation in last {{ $labels.epoch_interval }} finalized epochs'
resolved_summary: 'Operators not have missed attestation in last {{ $labels.epoch_interval }} finalized epochs'
resolved_summary: 'Operators not have missed attestation in last {{ $labels.epoch_interval }} finalized epochs.'
description: 'Number of validators per operator who have missed attestations.'
resolved_description: 'Number of validators per operator who recovered.'
field_name: '{{ $labels.nos_name }}'
@@ -98,7 +98,7 @@ groups:
annotations:
emoji: 📥
summary: 'Operators missed block propose in the last finalized epoch'
resolved_summary: 'Operators not missed block propose in the last finalized epoch'
resolved_summary: 'Operators not missed block propose in the last finalized epoch.'
description: 'Number of validators per operator who missed block propose.'
resolved_description: 'Number of validators per operator who recovered.'
field_name: '{{ $labels.nos_name }}'
@@ -114,7 +114,7 @@ groups:
annotations:
emoji: 🔄
summary: 'Operators sync participation less than average in last {{ $labels.epoch_interval }} finalized epochs'
resolved_summary: 'Operators sync participation higher or equal than average in last {{ $labels.epoch_interval }} finalized epochs'
resolved_summary: 'Operators sync participation higher or equal than average in last {{ $labels.epoch_interval }} finalized epochs.'
description: 'Number of validators per operator whose sync participation less than average.'
resolved_description: 'Number of validators per operator who recovered.'
field_name: '{{ $labels.nos_name }}'
@@ -129,7 +129,7 @@ groups:
severity: critical
annotations:
emoji: '📈🔄'
summary: 'Operators may get high rewards in the future, but sync participation less than average in last {{ $labels.epoch_interval }} finalized epochs!'
summary: 'Operators may get high rewards in the future, but sync participation less than average in last {{ $labels.epoch_interval }} finalized epochs'
resolved_summary: 'Operators sync participation higher or equal than average in last {{ $labels.epoch_interval }} finalized epoch. Now may get high rewards in the future!'
description: 'Number of validators per operator whose sync participation less than average.'
resolved_description: 'Number of validators per operator who recovered.'
@@ -145,7 +145,7 @@ groups:
severity: critical
annotations:
emoji: '📈📝❌'
summary: 'Operators may get high rewards in the future, but missed attestation in last {{ $labels.epoch_interval }} finalized epochs!'
summary: 'Operators may get high rewards in the future, but missed attestation in last {{ $labels.epoch_interval }} finalized epochs'
resolved_summary: 'Operators not have missed attestation in last {{ $labels.epoch_interval }} finalized epochs. Now may get high rewards in the future!'
description: 'Number of validators per operator who have missed attestations.'
resolved_description: 'Number of validators per operator who recovered.'
@@ -161,7 +161,7 @@ groups:
severity: critical
annotations:
emoji: '📈📥'
summary: 'Operators may get high rewards in the future, but missed block propose in the last finalized epoch!'
summary: 'Operators may get high rewards in the future, but missed block propose in the last finalized epoch'
resolved_summary: 'Operators not missed block propose in the last finalized epoch. Now may get high rewards in the future!'
description: 'Number of validators per operator who missed block propose.'
resolved_description: 'Number of validators per operator who recovered.'