Description
Describe the bug
I am running Alertmanager with sharding enabled across 3 pods, and I'm seeing long delays (~15 min) between creating a silence and that silence being reflected in the alert UI/API. The same delay occurs when expiring a silence.
For example,
- I create a silence from the UI/API
- The affected alerts still appear in the UI and API response for quite a long time.
- Eventually the affected alerts are removed from the list of active alerts.
Whereas I expected the changes to be reflected almost instantly.
To Reproduce
Steps to reproduce the behavior:
- Start Cortex v1.17.1 (with the config defined in 'Additional Context')
- Navigate to the alertmanager UI
- Send a test alert (can be anything)
- Create a silence, matching one of the labels in the test alert.
- Navigate back to the Alerts page
- Confirm that the alert is still showing, even after creating the silence.
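The same steps can be driven through the API to put a number on the delay. The following is a minimal sketch, not part of the original setup: it assumes the Alertmanager v2 API is reachable under the `/alertmanager` prefix, that the tenant is passed in the `X-Scope-OrgID` header, and that a test alert is already firing. `BASE_URL`, `TENANT`, and the helper names are hypothetical and need adjusting to the deployment.

```python
import json
import time
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumptions (not from the report): adjust to your deployment.
BASE_URL = "http://localhost:8080/alertmanager"
TENANT = "user1"


def build_silence(label, value, minutes=60, now=None):
    """Build an Alertmanager v2 silence payload matching label=value exactly."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": label, "value": value, "isRegex": False, "isEqual": True}
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        "createdBy": "repro-script",
        "comment": "measuring silence propagation delay",
    }


def is_silenced(alert, silence_id):
    """Return True if the alert's status lists the given silence ID."""
    return silence_id in alert.get("status", {}).get("silencedBy", [])


def call(method, path, body=None):
    """Send a JSON request for the configured tenant and decode the reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json", "X-Scope-OrgID": TENANT},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read() or b"null")


def measure_delay(label, value, poll_seconds=5, timeout_seconds=1800):
    """Create a silence, then poll until every matching alert reports it.

    Returns the elapsed seconds, or None if the timeout is reached first.
    """
    created = call("POST", "/api/v2/silences", build_silence(label, value))
    silence_id = created["silenceID"]
    start = time.monotonic()
    while time.monotonic() - start < timeout_seconds:
        alerts = call("GET", "/api/v2/alerts")
        matching = [a for a in alerts if a["labels"].get(label) == value]
        if matching and all(is_silenced(a, silence_id) for a in matching):
            return time.monotonic() - start
        time.sleep(poll_seconds)
    return None
```

Calling `measure_delay("alertname", "TestAlert")` (the label and value are placeholders for whatever the test alert carries) returns the number of seconds until the alert is reported as silenced. With the behavior described here, that would be expected to land in the range of minutes rather than seconds.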
I've attached a video of the test scenario.
https://github.com/user-attachments/assets/cba15fa8-a2fd-4ed5-ad71-01207a035727
Expected behavior
When creating a silence, I expect the matched alerts to be silenced almost immediately. Instead, it takes several minutes for the alerts to change from active to silenced.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: Kustomize
Additional Context
Config Used:
api:
  alertmanager_http_prefix: /alertmanager
server:
  log_level: debug
memberlist:
  bind_port: 7946
  join_members:
    - 'cortex-alertmanager-memberlist' # A headless service, pointing to the cortex alertmanager pods.
alertmanager:
  data_dir: /data
  enable_api: true
  external_url: /alertmanager
  persist_interval: 1m
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: memberlist
    replication_factor: 3
  alertmanager_client:
    grpc_compression: gzip
alertmanager_storage:
  backend: gcs
  gcs:
    bucket_name: ${BUCKET_NAME}
    service_account: ${SERVICE_ACCOUNT}
runtime_config:
  file: /etc/cortex-rt/runtime.yml
Deployed as a StatefulSet on Kubernetes, running 3+ replicas.
Looking at the logs, I've noticed that the alerts tend to become silenced only after the silences maintenance run has completed on all replicas. Looking at the code, the maintenance period is hardcoded to 15 min.
cortex-alertmanager-0 alertmanager ts=2024-10-02T02:59:16.711960932Z caller=silence.go:411 level=debug component=MultiTenantAlertmanager user=user1 component=silences msg="Running maintenance"
cortex-alertmanager-0 alertmanager ts=2024-10-02T02:59:16.812300961Z caller=silence.go:419 level=debug component=MultiTenantAlertmanager user=user1 component=silences msg="Maintenance done" duration=100.331208ms size=1545
Also, I have tried changing various config options such as poll_interval, push_pull_interval, persist_interval, grpc_compression, and gc_interval, without much luck. I have also tried Consul as the kvstore, which seemed to make no difference.
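To check whether the moment a silence takes effect lines up with the maintenance runs, the "Maintenance done" timestamps can be pulled out of the debug logs and compared against the silence creation time. This is a rough sketch that assumes log lines shaped like the two shown above; the function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches the logfmt fields of interest in lines like the ones quoted above.
LINE = re.compile(r'ts=(?P<ts>\S+) .*?user=(?P<user>\S+) .*?msg="(?P<msg>[^"]*)"')
TS = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def parse_ts(ts):
    """Parse an RFC 3339 UTC timestamp, truncating nanoseconds to microseconds."""
    m = TS.match(ts)
    if m is None:
        raise ValueError(f"unexpected timestamp: {ts}")
    frac = (m.group(2) or "0")[:6].ljust(6, "0")
    parsed = datetime.strptime(f"{m.group(1)}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def maintenance_done_times(lines):
    """Return (user, timestamp) for every 'Maintenance done' log line."""
    results = []
    for line in lines:
        m = LINE.search(line)
        if m and m.group("msg") == "Maintenance done":
            results.append((m.group("user"), parse_ts(m.group("ts"))))
    return results
```

Feeding the logs of each replica through `maintenance_done_times` gives the completion time of every maintenance run per tenant, which can then be compared with the time the silence was created and the time the alert actually flipped to silenced.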