Silences slow to update alerts in Alertmanager with Sharding Enabled

**Describe the bug**
I am running Alertmanager with sharding enabled across 3 pods, and I'm seeing long delays (~15min) between creating a silence and that silence being reflected in the alerts UI/API. The same delay occurs when expiring a silence.

For example,
1. I create a silence from the UI/API
2. The affected alerts still appear in the UI and API response for quite a long time.
3. Eventually the affected alerts are removed from the list of active alerts.

Whereas I expected the changes to be reflected almost instantly.

**To Reproduce**
Steps to reproduce the behavior:
1. Start Cortex v1.17.1 (with the config defined in 'Additional Context')
2. Navigate to the alertmanager UI
3. Send a test alert (can be anything)
4. Create a silence, matching one of the labels in the test alert.
5. Navigate back to the Alerts page
6. Confirm that the alert is still showing, even after creating the silence.
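
The same steps can be scripted against the Alertmanager v2 API to measure the delay rather than eyeballing it. This is a minimal sketch, not something taken from my deployment: the base URL, the tenant ID in `X-Scope-OrgID`, the label names, and every function name below are assumptions to adjust for your setup.

```python
import json
import time
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumptions: adjust to your deployment. The path prefix matches
# alertmanager_http_prefix from the config; host, port and tenant are hypothetical.
BASE_URL = "http://localhost:8080/alertmanager"
TENANT = "user1"


def rfc3339(ts):
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_alerts(labels, now, minutes=60):
    """Payload for POST /api/v2/alerts: a list containing one test alert."""
    return [{
        "labels": labels,
        "annotations": {"summary": "test alert"},
        "startsAt": rfc3339(now),
        "endsAt": rfc3339(now + timedelta(minutes=minutes)),
    }]


def build_silence(name, value, now, minutes=60):
    """Payload for POST /api/v2/silences matching a single label exactly."""
    return {
        "matchers": [{"name": name, "value": value, "isRegex": False, "isEqual": True}],
        "startsAt": rfc3339(now),
        "endsAt": rfc3339(now + timedelta(minutes=minutes)),
        "createdBy": "repro-script",
        "comment": "measuring silence propagation delay",
    }


def call(method, path, body=None):
    """Send a JSON request for the tenant and decode the JSON response, if any."""
    data = json.dumps(body).encode() if body is not None else None
    headers = {"Content-Type": "application/json", "X-Scope-OrgID": TENANT}
    req = urllib.request.Request(BASE_URL + path, data=data, method=method, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        raw = resp.read()
        return json.loads(raw) if raw else None


def is_active(alerts, labels):
    """True if an alert carrying all the given labels is reported in state 'active'."""
    for alert in alerts:
        if all(alert.get("labels", {}).get(k) == v for k, v in labels.items()):
            return alert.get("status", {}).get("state") == "active"
    return False


def wait_until(predicate, poll_seconds, max_wait):
    """Poll predicate until it is true; return elapsed seconds, or None on timeout."""
    started = time.monotonic()
    while time.monotonic() - started < max_wait:
        if predicate():
            return time.monotonic() - started
        time.sleep(poll_seconds)
    return None


def measure_silence_delay(labels, match_name, poll_seconds=10, max_wait=1800):
    """Send an alert, silence it, and return how long it stays active afterwards."""
    call("POST", "/api/v2/alerts", build_alerts(labels, datetime.now(timezone.utc)))
    seen = wait_until(lambda: is_active(call("GET", "/api/v2/alerts"), labels),
                      poll_seconds, max_wait)
    if seen is None:
        raise RuntimeError("test alert never became active")
    silence = build_silence(match_name, labels[match_name], datetime.now(timezone.utc))
    call("POST", "/api/v2/silences", silence)
    return wait_until(lambda: not is_active(call("GET", "/api/v2/alerts"), labels),
                      poll_seconds, max_wait)
```

Calling `measure_silence_delay({"alertname": "TestAlert", "severity": "test"}, "alertname")` returns the number of seconds until the alert stops being reported as active, or `None` if it is still active after `max_wait`.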

I've attached a video of the test scenario.
https://github.com/user-attachments/assets/cba15fa8-a2fd-4ed5-ad71-01207a035727

**Expected behavior**
When creating a silence, I expect the matched alerts to be silenced almost immediately. Instead, it takes a few minutes for the alerts to change from active to silenced.

**Environment:**
 - Infrastructure: Kubernetes
 - Deployment tool: Kustomize

**Additional Context**
Config Used:
```
api:
  alertmanager_http_prefix: /alertmanager
server:
  log_level: debug
memberlist:
  bind_port: 7946
  join_members:
    - 'cortex-alertmanager-memberlist' # A headless service, pointing to the cortex alertmanager pods.
alertmanager:
  data_dir: /data
  enable_api: true
  external_url: /alertmanager
  persist_interval: 1m
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: memberlist
    replication_factor: 3
  alertmanager_client:
    grpc_compression: gzip
alertmanager_storage:
  backend: gcs
  gcs:
    bucket_name: ${BUCKET_NAME}
    service_account: ${SERVICE_ACCOUNT}
runtime_config:
  file: /etc/cortex-rt/runtime.yml
```
Deployed as a StatefulSet on Kubernetes, running 3+ replicas.

Looking at the logs, I've noticed that the alerts tend to be silenced only after the silences maintenance has run on all replicas. Looking at the code, the maintenance period is hardcoded to 15min.
```
cortex-alertmanager-0 alertmanager ts=2024-10-02T02:59:16.711960932Z caller=silence.go:411 level=debug component=MultiTenantAlertmanager user=user1 component=silences msg="Running maintenance"
cortex-alertmanager-0 alertmanager ts=2024-10-02T02:59:16.812300961Z caller=silence.go:419 level=debug component=MultiTenantAlertmanager user=user1 component=silences msg="Maintenance done" duration=100.331208ms size=1545
```
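
To check whether the delay really lines up with the maintenance runs, the timestamps can be pulled out of the debug logs and compared per pod. This is a rough sketch that assumes the line format shown above, with the pod name as the first whitespace-separated field (as added by the log tailing tool); the function names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Matches the logfmt timestamp, e.g. ts=2024-10-02T02:59:16.711960932Z
TS_RE = re.compile(r"ts=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_ts(line):
    """Return the line's timestamp as a UTC datetime, or None if it has none."""
    m = TS_RE.search(line)
    if not m:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    # Logs carry nanoseconds; datetime only keeps microseconds, so truncate.
    frac = (m.group(2) or "0")[:6].ljust(6, "0")
    return base.replace(microsecond=int(frac))


def maintenance_runs(lines):
    """Group the start times of 'Running maintenance' log lines by pod name."""
    runs = {}
    for line in lines:
        if 'msg="Running maintenance"' not in line:
            continue
        ts = parse_ts(line)
        if ts is None:
            continue
        pod = line.split()[0]  # assumption: pod name prefixes every line
        runs.setdefault(pod, []).append(ts)
    return runs


def intervals_seconds(timestamps):
    """Seconds between consecutive maintenance runs, in chronological order."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
```

Feeding the collected log lines to `maintenance_runs` and then `intervals_seconds` for each pod should show whether runs are spaced roughly 15 minutes apart, and whether the moment an alert flips to silenced follows the latest run across the replicas.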

Also, I have tried changing various config options such as poll_interval, push_pull_interval, persist_interval, grpc_compression, and gc_interval, without much luck. I also tried Consul as the kvstore, which seemed to make no difference.

Silences slow to update alerts in Alertmanager with Sharding Enabled #6248
