Audience: SREs setting up monitoring and alerting for the adapter.
This document provides recommended alerting rules and monitoring queries for the hyperfleet-adapter.
For the canonical list of all metrics, labels, and descriptions, see metrics.md. Metrics are served on port 9090 at /metrics. For health endpoint documentation, see runbook.md#health-checks.
alert: HyperFleetAdapterDown
expr: >
hyperfleet_adapter_up == 0
or
absent(hyperfleet_adapter_up{component="hyperfleet-adapter"})
for: 1m
labels:
severity: critical
annotations:
summary: "HyperFleet Adapter is down"
description: "Adapter {{ $labels.component }} has been down for more than 1 minute."Note:
hyperfleet_adapter_upis explicitly set to 0 only during graceful shutdown. On crash (OOM, panic, node failure), the metric goes stale rather than becoming 0. Theabsent()clause covers this case. It will also fire if the metric has never been scraped (e.g., fresh Prometheus deployment) — expect initial noise until the adapter registers.
alert: HyperFleetAdapterHighFailureRate
expr: |
sum by (component, version) (rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m]))
/
sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[5m]))
> 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High event failure rate"
description: "More than 10% of events are failing for {{ $labels.component }}."alert: HyperFleetAdapterNoEventsProcessed
expr: |
(
sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
and on(component, version) hyperfleet_adapter_up == 1
)
or
(
hyperfleet_adapter_up == 1
unless on(component, version) hyperfleet_adapter_events_processed_total
)
for: 5m
labels:
severity: warning
annotations:
summary: "No events processed"
description: "Adapter {{ $labels.component }} has not processed any events in ~20 minutes."Timing:
rate(...[15m])takes ~15 minutes to reach zero after the last event, plus thefor: 5mpending period. Total delay before firing is ~20 minutes. Theunlessclause handles fresh deployments where the counter has never been incremented — it fires when the adapter is up but no events metric exists yet.
alert: HyperFleetAdapterSlowProcessing
expr: |
histogram_quantile(0.95,
sum by (component, version, le) (
rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
)
) > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Slow event processing"
description: "P95 event processing time exceeds 60 seconds for {{ $labels.component }}."Note: The
sum by (component, version, le)aggregation merges histogram buckets across replicas before computing the quantile, giving a correct cluster-wide P95. Without this, each replica's P95 would be computed independently.
alert: HyperFleetBrokerErrors
expr: rate(hyperfleet_broker_errors_total[5m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Broker errors detected"
description: "Broker errors occurring for {{ $labels.component }}: {{ $labels.error_type }}."alert: HyperFleetAdapterErrorsRising
expr: rate(hyperfleet_adapter_errors_total[5m]) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "Adapter errors rising"
description: "Error rate for {{ $labels.error_type }} exceeds 0.5/s on {{ $labels.component }}."For basic metric queries and examples, see metrics.md#example-promql-queries.