Feat: Add saturation and capacity Prometheus metrics for WVA observability#933
Feat: Add saturation and capacity Prometheus metrics for WVA observability#933ev-shindin wants to merge 4 commits intollm-d:mainfrom
Conversation
9f69455 to
b63fd32
Compare
|
/trigger-e2e-full |
|
🚀 Kind E2E (full) triggered by |
|
/ok-to-test |
|
🚀 OpenShift E2E — approve and run ( |
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
|
/ok-to-test |
|
🚀 Kind E2E (full) triggered by |
|
🚀 OpenShift E2E — approve and run ( |
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
2 similar comments
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
|
/ok-to-test |
|
🚀 OpenShift E2E — approve and run ( |
|
🚀 Kind E2E (full) triggered by |
|
@ev-shindin the Openshift tests are failing. Can you PTAL? |
|
@shuynh2017 can you PTAL? |
|
/ok-to-test |
|
🚀 Kind E2E (full) triggered by |
|
🚀 OpenShift E2E — approve and run ( |
Add 5 new Prometheus gauge metrics exposing saturation analysis outputs that drive scaling decisions, giving operators visibility into why scaling happens: - wva_saturation_utilization: per-variant utilization ratio (0.0-1.0) - wva_spare_capacity: per-variant spare capacity (0.0-1.0) - wva_required_capacity: model-level required capacity (>0 = scale-up) - wva_kv_cache_tokens_used: KV cache tokens in use per variant - wva_kv_cache_tokens_total: KV cache token capacity per variant Metrics are populated in both V1 (percentage-based) and V2 (token-based) engine paths and emitted during applySaturationDecisions.
…ete hook - Add analyzer_version label to wva_required_capacity to disambiguate V1 (binary 0/1) from V2 (continuous token demand) units. Add AnalyzerVersion field to VariantDecision; set "v1" in enrichDecisionsFromReplicaMetrics and "v2" in enrichDecisionsWithKvTokenData. - Add AnalyzerVersionV1/V2 constants and LabelAnalyzerVersion constant. - Key V2 KV-token aggregation by (modelID, variantName) instead of just variantName; variant names can collide across models in the same cycle. - Add MetricsEmitter.DeleteSaturationMetrics() so the controller delete handler can remove stale time series when a VariantAutoscaling is deleted. - Update tests: cover V1/V2 label distinction, Delete behavior, and analyzer version on controller_instance test.
499a0b6 to
0707e1c
Compare
|
/ok-to-test |
|
🚀 Kind E2E (full) triggered by |
|
🚀 OpenShift E2E — approve and run ( |
| decision.SpareCapacity, | ||
| decision.RequiredCapacity, | ||
| decision.KvCacheTokensUsed, | ||
| decision.KvCacheTokensTotal, |
There was a problem hiding this comment.
For me, it's clearer if we rename to KvCacheTokensCapacity , and that would also align with SpareCapacity and RequiredCapacity. "capacity" is already used in help message "Total KV cache token capacity across all replicas of a variant"
| saturationUtilization = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: constants.WVASaturationUtilization, | ||
| Help: "Per-variant utilization ratio (0.0-1.0) from saturation analysis", |
There was a problem hiding this comment.
do we want to be more specific with the help message, .e.g. cpu, gpu, kv cache utilization?
| spareCapacity = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: constants.WVASpareCapacity, | ||
| Help: "Per-variant spare capacity (0.0-1.0) from saturation analysis", |
| requiredCapacity = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: constants.WVARequiredCapacity, | ||
| Help: "Model-level required capacity; >0 indicates scale-up needed. Use the analyzer_version label to distinguish units (V1: binary 0/1, V2: continuous token demand).", |
There was a problem hiding this comment.
can we format this string using constants.LabelAnalyzerVersion?
| // Analyzer version label values used in saturation metrics. | ||
| const ( | ||
| AnalyzerVersionV1 = "v1" | ||
| AnalyzerVersionV2 = "v2" |
There was a problem hiding this comment.
Users have to remember what "v1" and "v2" mean, and translate "v1" to binary unit, "v2" to continuous , may be a label "unit" or "analyzer_unit" of "binary"/"continuous" is more direct?
There was a problem hiding this comment.
In addition, we also use "const SaturationAnalyzerName = "saturation" in code, and in configmap to distinguish v1 and v2, should we reuse?
There was a problem hiding this comment.
Actually I see few more differences between v1 and v2 below so feel free to ignore these comments.
| MinReplicas: state.MinReplicas, | ||
| MaxReplicas: state.MaxReplicas, | ||
| Utilization: vc.Utilization, | ||
| SpareCapacity: 1.0 - vc.Utilization, |
There was a problem hiding this comment.
If spareCapacity is always 1.0 - Utilization then do we need spareCapacity?
There was a problem hiding this comment.
Not redundant today, but it will be after V1 is removed. Here's the current state:
- V1 path (
enrichDecisionsFromReplicaMetricsinengine.go:764-773):SpareCapacityis set fromAvgSpareKvCapacitywhich is threshold-relative (kvCacheThreshold - avgKvUsage). That is not equal to1.0 - Utilization. The two semantics differ: V1SpareCapacity=0means "at threshold" (e.g., 80% used), while V1Utilization=0would mean "nothing used at all." - V2 path (
cost_aware_optimizer.go:308):SpareCapacity = 1.0 - vc.Utilization, so yes, it's derivable.
The field doc comment at interfaces/saturation_analyzer.go:187-191 already documents both semantics:
V1: threshold-relative spare KV capacity (AvgSpareKvCapacity).
V2: 1.0 - Utilization (absolute spare).
Once V1 is deprecated and removed, SpareCapacity in V2 will be exactly 1.0 - Utilization and the field (and the wva_spare_capacity metric) becomes redundant. At that point I'd propose:
- Either drop the field from
VariantDecisionand have consumers compute1 - Utilizationin PromQL (wva_spare_capacity→ derived fromwva_saturation_utilization) - Or keep the metric as a convenience (avoids PromQL boilerplate in dashboards)
For this PR, keeping SpareCapacity preserves current V1 semantics and matches field-level documentation. I'll file a follow-up issue to reevaluate once V1 is removed.
1297030 to
33fb956
Compare
…abel - Rename KvCacheTokensTotal -> KvCacheTokensCapacity on VariantDecision, Actuator.EmitSaturationMetrics, MetricsEmitter.EmitSaturationMetrics, DeleteSaturationMetrics, and the Prometheus metric name itself (wva_kv_cache_tokens_total -> wva_kv_cache_tokens_capacity). "Total" was confusing — the metric is a gauge of capacity, not a cumulative counter. - Replace the analyzer_version="v1"/"v2" label on wva_required_capacity with a unit="binary"/"continuous" label. The label's purpose is to describe the unit of the metric value (a boolean scale-up signal in V1, a continuous token demand in V2), not the code path that produced it. "binary"/"continuous" remains meaningful after V1 is deprecated, whereas "v1"/"v2" becomes vestigial. Rename VariantDecision.AnalyzerVersion -> RequiredCapacityUnit. Rename constants.LabelAnalyzerVersion -> LabelUnit. Rename constants.AnalyzerVersionV1/V2 -> UnitBinary/UnitContinuous. - Expand help strings on wva_saturation_utilization, wva_spare_capacity, wva_kv_cache_tokens_used, and wva_kv_cache_tokens_capacity to specify what is being measured (KV-cache) and how V1 vs V2 paths differ. - Use constants.LabelUnit, UnitBinary, UnitContinuous in the wva_required_capacity help string via fmt.Sprintf, for consistency with how labels are referenced elsewhere.
33fb956 to
cbefae0
Compare
|
/ok-to-test |
|
🚀 Kind E2E (full) triggered by |
|
🚀 OpenShift E2E — approve and run ( |
Summary
wva_saturation_utilization,wva_spare_capacity,wva_required_capacity,wva_kv_cache_tokens_used,wva_kv_cache_tokens_total) emitted per variant after each scaling decisionVariantDecisionwithUtilization,SpareCapacity,RequiredCapacity,KvCacheTokensUsed, andKvCacheTokensTotalfields for both V1 and V2 engine pathsEmitSaturationMetricsinapplySaturationDecisionsfor every variant with a scaling decisionDetails
New metrics (emitted per reconciliation cycle per variant):
wva_saturation_utilizationwva_spare_capacitywva_required_capacitywva_kv_cache_tokens_usedwva_kv_cache_tokens_totalV1 path:
enrichDecisionsFromReplicaMetricsaggregates per-podReplicaMetricsby variant to compute utilization (avg KV cache usage) and KV token sums.RequiredCapacityis 1.0 ifshouldScaleUp, else 0.0.V2 path:
Utilization,SpareCapacity, andRequiredCapacityare populated fromAnalyzerResultinbuildDecisionsWithOptimizer.enrichDecisionsWithKvTokenDataadds KV token sums fromReplicaMetrics.Files changed:
internal/constants/metrics.go— 5 new metric constant definitionsinternal/interfaces/saturation_analyzer.go— 4 new fields onVariantDecisioninternal/metrics/metrics.go— metric registration andEmitSaturationMetrics()internal/metrics/metrics_test.go— 236-line test suite (4 test cases)internal/engines/saturation/engine.go—enrichDecisionsFromReplicaMetrics,enrichDecisionsWithKvTokenData, emission call inapplySaturationDecisionsinternal/engines/pipeline/cost_aware_optimizer.go— populateUtilization,SpareCapacity,RequiredCapacityinbuildDecisionsWithOptimizerinternal/actuator/actuator.go—EmitSaturationMetricswrapper method