Feat: Add saturation and capacity Prometheus metrics for WVA observability by ev-shindin · Pull Request #933 · llm-d/llm-d-workload-variant-autoscaler

ev-shindin · 2026-03-25T12:07:03Z

Summary

- Add 5 new Prometheus gauges (wva_saturation_utilization, wva_spare_capacity, wva_required_capacity, wva_kv_cache_tokens_used, wva_kv_cache_tokens_total) emitted per variant after each scaling decision
- Enrich VariantDecision with Utilization, SpareCapacity, RequiredCapacity, KvCacheTokensUsed, and KvCacheTokensTotal fields for both V1 and V2 engine paths
- Emit metrics via EmitSaturationMetrics in applySaturationDecisions for every variant with a scaling decision

Details

New metrics (emitted per reconciliation cycle per variant):

| Metric | Labels | Description |
| --- | --- | --- |
| `wva_saturation_utilization` | variant_name, namespace, accelerator_type | Utilization ratio (0.0-1.0) |
| `wva_spare_capacity` | variant_name, namespace, accelerator_type | Spare capacity (0.0-1.0) |
| `wva_required_capacity` | variant_name, namespace | Model-level required capacity (>0 = scale-up needed) |
| `wva_kv_cache_tokens_used` | variant_name, namespace | Sum of KV cache tokens in use |
| `wva_kv_cache_tokens_total` | variant_name, namespace | Sum of KV cache token capacity |

V1 path: enrichDecisionsFromReplicaMetrics aggregates per-pod ReplicaMetrics by variant to compute utilization (avg KV cache usage) and KV token sums. RequiredCapacity is 1.0 if shouldScaleUp, else 0.0.
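The V1 aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code in engine.go: the `replicaMetrics` and `variantSummary` types, their field names, and the `summarizeByVariant` helper are all hypothetical stand-ins for the real ReplicaMetrics and VariantDecision types.

```go
package main

// replicaMetrics is a hypothetical, simplified stand-in for the per-pod
// ReplicaMetrics type; the field names here are assumptions.
type replicaMetrics struct {
	VariantName string
	KvUsage     float64 // fraction of KV cache in use on this pod (0.0-1.0)
	TokensUsed  int64   // KV cache tokens currently in use
	TokensTotal int64   // KV cache token capacity of this pod
}

// variantSummary holds the per-variant values that would be copied onto a
// decision before metrics are emitted (hypothetical type).
type variantSummary struct {
	Utilization        float64
	KvCacheTokensUsed  int64
	KvCacheTokensTotal int64
	RequiredCapacity   float64
}

// summarizeByVariant groups pods by variant, averages KV cache usage, sums
// the token counters, and maps the scale-up flag to 1.0 or 0.0.
func summarizeByVariant(pods []replicaMetrics, shouldScaleUp bool) map[string]variantSummary {
	type acc struct {
		usageSum    float64
		count       int
		used, total int64
	}
	accs := make(map[string]*acc)
	for _, p := range pods {
		a := accs[p.VariantName]
		if a == nil {
			a = &acc{}
			accs[p.VariantName] = a
		}
		a.usageSum += p.KvUsage
		a.count++
		a.used += p.TokensUsed
		a.total += p.TokensTotal
	}

	// V1 reports required capacity as a binary signal.
	required := 0.0
	if shouldScaleUp {
		required = 1.0
	}

	out := make(map[string]variantSummary, len(accs))
	for name, a := range accs {
		out[name] = variantSummary{
			Utilization:        a.usageSum / float64(a.count),
			KvCacheTokensUsed:  a.used,
			KvCacheTokensTotal: a.total,
			RequiredCapacity:   required,
		}
	}
	return out
}
```

Averaging usage per pod while summing token counts mirrors the wording above (avg KV cache usage, KV token sums); whether the real code weights pods differently is not stated here.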

V2 path: Utilization, SpareCapacity, and RequiredCapacity are populated from AnalyzerResult in buildDecisionsWithOptimizer. enrichDecisionsWithKvTokenData adds KV token sums from ReplicaMetrics.

Files changed:

internal/constants/metrics.go — 5 new metric constant definitions
internal/interfaces/saturation_analyzer.go — 4 new fields on VariantDecision
internal/metrics/metrics.go — metric registration and EmitSaturationMetrics()
internal/metrics/metrics_test.go — 236-line test suite (4 test cases)
internal/engines/saturation/engine.go — enrichDecisionsFromReplicaMetrics, enrichDecisionsWithKvTokenData, emission call in applySaturationDecisions
internal/engines/pipeline/cost_aware_optimizer.go — populate Utilization, SpareCapacity, RequiredCapacity in buildDecisionsWithOptimizer
internal/actuator/actuator.go — EmitSaturationMetrics wrapper method

ev-shindin · 2026-03-25T16:20:43Z

/trigger-e2e-full

github-actions · 2026-03-25T16:20:54Z

🚀 Kind E2E (full) triggered by /trigger-e2e-full

View the Kind E2E workflow run

ev-shindin · 2026-03-25T16:23:04Z

/ok-to-test

github-actions · 2026-03-25T16:23:15Z

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

github-actions · 2026-03-25T16:25:50Z

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 23 | 27 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

ev-shindin · 2026-04-14T14:33:08Z

/ok-to-test

github-actions · 2026-04-14T14:33:21Z

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

github-actions · 2026-04-14T14:33:28Z

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

github-actions · 2026-04-14T14:35:49Z

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 42 | 8 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

ev-shindin · 2026-04-14T19:48:54Z

/ok-to-test

github-actions · 2026-04-14T19:49:12Z

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

github-actions · 2026-04-14T19:49:17Z

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

lionelvillard · 2026-04-14T20:02:46Z

@ev-shindin the Openshift tests are failing. Can you PTAL?

lionelvillard · 2026-04-14T20:03:51Z

@shuynh2017 can you PTAL?

ev-shindin · 2026-04-15T04:27:09Z

/ok-to-test

github-actions · 2026-04-15T04:27:19Z

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

github-actions · 2026-04-15T04:27:33Z

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

Add 5 new Prometheus gauge metrics exposing saturation analysis outputs that drive scaling decisions, giving operators visibility into why scaling happens:

- wva_saturation_utilization: per-variant utilization ratio (0.0-1.0)
- wva_spare_capacity: per-variant spare capacity (0.0-1.0)
- wva_required_capacity: model-level required capacity (>0 = scale-up)
- wva_kv_cache_tokens_used: KV cache tokens in use per variant
- wva_kv_cache_tokens_total: KV cache token capacity per variant

Metrics are populated in both V1 (percentage-based) and V2 (token-based) engine paths and emitted during applySaturationDecisions.

…ete hook

- Add analyzer_version label to wva_required_capacity to disambiguate V1 (binary 0/1) from V2 (continuous token demand) units. Add AnalyzerVersion field to VariantDecision; set "v1" in enrichDecisionsFromReplicaMetrics and "v2" in enrichDecisionsWithKvTokenData.
- Add AnalyzerVersionV1/V2 constants and LabelAnalyzerVersion constant.
- Key V2 KV-token aggregation by (modelID, variantName) instead of just variantName; variant names can collide across models in the same cycle.
- Add MetricsEmitter.DeleteSaturationMetrics() so the controller delete handler can remove stale time series when a VariantAutoscaling is deleted.
- Update tests: cover V1/V2 label distinction, Delete behavior, and analyzer version on controller_instance test.
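The reason for keying the aggregation by (modelID, variantName) can be illustrated with a small sketch. The `variantKey`, `podTokens`, `kvTokens`, and `sumKvTokens` names below are hypothetical and only show the composite-key idea; they are not taken from engine.go.

```go
package main

// variantKey is a hypothetical composite map key. Go struct values with
// comparable fields can be used directly as map keys.
type variantKey struct {
	ModelID     string
	VariantName string
}

// podTokens is an assumed, simplified view of one pod's KV cache counters.
type podTokens struct {
	ModelID     string
	VariantName string
	Used        int64
	Total       int64
}

// kvTokens holds the summed counters for one (model, variant) pair.
type kvTokens struct {
	Used  int64
	Total int64
}

// sumKvTokens aggregates per-pod counters under a (modelID, variantName)
// key, so two models that reuse a variant name keep separate sums.
func sumKvTokens(pods []podTokens) map[variantKey]kvTokens {
	sums := make(map[variantKey]kvTokens)
	for _, p := range pods {
		k := variantKey{ModelID: p.ModelID, VariantName: p.VariantName}
		s := sums[k] // zero value when the key is not present yet
		s.Used += p.Used
		s.Total += p.Total
		sums[k] = s
	}
	return sums
}
```

With a map keyed only by variant name, the two "decode" variants in a case like `{m1, decode}` and `{m2, decode}` would be merged into one sum; the struct key keeps them apart.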

ev-shindin · 2026-04-15T08:08:10Z

/ok-to-test

github-actions · 2026-04-15T08:08:22Z

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

github-actions · 2026-04-15T08:08:26Z

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

shuynh2017 · 2026-04-15T11:25:24Z

+		decision.SpareCapacity,
+		decision.RequiredCapacity,
+		decision.KvCacheTokensUsed,
+		decision.KvCacheTokensTotal,


For me, it's clearer if we rename this to KvCacheTokensCapacity, which would also align with SpareCapacity and RequiredCapacity. "Capacity" is already used in the help message: "Total KV cache token capacity across all replicas of a variant".

shuynh2017 · 2026-04-15T11:41:22Z

+	saturationUtilization = prometheus.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Name: constants.WVASaturationUtilization,
+			Help: "Per-variant utilization ratio (0.0-1.0) from saturation analysis",


Do we want to be more specific in the help message, e.g. CPU, GPU, or KV cache utilization?

shuynh2017 · 2026-04-15T11:41:42Z

+	spareCapacity = prometheus.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Name: constants.WVASpareCapacity,
+			Help: "Per-variant spare capacity (0.0-1.0) from saturation analysis",


same comment ^^

shuynh2017 · 2026-04-15T11:45:17Z

+	requiredCapacity = prometheus.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Name: constants.WVARequiredCapacity,
+			Help: "Model-level required capacity; >0 indicates scale-up needed. Use the analyzer_version label to distinguish units (V1: binary 0/1, V2: continuous token demand).",


can we format this string using constants.LabelAnalyzerVersion?

shuynh2017 · 2026-04-15T12:15:57Z

+// Analyzer version label values used in saturation metrics.
+const (
+	AnalyzerVersionV1 = "v1"
+	AnalyzerVersionV2 = "v2"


Users have to remember what "v1" and "v2" mean, and translate "v1" to a binary unit and "v2" to a continuous one. Maybe a label "unit" or "analyzer_unit" with values "binary"/"continuous" would be more direct?

In addition, we already use const SaturationAnalyzerName = "saturation" in code and in the configmap to distinguish v1 and v2. Should we reuse it?

Actually, I see a few more differences between v1 and v2 below, so feel free to ignore these comments.

shuynh2017 · 2026-04-15T12:31:44Z

+			MinReplicas:      state.MinReplicas,
+			MaxReplicas:      state.MaxReplicas,
+			Utilization:      vc.Utilization,
+			SpareCapacity:    1.0 - vc.Utilization,


If spareCapacity is always 1.0 - Utilization then do we need spareCapacity?

Not redundant today, but it will be after V1 is removed. Here's the current state:

V1 path (enrichDecisionsFromReplicaMetrics in engine.go:764-773): SpareCapacity is set from AvgSpareKvCapacity which is threshold-relative (kvCacheThreshold - avgKvUsage). That is not equal to 1.0 - Utilization. The two semantics differ: V1 SpareCapacity=0 means "at threshold" (e.g., 80% used), while V1 Utilization=0 would mean "nothing used at all."

V2 path (cost_aware_optimizer.go:308): SpareCapacity = 1.0 - vc.Utilization, so yes, it's derivable.

The field doc comment at interfaces/saturation_analyzer.go:187-191 already documents both semantics:

V1: threshold-relative spare KV capacity (AvgSpareKvCapacity).
V2: 1.0 - Utilization (absolute spare).

Once V1 is deprecated and removed, SpareCapacity in V2 will be exactly 1.0 - Utilization and the field (and the wva_spare_capacity metric) becomes redundant. At that point I'd propose:

Either drop the field from VariantDecision and have consumers compute 1 - Utilization in PromQL (wva_spare_capacity → derived from wva_saturation_utilization)

Or keep the metric as a convenience (avoids PromQL boilerplate in dashboards)

For this PR, keeping SpareCapacity preserves current V1 semantics and matches field-level documentation. I'll file a follow-up issue to reevaluate once V1 is removed.
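To make the two SpareCapacity semantics concrete, here is a minimal sketch. The function names are hypothetical, and the V1 form omits any clamping the real AvgSpareKvCapacity computation may apply.

```go
package main

// spareCapacityV1 sketches the threshold-relative V1 semantics: spare
// capacity is the headroom left before the KV cache threshold is reached.
// Assumption: no clamping at zero, unlike what the real code may do.
func spareCapacityV1(kvCacheThreshold, avgKvUsage float64) float64 {
	return kvCacheThreshold - avgKvUsage
}

// spareCapacityV2 sketches the absolute V2 semantics: whatever fraction is
// not utilized counts as spare.
func spareCapacityV2(utilization float64) float64 {
	return 1.0 - utilization
}
```

For a variant sitting exactly at a threshold of 0.75, `spareCapacityV1(0.75, 0.75)` is 0 ("at threshold") while `spareCapacityV2(0.75)` is 0.25, which is why the two values cannot be derived from each other while V1 exists.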

…abel

- Rename KvCacheTokensTotal -> KvCacheTokensCapacity on VariantDecision, Actuator.EmitSaturationMetrics, MetricsEmitter.EmitSaturationMetrics, DeleteSaturationMetrics, and the Prometheus metric name itself (wva_kv_cache_tokens_total -> wva_kv_cache_tokens_capacity). "Total" was confusing: the metric is a gauge of capacity, not a cumulative counter.
- Replace the analyzer_version="v1"/"v2" label on wva_required_capacity with a unit="binary"/"continuous" label. The label's purpose is to describe the unit of the metric value (a boolean scale-up signal in V1, a continuous token demand in V2), not the code path that produced it. "binary"/"continuous" remains meaningful after V1 is deprecated, whereas "v1"/"v2" becomes vestigial. Rename VariantDecision.AnalyzerVersion -> RequiredCapacityUnit. Rename constants.LabelAnalyzerVersion -> LabelUnit. Rename constants.AnalyzerVersionV1/V2 -> UnitBinary/UnitContinuous.
- Expand help strings on wva_saturation_utilization, wva_spare_capacity, wva_kv_cache_tokens_used, and wva_kv_cache_tokens_capacity to specify what is being measured (KV-cache) and how V1 vs V2 paths differ.
- Use constants.LabelUnit, UnitBinary, UnitContinuous in the wva_required_capacity help string via fmt.Sprintf, for consistency with how labels are referenced elsewhere.
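Building the help string from the label constants could look roughly like the sketch below. The constant values follow the unit="binary"/"continuous" label described above; the exact help wording and the `requiredCapacityHelp` function are assumptions, not the text in metrics.go.

```go
package main

import "fmt"

// Label name and values for the unit label; in the PR these live in the
// constants package. The values match the label described in the commit.
const (
	LabelUnit      = "unit"
	UnitBinary     = "binary"
	UnitContinuous = "continuous"
)

// requiredCapacityHelp is a hypothetical helper that formats the help
// string from the constants, so renaming a label or value only needs a
// change in one place.
func requiredCapacityHelp() string {
	return fmt.Sprintf(
		"Model-level required capacity; >0 indicates scale-up needed. "+
			"Use the %s label to distinguish units (%s: 0 or 1 scale-up signal, %s: token demand).",
		LabelUnit, UnitBinary, UnitContinuous,
	)
}
```

The resulting string would then be passed as the Help field of the gauge options for wva_required_capacity.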

ev-shindin · 2026-04-16T07:10:10Z

/ok-to-test

github-actions · 2026-04-16T07:10:23Z

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

github-actions · 2026-04-16T07:10:30Z

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

ev-shindin self-assigned this Mar 25, 2026

ev-shindin linked an issue Mar 25, 2026 that may be closed by this pull request

Saturation and Capacity Metrics #912

ev-shindin requested a review from lionelvillard March 25, 2026 12:08

ev-shindin force-pushed the feat/saturation-capacity-metrics branch 2 times, most recently from 9f69455 to b63fd32 Compare March 25, 2026 16:01

ev-shindin requested a review from mamy-CS March 30, 2026 15:29

ev-shindin requested a review from shuynh2017 April 14, 2026 11:54

ev-shindin added 3 commits April 15, 2026 09:47

Fix perfsprint lint: use errors.New for static message

0707e1c

ev-shindin force-pushed the feat/saturation-capacity-metrics branch from 499a0b6 to 0707e1c Compare April 15, 2026 06:49

shuynh2017 reviewed Apr 15, 2026

ev-shindin force-pushed the feat/saturation-capacity-metrics branch 3 times, most recently from 1297030 to 33fb956 Compare April 15, 2026 13:54

ev-shindin force-pushed the feat/saturation-capacity-metrics branch from 33fb956 to cbefae0 Compare April 15, 2026 14:02

ev-shindin requested a review from shuynh2017 April 15, 2026 15:46
