
Feat: Add saturation and capacity Prometheus metrics for WVA observability #933

Open
ev-shindin wants to merge 4 commits into llm-d:main from ev-shindin:feat/saturation-capacity-metrics

Conversation

@ev-shindin
Collaborator

Summary

  • Add 5 new Prometheus gauges (wva_saturation_utilization, wva_spare_capacity, wva_required_capacity, wva_kv_cache_tokens_used, wva_kv_cache_tokens_total) emitted per variant after each scaling decision
  • Enrich VariantDecision with Utilization, SpareCapacity, RequiredCapacity, KvCacheTokensUsed, and KvCacheTokensTotal fields for both V1 and V2 engine paths
  • Emit metrics via EmitSaturationMetrics in applySaturationDecisions for every variant with a scaling decision

Details

New metrics (emitted per reconciliation cycle per variant):

| Metric | Labels | Description |
| --- | --- | --- |
| wva_saturation_utilization | variant_name, namespace, accelerator_type | Utilization ratio (0.0-1.0) |
| wva_spare_capacity | variant_name, namespace, accelerator_type | Spare capacity (0.0-1.0) |
| wva_required_capacity | variant_name, namespace | Model-level required capacity (>0 = scale-up needed) |
| wva_kv_cache_tokens_used | variant_name, namespace | Sum of KV cache tokens in use |
| wva_kv_cache_tokens_total | variant_name, namespace | Sum of KV cache token capacity |
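For orientation, here is a rough sketch of how one sample of these gauges would look in the Prometheus text exposition format. The helper `gaugeLine` and every label value are hypothetical and only illustrate the label sets listed above; `%q` merely approximates Prometheus label escaping, and the PR itself registers the gauges through the Prometheus Go client rather than formatting text by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// gaugeLine renders one sample as `name{k="v",...} value`.
// Hypothetical helper for illustration only; not part of the PR.
func gaugeLine(name string, labels [][2]string, value float64) string {
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		// %q quotes the value; an approximation of Prometheus label escaping.
		parts = append(parts, fmt.Sprintf("%s=%q", l[0], l[1]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(parts, ","), value)
}

func main() {
	// Label values below are made up for the example.
	perVariant := [][2]string{
		{"variant_name", "example-variant"},
		{"namespace", "example-ns"},
		{"accelerator_type", "example-gpu"},
	}
	perModel := perVariant[:2] // wva_required_capacity carries no accelerator_type

	fmt.Println(gaugeLine("wva_saturation_utilization", perVariant, 0.5))
	fmt.Println(gaugeLine("wva_spare_capacity", perVariant, 0.25))
	fmt.Println(gaugeLine("wva_required_capacity", perModel, 1))
}
```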

V1 path: enrichDecisionsFromReplicaMetrics aggregates per-pod ReplicaMetrics by variant to compute utilization (avg KV cache usage) and KV token sums. RequiredCapacity is 1.0 if shouldScaleUp, else 0.0.
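The V1 aggregation described above can be sketched roughly as follows. This is not the code from engine.go: the struct layouts, field names, and the `summarizeV1` function are assumptions made for illustration, and the sketch assumes all pods passed in belong to a single model so that one `shouldScaleUp` flag applies to every variant.

```go
package main

import "fmt"

// ReplicaMetrics is a hypothetical, trimmed-down per-pod sample; the real
// struct in the PR may carry different fields.
type ReplicaMetrics struct {
	VariantName      string
	KvCacheUsage     float64 // fraction of KV cache in use, 0.0-1.0
	KvTokensUsed     int64
	KvTokensCapacity int64
}

// VariantSummary holds the per-variant values the gauges would report.
type VariantSummary struct {
	Utilization      float64
	KvTokensUsed     int64
	KvTokensTotal    int64
	RequiredCapacity float64
}

// summarizeV1 averages KV cache usage and sums token counts per variant.
// RequiredCapacity is the binary V1 signal: 1.0 when scale-up is needed.
func summarizeV1(pods []ReplicaMetrics, shouldScaleUp bool) map[string]VariantSummary {
	sums := make(map[string]VariantSummary)
	counts := make(map[string]int)
	for _, p := range pods {
		s := sums[p.VariantName]
		s.Utilization += p.KvCacheUsage // running sum, divided by the pod count below
		s.KvTokensUsed += p.KvTokensUsed
		s.KvTokensTotal += p.KvTokensCapacity
		sums[p.VariantName] = s
		counts[p.VariantName]++
	}
	required := 0.0
	if shouldScaleUp {
		required = 1.0
	}
	for name, s := range sums {
		s.Utilization /= float64(counts[name])
		s.RequiredCapacity = required
		sums[name] = s
	}
	return sums
}

func main() {
	pods := []ReplicaMetrics{
		{VariantName: "a", KvCacheUsage: 0.5, KvTokensUsed: 500, KvTokensCapacity: 1000},
		{VariantName: "a", KvCacheUsage: 0.75, KvTokensUsed: 750, KvTokensCapacity: 1000},
	}
	fmt.Printf("%+v\n", summarizeV1(pods, true)["a"])
}
```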

V2 path: Utilization, SpareCapacity, and RequiredCapacity are populated from AnalyzerResult in buildDecisionsWithOptimizer. enrichDecisionsWithKvTokenData adds KV token sums from ReplicaMetrics.

Files changed:

  • internal/constants/metrics.go: 5 new metric constant definitions
  • internal/interfaces/saturation_analyzer.go: 4 new fields on VariantDecision
  • internal/metrics/metrics.go: metric registration and EmitSaturationMetrics()
  • internal/metrics/metrics_test.go: 236-line test suite (4 test cases)
  • internal/engines/saturation/engine.go: enrichDecisionsFromReplicaMetrics, enrichDecisionsWithKvTokenData, emission call in applySaturationDecisions
  • internal/engines/pipeline/cost_aware_optimizer.go: populate Utilization, SpareCapacity, RequiredCapacity in buildDecisionsWithOptimizer
  • internal/actuator/actuator.go: EmitSaturationMetrics wrapper method

@ev-shindin ev-shindin self-assigned this Mar 25, 2026
@ev-shindin ev-shindin linked an issue Mar 25, 2026 that may be closed by this pull request
3 tasks
@ev-shindin ev-shindin force-pushed the feat/saturation-capacity-metrics branch 2 times, most recently from 9f69455 to b63fd32 Compare March 25, 2026 16:01
@ev-shindin
Collaborator Author

/trigger-e2e-full

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /trigger-e2e-full

View the Kind E2E workflow run

@ev-shindin
Collaborator Author

/ok-to-test

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 23 | 27 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@ev-shindin ev-shindin requested a review from mamy-CS March 30, 2026 15:29
@ev-shindin ev-shindin requested a review from shuynh2017 April 14, 2026 11:54
@ev-shindin
Collaborator Author

/ok-to-test

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 42 | 8 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@ev-shindin
Collaborator Author

/ok-to-test

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@lionelvillard
Collaborator

@ev-shindin the OpenShift tests are failing. Can you PTAL?

@lionelvillard
Collaborator

@shuynh2017 can you PTAL?

@ev-shindin
Collaborator Author

/ok-to-test

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

Add 5 new Prometheus gauge metrics exposing saturation analysis outputs
that drive scaling decisions, giving operators visibility into why
scaling happens:

- wva_saturation_utilization: per-variant utilization ratio (0.0-1.0)
- wva_spare_capacity: per-variant spare capacity (0.0-1.0)
- wva_required_capacity: model-level required capacity (>0 = scale-up)
- wva_kv_cache_tokens_used: KV cache tokens in use per variant
- wva_kv_cache_tokens_total: KV cache token capacity per variant

Metrics are populated in both V1 (percentage-based) and V2
(token-based) engine paths and emitted during applySaturationDecisions.
…ete hook

- Add analyzer_version label to wva_required_capacity to disambiguate V1 (binary
  0/1) from V2 (continuous token demand) units. Add AnalyzerVersion field to
  VariantDecision; set "v1" in enrichDecisionsFromReplicaMetrics and "v2" in
  enrichDecisionsWithKvTokenData.
- Add AnalyzerVersionV1/V2 constants and LabelAnalyzerVersion constant.
- Key V2 KV-token aggregation by (modelID, variantName) instead of just
  variantName; variant names can collide across models in the same cycle.
- Add MetricsEmitter.DeleteSaturationMetrics() so the controller delete handler
  can remove stale time series when a VariantAutoscaling is deleted.
- Update tests: cover V1/V2 label distinction, Delete behavior, and analyzer
  version on controller_instance test.
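The collision fix in the third bullet comes down to using a composite map key. A minimal sketch, assuming hypothetical type and field names (`variantKey`, `podTokens`, `tokenTotals`, `sumTokens`) rather than the PR's actual ones:

```go
package main

import "fmt"

// variantKey identifies a variant within a model. Keying by VariantName
// alone would merge same-named variants that belong to different models.
type variantKey struct {
	ModelID     string
	VariantName string
}

// podTokens is a hypothetical per-pod KV token sample.
type podTokens struct {
	ModelID     string
	VariantName string
	Used        int64
	Capacity    int64
}

// tokenTotals accumulates KV token sums for one (model, variant) pair.
type tokenTotals struct {
	Used     int64
	Capacity int64
}

// sumTokens aggregates per-pod token counts by (ModelID, VariantName).
func sumTokens(pods []podTokens) map[variantKey]tokenTotals {
	out := make(map[variantKey]tokenTotals)
	for _, p := range pods {
		k := variantKey{ModelID: p.ModelID, VariantName: p.VariantName}
		t := out[k]
		t.Used += p.Used
		t.Capacity += p.Capacity
		out[k] = t
	}
	return out
}

func main() {
	totals := sumTokens([]podTokens{
		{ModelID: "model-a", VariantName: "default", Used: 100, Capacity: 1000},
		{ModelID: "model-b", VariantName: "default", Used: 300, Capacity: 2000},
	})
	// Two entries: the shared variant name no longer collides across models.
	fmt.Println(len(totals))
}
```

Go struct types whose fields are all comparable can be used directly as map keys, so no string concatenation or separator convention is needed.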
@ev-shindin ev-shindin force-pushed the feat/saturation-capacity-metrics branch from 499a0b6 to 0707e1c Compare April 15, 2026 06:49
@ev-shindin
Collaborator Author

/ok-to-test

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

Comment thread internal/actuator/actuator.go Outdated
decision.SpareCapacity,
decision.RequiredCapacity,
decision.KvCacheTokensUsed,
decision.KvCacheTokensTotal,
Collaborator

@shuynh2017 shuynh2017 Apr 15, 2026

For me, it's clearer if we rename this to KvCacheTokensCapacity, which would also align with SpareCapacity and RequiredCapacity. "Capacity" is already used in the help message: "Total KV cache token capacity across all replicas of a variant".

Comment thread internal/metrics/metrics.go Outdated
saturationUtilization = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: constants.WVASaturationUtilization,
Help: "Per-variant utilization ratio (0.0-1.0) from saturation analysis",
Collaborator

Do we want to be more specific in the help message, e.g. CPU, GPU, or KV cache utilization?

Comment thread internal/metrics/metrics.go Outdated
spareCapacity = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: constants.WVASpareCapacity,
Help: "Per-variant spare capacity (0.0-1.0) from saturation analysis",
Collaborator

same comment ^^

Comment thread internal/metrics/metrics.go Outdated
requiredCapacity = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: constants.WVARequiredCapacity,
Help: "Model-level required capacity; >0 indicates scale-up needed. Use the analyzer_version label to distinguish units (V1: binary 0/1, V2: continuous token demand).",
Collaborator

can we format this string using constants.LabelAnalyzerVersion?

Comment thread internal/constants/metrics.go Outdated
// Analyzer version label values used in saturation metrics.
const (
AnalyzerVersionV1 = "v1"
AnalyzerVersionV2 = "v2"
Collaborator

Users have to remember what "v1" and "v2" mean, and translate "v1" to a binary unit and "v2" to a continuous one. Maybe a label such as "unit" or "analyzer_unit" with values "binary"/"continuous" would be more direct?

Collaborator

In addition, we already use const SaturationAnalyzerName = "saturation" in the code and in the ConfigMap to distinguish v1 and v2. Should we reuse it?

Collaborator

Actually, I see a few more differences between v1 and v2 below, so feel free to ignore these comments.

MinReplicas: state.MinReplicas,
MaxReplicas: state.MaxReplicas,
Utilization: vc.Utilization,
SpareCapacity: 1.0 - vc.Utilization,
Collaborator

If spareCapacity is always 1.0 - Utilization then do we need spareCapacity?

Collaborator Author

Not redundant today, but it will be after V1 is removed. Here's the current state:

  • V1 path (enrichDecisionsFromReplicaMetrics in engine.go:764-773): SpareCapacity is set from AvgSpareKvCapacity which is threshold-relative (kvCacheThreshold - avgKvUsage). That is not equal to 1.0 - Utilization. The two semantics differ: V1 SpareCapacity=0 means "at threshold" (e.g., 80% used), while V1 Utilization=0 would mean "nothing used at all."
  • V2 path (cost_aware_optimizer.go:308): SpareCapacity = 1.0 - vc.Utilization, so yes, it's derivable.

The field doc comment at interfaces/saturation_analyzer.go:187-191 already documents both semantics:

V1: threshold-relative spare KV capacity (AvgSpareKvCapacity).
V2: 1.0 - Utilization (absolute spare).
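To make the difference between the two semantics concrete, here is a small sketch. The function names `spareV1` and `spareV2` are hypothetical, the formulas are taken from the description above, and any clamping the real code may apply is not modeled.

```go
package main

import "fmt"

// spareV1 models the threshold-relative V1 semantics described above:
// kvCacheThreshold - avgKvUsage. Zero means "at threshold", not "full".
func spareV1(kvCacheThreshold, avgKvUsage float64) float64 {
	return kvCacheThreshold - avgKvUsage
}

// spareV2 models the absolute V2 semantics: 1.0 - Utilization.
func spareV2(utilization float64) float64 {
	return 1.0 - utilization
}

func main() {
	// With a 0.8 threshold and 80% average KV usage, V1 reports no spare
	// capacity, while the V2 formula still reports roughly 0.2 of headroom.
	fmt.Printf("V1 spare: %.2f\n", spareV1(0.8, 0.8))
	fmt.Printf("V2 spare: %.2f\n", spareV2(0.8))
}
```

The two values agree only when the threshold is 1.0, which is why the metric is not derivable from utilization alone while the V1 path exists.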

Once V1 is deprecated and removed, SpareCapacity in V2 will be exactly 1.0 - Utilization and the field (and the wva_spare_capacity metric) becomes redundant. At that point I'd propose:

  • Either drop the field from VariantDecision and have consumers compute 1 - Utilization in PromQL (wva_spare_capacity → derived from wva_saturation_utilization)
  • Or keep the metric as a convenience (avoids PromQL boilerplate in dashboards)

For this PR, keeping SpareCapacity preserves current V1 semantics and matches field-level documentation. I'll file a follow-up issue to reevaluate once V1 is removed.

@ev-shindin ev-shindin force-pushed the feat/saturation-capacity-metrics branch 3 times, most recently from 1297030 to 33fb956 Compare April 15, 2026 13:54
…abel

- Rename KvCacheTokensTotal -> KvCacheTokensCapacity on VariantDecision,
  Actuator.EmitSaturationMetrics, MetricsEmitter.EmitSaturationMetrics,
  DeleteSaturationMetrics, and the Prometheus metric name itself
  (wva_kv_cache_tokens_total -> wva_kv_cache_tokens_capacity). "Total" was
  confusing — the metric is a gauge of capacity, not a cumulative counter.

- Replace the analyzer_version="v1"/"v2" label on wva_required_capacity
  with a unit="binary"/"continuous" label. The label's purpose is to
  describe the unit of the metric value (a boolean scale-up signal in V1,
  a continuous token demand in V2), not the code path that produced it.
  "binary"/"continuous" remains meaningful after V1 is deprecated, whereas
  "v1"/"v2" becomes vestigial.

  Rename VariantDecision.AnalyzerVersion -> RequiredCapacityUnit.
  Rename constants.LabelAnalyzerVersion -> LabelUnit.
  Rename constants.AnalyzerVersionV1/V2 -> UnitBinary/UnitContinuous.

- Expand help strings on wva_saturation_utilization, wva_spare_capacity,
  wva_kv_cache_tokens_used, and wva_kv_cache_tokens_capacity to specify
  what is being measured (KV-cache) and how V1 vs V2 paths differ.

- Use constants.LabelUnit, UnitBinary, UnitContinuous in the
  wva_required_capacity help string via fmt.Sprintf, for consistency with
  how labels are referenced elsewhere.
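The last bullet can be sketched as below. The constant names and the values "unit", "binary", and "continuous" follow this commit message; the exact help wording and the `requiredCapacityHelp` function are assumptions, not the text in metrics.go.

```go
package main

import "fmt"

// Label name and values as described in the commit message. In the PR they
// live in the constants package; they are redeclared here so the sketch is
// self-contained.
const (
	LabelUnit      = "unit"
	UnitBinary     = "binary"
	UnitContinuous = "continuous"
)

// requiredCapacityHelp builds the help string from the constants so that a
// future rename of the label or its values cannot drift from the help text.
// The wording is illustrative only.
func requiredCapacityHelp() string {
	return fmt.Sprintf(
		"Model-level required capacity; >0 indicates scale-up needed. "+
			"Use the %s label to distinguish units (%s: 0/1 signal, %s: token demand).",
		LabelUnit, UnitBinary, UnitContinuous,
	)
}

func main() {
	fmt.Println(requiredCapacityHelp())
}
```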
@ev-shindin ev-shindin force-pushed the feat/saturation-capacity-metrics branch from 33fb956 to cbefae0 Compare April 15, 2026 14:02
@ev-shindin ev-shindin requested a review from shuynh2017 April 15, 2026 15:46
@ev-shindin
Collaborator Author

/ok-to-test

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

Development

Successfully merging this pull request may close these issues.

Saturation and Capacity Metrics

3 participants