
Add optimization loop performance metrics#981

Open
jia-gao wants to merge 4 commits into llm-d:main from jia-gao:feat/optimization-loop-metrics-914

Conversation

@jia-gao
Contributor

@jia-gao jia-gao commented Apr 5, 2026

Summary

Add timing and throughput metrics for the optimization loop. Closes #914.

New Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `wva_optimization_duration_seconds` | Histogram | `status` | Duration of each optimization cycle |
| `wva_models_processed_total` | Counter | — | Total models processed across cycles |

Histogram buckets: {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} seconds

Status labels: success, error

Implementation

  • internal/constants/metrics.go — New metric name and label constants
  • internal/metrics/metrics.go — Register histogram + counter in InitMetrics(), add ObserveOptimizationDuration() and IncrModelsProcessed() methods on MetricsEmitter
  • internal/engines/saturation/engine.go — Instrument optimize() with a deferred timing observation that captures duration and status (success/error), and tracks models processed per cycle
  • internal/metrics/optimization_metrics_test.go — Unit tests

Tests

=== RUN   TestObserveOptimizationDuration
--- PASS: TestObserveOptimizationDuration (0.00s)
=== RUN   TestIncrModelsProcessed
--- PASS: TestIncrModelsProcessed (0.00s)
=== RUN   TestObserveOptimizationDuration_NilSafety
--- PASS: TestObserveOptimizationDuration_NilSafety (0.00s)
PASS

Acceptance Criteria (from issue)

  • Duration histogram emitted per optimization cycle
  • Models-processed counter incremented correctly
  • Unit tests verify histogram observation and counter increment

Add two Prometheus metrics to track optimization loop performance:
- wva_optimization_duration_seconds: histogram tracking duration of each
  optimization cycle, labeled by status (success/error)
- wva_models_processed_total: counter tracking total models processed

Implementation:
- internal/constants/metrics.go: add metric name and label constants
- internal/metrics/metrics.go: register histogram and counter, add
  ObserveOptimizationDuration and IncrModelsProcessed helper methods
- internal/engines/saturation/engine.go: instrument optimize() with
  deferred timing observation and model count tracking
- internal/metrics/optimization_metrics_test.go: unit tests verifying
  histogram observation, counter increment, and nil safety

Closes llm-d#914
@jia-gao
Contributor Author

jia-gao commented Apr 5, 2026

Note on the status label: The issue spec mentions partial as a possible status value. The current implementation only emits success or error because optimize() either completes fully or returns an error — there's no partial completion path today.

If a partial status is needed in the future (e.g., when some model groups succeed but others fail within a single cycle), it can be added without breaking the histogram schema since it's just a new label value.

@ev-shindin
Collaborator

/ok-to-test

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

Collaborator

@ev-shindin ev-shindin left a comment


Please address review comments

Changes based on review by @ev-shindin:

1. Use named return value (retErr) instead of manual optimizeErr
   assignments — idiomatic Go, captures all return paths automatically
2. Add controller_instance label to optimization duration histogram
   for HA deployment compatibility
3. Replace MetricsEmitter methods with package-level functions
   (ObserveOptimizationDuration, SetModelsProcessed) — avoids adding
   a second emitter instance on the Engine struct
4. Change models-processed from counter to gauge — directly shows
   models in last cycle, more useful for dashboards and alerting
5. Move modelsProcessed assignment to right after modelGroups is
   computed — accurately reflects models entering the pipeline
   regardless of apply outcome
6. Remove "partial" from status label documentation — not currently
   emitted, can be added when a partial completion path exists
@jia-gao jia-gao requested a review from ev-shindin April 6, 2026 17:04
@jia-gao
Contributor Author

jia-gao commented Apr 12, 2026

/ok-to-test

Collaborator

@ev-shindin ev-shindin left a comment


Good progress. Please address review comments.

@ev-shindin ev-shindin self-requested a review April 13, 2026 09:11
Collaborator

@ev-shindin ev-shindin left a comment


Please address review comments

- Rename wva_models_processed_total to wva_models_processed (gauge,
  not counter — _total suffix violates Prometheus convention).
- Convert modelsProcessedGauge to GaugeVec with conditional
  controller_instance label so HA deployments don't collide.
- Always set modelsProcessedGauge in defer so the value reflects the
  current cycle even when optimize() returns early with no active VAs.
- Bump test file copyright to 2026.
@jia-gao
Contributor Author

jia-gao commented Apr 13, 2026

> Please address review comments

thanks for the comments! fixed them

@ev-shindin
Collaborator

/lgtm please fix linter issues

github-actions bot added the `lgtm` label (Looks good to me, indicates that a PR is ready to be merged) on Apr 14, 2026
applySaturationDecisions logs and handles all errors inline, never
returning a non-nil error. Remove the error return and simplify the
caller in optimize(). Fixes the unparam lint failure.

Signed-off-by: Jiazhou Gao <gjz140103@gmail.com>
@jia-gao jia-gao force-pushed the feat/optimization-loop-metrics-914 branch from bbdb5f9 to 7529833 on April 14, 2026 22:16
@jia-gao
Contributor Author

jia-gao commented Apr 15, 2026

/ok-to-test

1 similar comment
@ev-shindin
Collaborator

/ok-to-test

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 39 | 11 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

