Add optimization loop performance metrics #981
Conversation
Add two Prometheus metrics to track optimization loop performance:

- wva_optimization_duration_seconds: histogram tracking the duration of each optimization cycle, labeled by status (success/error)
- wva_models_processed_total: counter tracking total models processed

Implementation:

- internal/constants/metrics.go: add metric name and label constants
- internal/metrics/metrics.go: register histogram and counter, add ObserveOptimizationDuration and IncrModelsProcessed helper methods
- internal/engines/saturation/engine.go: instrument optimize() with deferred timing observation and model count tracking
- internal/metrics/optimization_metrics_test.go: unit tests verifying histogram observation, counter increment, and nil safety

Closes llm-d#914
/ok-to-test

🚀 Kind E2E (full) triggered by

🚀 OpenShift E2E — approve and run (
ev-shindin
left a comment
Please address review comments
Changes based on review by @ev-shindin:

1. Use a named return value (retErr) instead of manual optimizeErr assignments — idiomatic Go, captures all return paths automatically
2. Add a controller_instance label to the optimization duration histogram for HA deployment compatibility
3. Replace MetricsEmitter methods with package-level functions (ObserveOptimizationDuration, SetModelsProcessed) — avoids adding a second emitter instance on the Engine struct
4. Change models-processed from a counter to a gauge — directly shows the models in the last cycle, more useful for dashboards and alerting
5. Move the modelsProcessed assignment to right after modelGroups is computed — accurately reflects models entering the pipeline regardless of apply outcome
6. Remove "partial" from the status label documentation — not currently emitted; can be added when a partial completion path exists
/ok-to-test
ev-shindin
left a comment
Good progress. Please address review comments.
ev-shindin
left a comment
Please address review comments
- Rename wva_models_processed_total to wva_models_processed (gauge, not counter — the _total suffix violates Prometheus naming convention).
- Convert modelsProcessedGauge to a GaugeVec with a conditional controller_instance label so HA deployments don't collide.
- Always set modelsProcessedGauge in the defer so the value reflects the current cycle even when optimize() returns early with no active VAs.
- Bump the test file copyright to 2026.
Thanks for the comments! Fixed them.
/lgtm, please fix linter issues
applySaturationDecisions logs and handles all errors inline, never returning a non-nil error. Remove the error return and simplify the caller in optimize(). Fixes the unparam lint failure. Signed-off-by: Jiazhou Gao <gjz140103@gmail.com>
bbdb5f9 to 7529833
/ok-to-test
/ok-to-test
🚀 Kind E2E (full) triggered by

🚀 OpenShift E2E — approve and run (
GPU Pre-flight Check ✅ GPUs are available for e2e-openshift tests. Proceeding with deployment.
Summary
Add timing and throughput metrics for the optimization loop. Closes #914.
New Metrics
- wva_optimization_duration_seconds — histogram of optimization cycle duration, labeled by status
- wva_models_processed_total — counter of models processed

Histogram buckets: {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} seconds
Status labels: success, error

Implementation

- internal/constants/metrics.go — New metric name and label constants
- internal/metrics/metrics.go — Register histogram + counter in InitMetrics(), add ObserveOptimizationDuration() and IncrModelsProcessed() methods on MetricsEmitter
- internal/engines/saturation/engine.go — Instrument optimize() with a deferred timing observation that captures duration and status (success/error), and tracks models processed per cycle
- internal/metrics/optimization_metrics_test.go — Unit tests

Tests
Acceptance Criteria (from issue)