
Add optimization loop performance metrics#981

Open
jia-gao wants to merge 4 commits into llm-d:main from jia-gao:feat/optimization-loop-metrics-914

Conversation

@jia-gao
Contributor

@jia-gao jia-gao commented Apr 5, 2026

Summary

Add timing and throughput metrics for the optimization loop. Closes #914.

New Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `wva_optimization_duration_seconds` | Histogram | `status` | Duration of each optimization cycle |
| `wva_models_processed_total` | Counter | — | Total models processed across cycles |

Histogram buckets: {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} seconds

Status labels: success, error

Implementation

  • internal/constants/metrics.go — New metric name and label constants
  • internal/metrics/metrics.go — Register histogram + counter in InitMetrics(), add ObserveOptimizationDuration() and IncrModelsProcessed() methods on MetricsEmitter
  • internal/engines/saturation/engine.go — Instrument optimize() with a deferred timing observation that captures duration and status (success/error), and tracks models processed per cycle
  • internal/metrics/optimization_metrics_test.go — Unit tests

Tests

=== RUN   TestObserveOptimizationDuration
--- PASS: TestObserveOptimizationDuration (0.00s)
=== RUN   TestIncrModelsProcessed
--- PASS: TestIncrModelsProcessed (0.00s)
=== RUN   TestObserveOptimizationDuration_NilSafety
--- PASS: TestObserveOptimizationDuration_NilSafety (0.00s)
PASS

Acceptance Criteria (from issue)

  • Duration histogram emitted per optimization cycle
  • Models-processed counter incremented correctly
  • Unit tests verify histogram observation and counter increment

Add two Prometheus metrics to track optimization loop performance:
- wva_optimization_duration_seconds: histogram tracking duration of each
  optimization cycle, labeled by status (success/error)
- wva_models_processed_total: counter tracking total models processed

Implementation:
- internal/constants/metrics.go: add metric name and label constants
- internal/metrics/metrics.go: register histogram and counter, add
  ObserveOptimizationDuration and IncrModelsProcessed helper methods
- internal/engines/saturation/engine.go: instrument optimize() with
  deferred timing observation and model count tracking
- internal/metrics/optimization_metrics_test.go: unit tests verifying
  histogram observation, counter increment, and nil safety

Closes llm-d#914
@jia-gao
Contributor Author

jia-gao commented Apr 5, 2026

Note on the status label: The issue spec mentions partial as a possible status value. The current implementation only emits success or error because optimize() either completes fully or returns an error — there's no partial completion path today.

If a partial status is needed in the future (e.g., when some model groups succeed but others fail within a single cycle), it can be added without breaking the histogram schema since it's just a new label value.

@ev-shindin
Collaborator

/ok-to-test

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

Collaborator

@ev-shindin ev-shindin left a comment


Please address review comments

Changes based on review by @ev-shindin:

1. Use named return value (retErr) instead of manual optimizeErr
   assignments — idiomatic Go, captures all return paths automatically
2. Add controller_instance label to optimization duration histogram
   for HA deployment compatibility
3. Replace MetricsEmitter methods with package-level functions
   (ObserveOptimizationDuration, SetModelsProcessed) — avoids adding
   a second emitter instance on the Engine struct
4. Change models-processed from counter to gauge — directly shows
   models in last cycle, more useful for dashboards and alerting
5. Move modelsProcessed assignment to right after modelGroups is
   computed — accurately reflects models entering the pipeline
   regardless of apply outcome
6. Remove "partial" from status label documentation — not currently
   emitted, can be added when a partial completion path exists
@jia-gao jia-gao requested a review from ev-shindin April 6, 2026 17:04
@jia-gao
Contributor Author

jia-gao commented Apr 12, 2026

/ok-to-test

Collaborator

@ev-shindin ev-shindin left a comment


Good progress. Please address review comments.

@ev-shindin ev-shindin self-requested a review April 13, 2026 09:11
Collaborator

@ev-shindin ev-shindin left a comment


Please address review comments

- Rename wva_models_processed_total to wva_models_processed (gauge,
  not counter — _total suffix violates Prometheus convention).
- Convert modelsProcessedGauge to GaugeVec with conditional
  controller_instance label so HA deployments don't collide.
- Always set modelsProcessedGauge in defer so the value reflects the
  current cycle even when optimize() returns early with no active VAs.
- Bump test file copyright to 2026.
@jia-gao
Contributor Author

jia-gao commented Apr 13, 2026

> Please address review comments

thanks for the comments! fixed them

@ev-shindin
Collaborator

/lgtm please fix linter issues

github-actions bot added the `lgtm` label (Looks good to me, indicates that a PR is ready to be merged) on Apr 14, 2026
applySaturationDecisions logs and handles all errors inline, never
returning a non-nil error. Remove the error return and simplify the
caller in optimize(). Fixes the unparam lint failure.

Signed-off-by: Jiazhou Gao <gjz140103@gmail.com>
@jia-gao jia-gao force-pushed the feat/optimization-loop-metrics-914 branch from bbdb5f9 to 7529833 on April 14, 2026 22:16
@jia-gao
Contributor Author

jia-gao commented Apr 15, 2026

/ok-to-test

1 similar comment
@ev-shindin
Collaborator

/ok-to-test

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 39 | 11 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

