Conversation
@Gregory-Pereira, @lionelvillard please help review. Thanks.
@shuynh2017 @Gregory-Pereira do you have an issue/design doc for this feature? In particular, have you compared tracing vs. CRD status and metrics?
@lionelvillard, this PR introduces tracing using OTel, which in addition to tracing also provides metrics and logging. With OTel, users get a consolidated view of the telemetry of all components in a system (e.g. llmd, wva, etc.). For each component, users can filter the telemetry by label, for example all the scaling decisions for a particular model or variant. For each decision, users can drill down to see the decision at different spots in the code, as well as the final metric being emitted. Tracing/spans are the only use for now; in the future we may also want to use metrics and logs.
Signed-off-by: Sum Huynh <31661254+shuynh2017@users.noreply.github.com>
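To make the drill-down idea concrete, here is a minimal sketch in plain Go (no OTel dependency; the real PR uses the OTel SDK, and every name here, such as `wva.scaling-decision` and its attributes, is invented for illustration) of how one scaling decision could decompose into labeled child steps, the shape a span tree captures:

```go
package main

import "fmt"

// span is a minimal stand-in for an OTel span: one named step of a
// scaling decision, with key/value attributes and child steps.
type span struct {
	name     string
	attrs    map[string]string
	children []*span
}

// child records a sub-step under s and returns it.
func (s *span) child(name string, attrs map[string]string) *span {
	c := &span{name: name, attrs: attrs}
	s.children = append(s.children, c)
	return c
}

// dump prints the span tree, mimicking the per-decision drill-down a
// backend like Jaeger offers.
func dump(s *span, indent string) {
	fmt.Printf("%s%s %v\n", indent, s.name, s.attrs)
	for _, c := range s.children {
		dump(c, indent+"  ")
	}
}

func main() {
	// One scaling decision, labeled so users can filter by model/variant
	// (names and values here are made up for illustration).
	root := &span{name: "wva.scaling-decision",
		attrs: map[string]string{"model": "llama-3", "variant": "8b"}}
	root.child("collect-metrics", map[string]string{"queue_depth": "42"})
	root.child("compute-replicas", map[string]string{"desired": "3"})
	root.child("emit-decision", map[string]string{"scaled": "true"})
	dump(root, "")
}
```

Filtering by the `model`/`variant` attributes on the root span is what would let a user pull up all decisions for one variant and then expand any single decision into its steps.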
I'm well aware of what OTel is. I'm trying to understand the value-add provided by tracing specifically. It only makes sense if the scaling engine can be abstractly decomposed into smaller steps, which it can. Those steps obviously need to be documented.
From my perspective, the value would be rolling up the metrics the autoscaler scales on into related buckets. As we introduce more inputs to scale on and our scaling logic in WVA gets more complicated, this is going to become increasingly difficult and important to pin down. I have to admit that when I envisioned this, part of the value for me was defining places within the WVA logic where we could capture this from spans, even if it isn't doing much yet. Additionally, for things like scaling based on queuing or managing queue signals, I thought it would be helpful to boil the variety of queuing places (the EPP queue / post-request catch point, the vLLM request queue, and vLLM running requests) down to a single point for clarity, the goal being to identify and record for posterity "why", i.e. which factor(s) led to the scaling decision. Please let me know if I am totally off base on this. I can definitely understand that the request was vague and suggested at the last minute without a formal design doc, so if we need to go back to the drawing board we can. But another reason we wanted this was to tie in with the project's global tracing theme, which we're pushing in the v0.5.1 release.
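The queue-consolidation idea above can be sketched as folding the separate queue observations into one struct and deriving a single "why" string for the decision. This is a hedged illustration only: the field names (`eppQueue`, `vllmWaiting`, `vllmRunning`) and the thresholds are invented, not taken from WVA.

```go
package main

import "fmt"

// queueSignals folds the separate queueing observations mentioned above
// (EPP queue / post-request catch point, vLLM request queue, vLLM
// running requests) into a single value attached to one decision.
type queueSignals struct {
	eppQueue    int // hypothetical: requests waiting at the EPP
	vllmWaiting int // hypothetical: vLLM request queue depth
	vllmRunning int // hypothetical: vLLM in-flight requests
}

// dominantFactor returns the "why" behind a scale-up: whichever signal
// exceeds its (made-up) threshold first, or "none".
func dominantFactor(q queueSignals) string {
	switch {
	case q.eppQueue > 10:
		return "epp-queue"
	case q.vllmWaiting > 5:
		return "vllm-waiting"
	case q.vllmRunning > 8:
		return "vllm-running"
	default:
		return "none"
	}
}

func main() {
	q := queueSignals{eppQueue: 2, vllmWaiting: 7, vllmRunning: 3}
	// In the real system this string would be recorded as a span
	// attribute, so a trace answers "which factor led to the decision".
	fmt.Println("scale-up reason:", dominantFactor(q))
}
```

Recording the result as one span attribute is what gives the posterity trail: anyone reading the trace later sees which signal drove the decision without re-deriving it from raw metrics.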
e2e and lint are failing, please check @shuynh2017
@Gregory-Pereira Thanks, tracing will help. We need to define how it will be consumed. Currently the request scheduler and WVA operate at different granularities, and on top of that WVA performs global optimization. We need to document how scaling correlation works across the e2e system with OTel.
Understood, let's hold off on this until the next release and we can revisit the problem.
Removing this issue, as it is not a priority for v0.6. |
This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the |
keep it open |
This PR introduces tracing with OpenTelemetry (OTel) for WVA. Users can configure WVA to export traces (metrics and logs may follow later) to a backend (such as Jaeger, Prometheus, or Grafana Tempo) via the OTLP (OpenTelemetry Protocol).
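For reference, an OpenTelemetry Collector configuration along these lines would receive OTLP from WVA and forward traces to a Jaeger backend. This is an illustrative minimal sketch, not taken from the PR: the endpoints are placeholders, and your collector and backend addresses will differ.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # WVA exports OTLP traces here

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # placeholder; Jaeger accepts OTLP natively
    tls:
      insecure: true   # illustrative only; see the note on client TLS below

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```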
At this time there are only two traces; we can add more as needed. This doesn't yet support client TLS to the OTel collector; I will address that next. Here are some screenshots using Jaeger as the collector to view traces of scaling decisions:

