OpenTelemetry integration: MVP #782

Open
shuynh2017 wants to merge 5 commits into llm-d:main from shuynh2017:opentelemetry-mvp
Conversation

@shuynh2017 (Collaborator) commented Feb 25, 2026

This PR introduces tracing with OpenTelemetry (OTel) for WVA. Users can configure WVA to export traces (metrics and logs may follow in the future) to a backend (like Jaeger, Prometheus, Grafana Tempo) via OTLP (the OpenTelemetry Protocol).

At this time, only two traces are emitted; we can add more as needed. This doesn't yet support client TLS to the OTel collector; I will address that next. Here are some screenshots using Jaeger as the collector to view traces of scaling decisions:
[screenshots: Jaeger UI showing scaling-decision traces]
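For readers unfamiliar with OTLP export configuration: OTel SDKs are conventionally configured through a set of standard environment variables. A minimal sketch follows; the endpoint and service name are illustrative placeholders, and the exact configuration surface WVA exposes is not confirmed by this PR.

```shell
# Standard OpenTelemetry SDK environment variables (OTLP over gRPC).
# Values are illustrative placeholders, not taken from this PR.
export OTEL_SERVICE_NAME="wva"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://jaeger-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
```

Jaeger's collector accepts OTLP natively on port 4317 (gRPC), which is why it works as a drop-in trace backend here.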

@shuynh2017 (Collaborator, Author) commented Feb 25, 2026

@Gregory-Pereira, @lionelvillard please help review. Thanks.

@lionelvillard (Collaborator)

@shuynh2017 @Gregory-Pereira do you have an issue/design doc for this feature? In particular, have you compared tracing with CRD status and metrics?

@shuynh2017 (Collaborator, Author)

@lionelvillard, this PR introduces tracing using OTel, which in addition to tracing also provides metrics and logging. With OTel, users get a consolidated view of telemetry across all components in a system (e.g. llm-d, WVA, etc.). For each component, users can filter telemetry by label, for example all the scaling decisions for a particular model or variant. For each decision, users can drill down to see the decision at different points in the code, as well as the final metric being emitted. Tracing/spans are the only use for now; in the future we may also want to use metrics and logging.

shuynh2017 and others added 2 commits February 25, 2026 11:16
Signed-off-by: Sum Huynh <31661254+shuynh2017@users.noreply.github.com>
@lionelvillard (Collaborator)

I'm well aware of what OTel is. I'm trying to understand the value-add provided by tracing. It only makes sense if the scaling engine can be abstractly decomposed into smaller steps, which it can. These steps obviously need to be documented.

@Gregory-Pereira (Member) commented Feb 25, 2026

> I'm well aware of what OTel is. I'm trying to understand the value-add provided by tracing. It only makes sense if the scaling engine can be abstractly decomposed into smaller steps, which it can. These steps obviously need to be documented.

From my perspective, the value would be rolling up the metrics the autoscaler scales on into related buckets. As we introduce more inputs to scale on and our scaling logic in WVA gets more complicated, this is going to be increasingly difficult and important to pin down. When I envisioned this, I have to admit part of the value for me would be defining places within the WVA logic where we capture spans, even if each one isn't doing much. Additionally, for things like scaling based on queuing or managing the queue signals, I thought it would be helpful to boil the various queuing places down into a single point for clarity (EPP queue / post-request catch point, vLLM request queue, vLLM running requests), the goal being to identify and provide posterity around "why", i.e. which factor(s) led to the scaling decision.

Please let me know if I am totally off base on this. I can definitely understand that the request was vague and suggested at the last minute without a formal design doc, so if we need to go back to the drawing board we can. But another reason we wanted this was to tie in with the project's global tracing theme, which we're pushing in the v0.5.1 release.

@mamy-CS (Collaborator) commented Feb 26, 2026

e2e and lint are failing, please check, @shuynh2017

@asm582 (Collaborator) commented Feb 26, 2026

@Gregory-Pereira Thanks, tracing will help. We need to define how it will be consumed. Currently, the request scheduler and WVA operate at different granularities, and on top of that, WVA performs global optimization. We need to document how to scale correlation across the e2e system with OTel.

@asm582 asm582 added this to the v0.6.0 milestone Feb 26, 2026
@Gregory-Pereira (Member)

Understood, let's hold off on this until the next release, and we can revisit the problem.

@asm582 (Collaborator) commented Mar 5, 2026

Removing this issue, as it is not a priority for v0.6.

@asm582 asm582 removed this from the v0.6.0 milestone Mar 5, 2026
@github-actions (Contributor)

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

@shuynh2017 (Collaborator, Author)

keep it open
