Conversation
@Gregory-Pereira, @lionelvillard please help review. Thanks.
@shuynh2017 @Gregory-Pereira do you have an issue/design doc for this feature? In particular, have you compared tracing vs. CRD status and metrics?
@lionelvillard, this PR introduces tracing using OTel, which in addition to tracing also provides metrics and logging. With OTel, users get a consolidated view of the telemetry of all components in a system (e.g. llmd, wva, etc.). For each component, users can filter the telemetry by label, for example all the scaling decisions for a particular model or variant. For each decision, users can drill down to see the decision at different spots in the code, as well as the final metric being emitted. Tracing/spans are the only use for now; in the future we may also want to use metrics and logs.
Signed-off-by: Sum Huynh <31661254+shuynh2017@users.noreply.github.com>
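To make the drill-down idea concrete, here is a minimal sketch in plain Go (no OTel dependency; the real PR uses the OTel SDK, and every name here, such as `wva.scaling-decision` and its attributes, is invented for illustration) of how one scaling decision could decompose into labeled child steps, the shape a span tree captures:

```go
package main

import "fmt"

// span is a minimal stand-in for an OTel span: one named step of a
// scaling decision, with key/value attributes and child steps.
type span struct {
	name     string
	attrs    map[string]string
	children []*span
}

// child records a sub-step under s and returns it.
func (s *span) child(name string, attrs map[string]string) *span {
	c := &span{name: name, attrs: attrs}
	s.children = append(s.children, c)
	return c
}

// dump prints the span tree, mimicking the per-decision drill-down a
// backend like Jaeger offers.
func dump(s *span, indent string) {
	fmt.Printf("%s%s %v\n", indent, s.name, s.attrs)
	for _, c := range s.children {
		dump(c, indent+"  ")
	}
}

func main() {
	// One scaling decision, labeled so users can filter by model/variant
	// (names and values here are made up for illustration).
	root := &span{name: "wva.scaling-decision",
		attrs: map[string]string{"model": "llama-3", "variant": "8b"}}
	root.child("collect-metrics", map[string]string{"queue_depth": "42"})
	root.child("compute-replicas", map[string]string{"desired": "3"})
	root.child("emit-decision", map[string]string{"scaled": "true"})
	dump(root, "")
}
```

Filtering by the `model`/`variant` attributes on the root span is what would let a user pull up all decisions for one variant and then expand any single decision into its steps.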
I'm well aware of what OTel is. I'm trying to understand the value-add provided by tracing specifically. It only makes sense if the scaling engine can be abstractly decomposed into smaller steps, which it can. Those steps obviously need to be documented.
From my perspective, the value would be rolling up the metrics the autoscaler scales on into related buckets. As we introduce more inputs to scale on and our scaling logic in WVA gets more complicated, this is going to become increasingly difficult and important to pin down. I have to admit that when I envisioned this, part of the value for me was defining places within the WVA logic where we could capture this from spans, even if it isn't doing much yet. Additionally, for things like scaling based on queuing or managing queue signals, I thought it would be helpful to boil the variety of queuing places (the EPP queue / post-request catch point, the vLLM request queue, and vLLM running requests) down to a single point for clarity, the goal being to identify and record for posterity "why", i.e. which factor(s) led to the scaling decision. Please let me know if I am totally off base on this. I can definitely understand that the request was vague and suggested at the last minute without a formal design doc, so if we need to go back to the drawing board we can. But another reason we wanted this was to tie in with the project's global tracing theme, which we're pushing in the v0.5.1 release.
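The queue-consolidation idea above can be sketched as folding the separate queue observations into one struct and deriving a single "why" string for the decision. This is a hedged illustration only: the field names (`eppQueue`, `vllmWaiting`, `vllmRunning`) and the thresholds are invented, not taken from WVA.

```go
package main

import "fmt"

// queueSignals folds the separate queueing observations mentioned above
// (EPP queue / post-request catch point, vLLM request queue, vLLM
// running requests) into a single value attached to one decision.
type queueSignals struct {
	eppQueue    int // hypothetical: requests waiting at the EPP
	vllmWaiting int // hypothetical: vLLM request queue depth
	vllmRunning int // hypothetical: vLLM in-flight requests
}

// dominantFactor returns the "why" behind a scale-up: whichever signal
// exceeds its (made-up) threshold first, or "none".
func dominantFactor(q queueSignals) string {
	switch {
	case q.eppQueue > 10:
		return "epp-queue"
	case q.vllmWaiting > 5:
		return "vllm-waiting"
	case q.vllmRunning > 8:
		return "vllm-running"
	default:
		return "none"
	}
}

func main() {
	q := queueSignals{eppQueue: 2, vllmWaiting: 7, vllmRunning: 3}
	// In the real system this string would be recorded as a span
	// attribute, so a trace answers "which factor led to the decision".
	fmt.Println("scale-up reason:", dominantFactor(q))
}
```

Recording the result as one span attribute is what gives the posterity trail: anyone reading the trace later sees which signal drove the decision without re-deriving it from raw metrics.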
e2e and lint are failing, please check @shuynh2017
@Gregory-Pereira Thanks, tracing will help. We need to define how it will be consumed. Currently the request scheduler and WVA operate at different granularities, and on top of that WVA performs global optimization. We need to document how scaling correlation works across the e2e system with OTel.
Understood, let's hold off on this until the next release and we can revisit the problem.
Removing this issue, as it is not a priority for v0.6. |
This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the |
keep it open |
This PR introduces tracing with OpenTelemetry (OTel) for WVA. Users can configure WVA to export traces (metrics and logs may follow later) to a backend (such as Jaeger, Prometheus, or Grafana Tempo) via the OTLP (OpenTelemetry Protocol).
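For reference, an OpenTelemetry Collector configuration along these lines would receive OTLP from WVA and forward traces to a Jaeger backend. This is an illustrative minimal sketch, not taken from the PR: the endpoints are placeholders, and your collector and backend addresses will differ.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # WVA exports OTLP traces here

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # placeholder; Jaeger accepts OTLP natively
    tls:
      insecure: true   # illustrative only; see the note on client TLS below

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```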
At this time there are only two traces; we can add more as needed. This doesn't yet support client TLS to the OTel collector; I will address that next. Here are some screenshots using Jaeger as the collector to view traces of scaling decisions:

