Feat: Router observability (Current QPS, router-side queueing delay, etc) #78

Open
sitloboi2012 opened this issue Feb 7, 2025 · 3 comments
Labels
feature request New feature or request

Comments

@sitloboi2012
Contributor

This issue is dedicated to discussing the following feature:

(P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc)
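As a rough illustration of the "Current QPS" metric named above, here is a minimal sketch of a sliding-window counter. The class name `SlidingWindowQPS`, the `window_s` parameter, and the choice of a fixed time window are all assumptions for illustration; they are not taken from the router's code.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowQPS:
    """Hypothetical helper: reports request arrivals per second over a recent window."""

    def __init__(self, window_s: float = 10.0):
        self.window_s = window_s
        self._arrivals = deque()  # arrival timestamps, oldest first

    def record(self, now: Optional[float] = None) -> None:
        # Call once per incoming request; `now` is injectable for testing.
        self._arrivals.append(time.monotonic() if now is None else now)

    def qps(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_s
        # Drop arrivals that have fallen out of the window.
        while self._arrivals and self._arrivals[0] <= cutoff:
            self._arrivals.popleft()
        return len(self._arrivals) / self.window_s


# Usage with explicit timestamps (seconds):
counter = SlidingWindowQPS(window_s=2.0)
for t in (0.0, 0.5, 1.0, 1.5):
    counter.record(now=t)
current = counter.qps(now=1.5)
```

A shorter window reacts faster to load changes but gives a noisier number, so the window length is a tuning choice rather than something fixed by the metric itself.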

@gaocegege gaocegege added the feature request New feature or request label Feb 7, 2025
@gaocegege gaocegege changed the title Feat: Router observability Feat: Router observability (Current QPS, router-side queueing delay, etc) Feb 7, 2025
@sitloboi2012
Contributor Author

@gaocegege @ApostaC have you thought of a specific layout for the dashboard yet, or should we just add the metrics first and decide on the layout later? I'm currently splitting the metrics into 3 main groups:

  • Core vLLM Metrics: “Available vLLM instances”, “Request latency distribution”, “Request TTFT distribution”

  • Operational Metrics: “Number of Running Requests”, “GPU KV Usage Percentage”, “Number of Pending Requests”, “GPU KV cache hit rate”

  • Router Observability Metrics: “Current QPS”, “Router‐side Queueing Delay”, “Average Prefill Length”, “Number of Prefilling Requests”, “Number of Decoding Requests”, “Average Decoding Length”
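To make the router observability group more concrete, the sketch below shows one way a router could derive the pending / prefilling / decoding counts, the router-side queueing delay, and the average prefill length from per-request timestamps. Everything here is an assumption for illustration: the names `RequestRecord` and `RouterStats`, the event hooks, and the idea that the router can see when it dispatches a request and when the first token comes back.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Optional


@dataclass
class RequestRecord:
    arrived_at: float                       # request reached the router
    prompt_tokens: int                      # prompt (prefill) length in tokens
    dispatched_at: Optional[float] = None   # router forwarded it to a backend
    first_token_at: Optional[float] = None  # first token seen, i.e. prefill done

    @property
    def phase(self) -> str:
        if self.dispatched_at is None:
            return "pending"
        if self.first_token_at is None:
            return "prefilling"
        return "decoding"


class RouterStats:
    """Hypothetical in-memory tracker for in-flight requests."""

    def __init__(self) -> None:
        self._requests: Dict[str, RequestRecord] = {}

    def on_arrival(self, rid: str, now: float, prompt_tokens: int) -> None:
        self._requests[rid] = RequestRecord(now, prompt_tokens)

    def on_dispatch(self, rid: str, now: float) -> None:
        self._requests[rid].dispatched_at = now

    def on_first_token(self, rid: str, now: float) -> None:
        self._requests[rid].first_token_at = now

    def on_finish(self, rid: str) -> None:
        self._requests.pop(rid, None)

    def snapshot(self) -> dict:
        counts = {"pending": 0, "prefilling": 0, "decoding": 0}
        delays, prefill_lengths = [], []
        for r in self._requests.values():
            counts[r.phase] += 1
            if r.dispatched_at is not None:
                # Queueing delay: time spent in the router before dispatch.
                delays.append(r.dispatched_at - r.arrived_at)
            if r.phase == "prefilling":
                prefill_lengths.append(r.prompt_tokens)
        return {
            **counts,
            "avg_queueing_delay_s": mean(delays) if delays else 0.0,
            "avg_prefill_length": mean(prefill_lengths) if prefill_lengths else 0.0,
        }


stats = RouterStats()
stats.on_arrival("a", now=0.0, prompt_tokens=100)
stats.on_arrival("b", now=0.0, prompt_tokens=300)
stats.on_arrival("c", now=1.0, prompt_tokens=50)
stats.on_dispatch("a", now=0.5)
stats.on_dispatch("b", now=1.5)
stats.on_first_token("a", now=2.0)
snap = stats.snapshot()
```

Note that this sketch averages only over requests that are still in flight, since finished requests are dropped in `on_finish`. A dashboard would more likely want a histogram over a time window, which is a separate design choice.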

@YuhanLiu11
Collaborator

YuhanLiu11 commented Feb 9, 2025

@sitloboi2012 This looks good!

Just for your reference, below is our earlier design for this:

  • Overview of the system: Number of currently healthy vLLM pods, Number of requests that are processed or queuing, Average latency.
  • QoS information: Timeseries of average QPS, Average TTFT, Average ITL
  • Serving engine load: Timeseries of GPU KV cache usage, Number of running requests, Number of queuing requests, Number of swapped requests
  • Current resource usage: Timeseries of GPU, CPU, Memory and Disk usage
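For the QoS row, TTFT (time to first token) and ITL (inter-token latency) can both be derived from the timestamps of a streamed response. The following is a minimal sketch under the assumption that the send time and each token's arrival time are recorded; the function name `ttft_and_itl` and its signature are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def ttft_and_itl(request_sent_at: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean ITL) in seconds for one streamed response.

    request_sent_at: when the request was sent.
    token_times: arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay until the first output token arrives.
    ttft = token_times[0] - request_sent_at
    # ITL: average gap between consecutive output tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl


ttft, itl = ttft_and_itl(10.0, [10.5, 10.75, 11.0, 11.25])
```

Averaging these per-request values over a time bucket would give the "Average TTFT" and "Average ITL" timeseries listed above.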

(cc @ApostaC )

@sitloboi2012
Contributor Author

Nice, I think yours makes more sense, @YuhanLiu11. Mine was mostly guesswork based on usage, plus some suggestions from ChatGPT 😆
I will update it based on these references. Thanks for your input, appreciate it 👍


3 participants