
[Roadmap] vLLM production stack roadmap for 2025 Q1 #26

Open
7 of 15 tasks
ApostaC opened this issue Jan 27, 2025 · 19 comments
Comments

@ApostaC
Collaborator

ApostaC commented Jan 27, 2025

This project's scope covers a set of production-related modules around vLLM, including the router, autoscaling, observability, KV cache offloading, and framework support (KServe, Ray, etc.).

This document lists the items on our Q1 roadmap. We will keep updating it with the related issues, pull requests, and discussions in the #production-stack channel in the vLLM Slack.

Core features

CI/CD and packaging

OSS-related support


If an item you want is not on the roadmap, your suggestions and contributions are very welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.

Happy vLLMing!

@gaocegege
Collaborator

Since we have this refactor on the roadmap:

(P2) Rewrite the router in a more performant language (e.g., Rust or Go) for better QPS/throughput and lower latency

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

@hmellor
Contributor

hmellor commented Jan 28, 2025

(P0) format checker for the code

I can help bring over the new formatting from vLLM if you'd like. It's much simpler than format.sh, and (as long as it gets installed) you can't forget to run it!

@ApostaC
Collaborator Author

ApostaC commented Jan 28, 2025

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

@gaocegege I see your point. One thing that makes me a bit hesitant is that Python is friendlier to the LLM community.
The current plan is to first build a performance benchmark for the router and see how "bad" the current Python version is.

Another backup option in my mind is a Go backbone for the data plane with a Python interface for the routing logic, so that the community (both industry and academia) can contribute to it.
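To illustrate the idea, here is a minimal sketch of what such a Python routing-logic plugin interface might look like. All names below are hypothetical, not the actual router API: a Go data plane could invoke implementations of this interface (e.g. over gRPC) while routing policies stay contributable in Python.

```python
from abc import ABC, abstractmethod
from itertools import count


class RoutingLogic(ABC):
    """Hypothetical plugin interface: the Go data plane calls select_endpoint()
    and forwards the request to the returned backend."""

    @abstractmethod
    def select_endpoint(self, request_id: str, endpoints: list[str]) -> str:
        """Pick one vLLM backend URL for this request."""


class RoundRobinLogic(RoutingLogic):
    """Simplest possible policy, used here only to show the plugin shape."""

    def __init__(self) -> None:
        self._counter = count()

    def select_endpoint(self, request_id: str, endpoints: list[str]) -> str:
        return endpoints[next(self._counter) % len(endpoints)]
```

Keeping the interface this small is what would let academia and industry plug in smarter policies without touching the data plane.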

@ApostaC
Collaborator Author

ApostaC commented Jan 28, 2025

I can help bring over the new formatting from vllm if you'd like? It's much simpler than format.sh and (as long as it gets installed) you can't forget to run it!

@hmellor Thanks for chiming in! Would love to see your contribution! Feel free to create a new issue or PR for this!

@gaocegege
Collaborator

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

@gaocegege I see your point. One thing that makes me a bit hesitant is that Python is friendlier to the LLM community. The current plan is to first build a performance benchmark for the router and see how "bad" the current Python version is.

Another backup option in my mind is a Go backbone for the data plane with a Python interface for the routing logic, so that the community (both industry and academia) can contribute to it.

SGTM.

@hmellor
Contributor

hmellor commented Jan 29, 2025

@ApostaC great! Please take a look at #35

@spron-in

I'm curious whether you see the production stack evolving toward an operator. I see someone already suggested CRDs here: #7

As complexity grows, I could see this stack becoming more k8s-native to simplify operations even further.

ApostaC pinned this issue Jan 29, 2025
@ApostaC
Collaborator Author

ApostaC commented Jan 29, 2025

I'm curious whether you see the production stack evolving toward an operator.

Thanks for bringing up the question, @spron-in.
Right now, the stack is relatively simple, so we don't have immediate plans for this. Also, we hope the components in this stack can be directly reused for different purposes.

But we will definitely consider a more end-to-end solution to simplify operations once the stack grows more complex.

@nithin8702

  • Looks like we can't change the resources (CPU, memory) of the vllm-stack-deployment-router deployment
  • Examples of bringing your own model would be helpful (PyTorch preferred)
  • Shall we have image tags with a specific version instead of latest for every Helm release?

@ApostaC
Collaborator Author

ApostaC commented Feb 3, 2025

Hey @nithin8702 , thanks for your interest. Towards your questions:


  • Looks like we can't change the resources (CPU, memory) of the vllm-stack-deployment-router deployment

This should be done in #38


  • Examples of bringing your own model would be helpful (PyTorch preferred)

Do you mean loading models from local storage? Please refer to this tutorial.


  • Shall we have image tags with a specific version instead of latest for every Helm release?

Can you elaborate a bit more on this? I'm not sure I understand what you're asking.

@AlexXi19

AlexXi19 commented Feb 5, 2025

Also interested in seeing least request load balancing!

@sitloboi2012
Contributor

sitloboi2012 commented Feb 6, 2025

I'm currently looking at this item: (P1) Router observability (current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).

Do we already have a list of default observability metrics we want (e.g., the list in the requirement), or should we do some research and pick a default set for now?

I'm planning to contribute to this one.

@ApostaC
Collaborator Author

ApostaC commented Feb 6, 2025

Also interested in seeing least request load balancing!

@AlexXi19 Hey Alex, we are discussing this in #59, and Kuntai (@KuntaiDu) is currently designing and implementing the functionality in the router.
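For readers following along, the core of least-request load balancing is small. Here is a hedged sketch with made-up names (not the design being discussed in #59): the router tracks in-flight requests per backend and always picks the least-loaded one.

```python
class LeastRequestBalancer:
    """Route each request to the backend with the fewest in-flight requests."""

    def __init__(self, endpoints: list[str]) -> None:
        self._in_flight = {ep: 0 for ep in endpoints}

    def acquire(self) -> str:
        # Pick the least-loaded endpoint and count the new request against it.
        endpoint = min(self._in_flight, key=self._in_flight.get)
        self._in_flight[endpoint] += 1
        return endpoint

    def release(self, endpoint: str) -> None:
        # Call when the request completes so the endpoint's load drops.
        self._in_flight[endpoint] -= 1
```

A production version would also need to handle endpoints joining/leaving and concurrent access, which is where the real design work lies.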

@ApostaC
Collaborator Author

ApostaC commented Feb 6, 2025

I'm currently looking at this item: (P1) Router observability (current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).

Do we already have a list of default observability metrics we want (e.g., the list in the requirement), or should we do some research and pick a default set for now?

I'm planning to contribute to this one.

Thanks @sitloboi2012! Currently the router does maintain a list of stats internally. We can first expose that interface and export those metrics to Prometheus.

Would you like to open an issue for further discussion?
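As a sketch of the "export those metrics" step: the Prometheus text exposition format is simple enough to render by hand. The metric names below are made up for illustration (a real implementation would more likely use the prometheus_client library):

```python
def render_prometheus(stats: dict[str, float]) -> str:
    """Render router stats as Prometheus text exposition format (gauges)."""
    lines = []
    for name, value in sorted(stats.items()):
        metric = f"vllm_router_{name}"  # hypothetical metric name prefix
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"
```

The router would serve this string at a /metrics endpoint for Prometheus to scrape.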

@sitloboi2012
Contributor

sitloboi2012 commented Feb 7, 2025

I'm currently looking at this item: (P1) Router observability (current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).
Do we already have a list of default observability metrics we want (e.g., the list in the requirement), or should we do some research and pick a default set for now?
I'm planning to contribute to this one.

Thanks @sitloboi2012! Currently the router does maintain a list of stats internally. We can first expose that interface and export those metrics to Prometheus.

Would you like to open an issue for further discussion?

Yep, let's move this to this issue: Feat: Router Observability @ApostaC

@nithin8702

Hi Team

Shall we have release notes for every release?

@ApostaC
Collaborator Author

ApostaC commented Feb 13, 2025

Hi Team

Shall we have release notes for every release?

Good question! We will do that soon. @Shaoting-Feng @YuhanLiu11, please take a note.

@nithin8702

Does this support multi-tenancy or namespace isolation?

@ApostaC
Collaborator Author

ApostaC commented Feb 13, 2025

Does this support multi-tenancy or namespace isolation?

You can specify the namespace when running helm install.
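For example, one isolated install per tenant namespace might look like the following (the release and chart names here are illustrative; adjust them to your deployment):

```shell
# Each tenant gets its own namespace and its own Helm release.
kubectl create namespace team-a
helm install vllm-team-a vllm/vllm-stack --namespace team-a

kubectl create namespace team-b
helm install vllm-team-b vllm/vllm-stack --namespace team-b
```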
