
SLO-Driven Resource Management for vLLM #755

Open
Jeffwan opened this issue Feb 27, 2025 · 0 comments

Jeffwan commented Feb 27, 2025

🚀 Feature Description and Motivation

Background

Different requests have varying input/output lengths, leading to diverse resource requirements. Currently, when a batch of requests gets scheduled together, it is difficult to guarantee individual users' SLA. To address this, we want to design a solution that provides strong SLO guarantees and manages resource tiers and cost models in an SLO-driven manner. This will require engine co-design efforts.

Challenges

There are a few challenges around concepts that neither vLLM nor external systems support at this moment:

  1. Should we use goodput as the primary metric, or rely on simpler single-dimension metrics like TTFT or TPOT?
  2. Should we define resource classes based on request profiles?
  3. How can we ensure fair allocation without underutilizing GPU resources?
  4. How do we map the SLO to a token pricing model?
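To make the first challenge concrete, here is a minimal sketch of goodput as the fraction of requests that meet *both* a TTFT and a TPOT SLO, in contrast to tracking either single-dimension metric alone. The `RequestMetrics` type and the threshold values are hypothetical, for illustration only; a real system would make the SLO tiers configurable per tenant.

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_ms: float  # time to first token, milliseconds
    tpot_ms: float  # mean time per output token, milliseconds

# Hypothetical SLO thresholds for illustration.
TTFT_SLO_MS = 500.0
TPOT_SLO_MS = 50.0

def goodput(requests: list[RequestMetrics]) -> float:
    """Fraction of requests meeting both the TTFT and TPOT SLOs.

    A request that is fast to first token but slow per output token
    (or vice versa) counts as an SLO violation, which is what makes
    goodput stricter than either single-dimension metric.
    """
    if not requests:
        return 0.0
    met = sum(
        1 for r in requests
        if r.ttft_ms <= TTFT_SLO_MS and r.tpot_ms <= TPOT_SLO_MS
    )
    return met / len(requests)
```

For example, a batch where every request meets the TTFT SLO can still have low goodput if decode-phase contention pushes TPOT past its threshold, which is why a goodput-style composite may be the better scheduling signal.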

I will skip the proposal details for now and leave them to public discussion.

Use Case

Support multi-tenant use cases.

Proposed Solution

No response
