🚀 Feature Description and Motivation
Background
Different requests have widely varying input and output lengths, and therefore very different resource requirements. Today, when a batch of requests is scheduled together, it is hard to guarantee each individual user's SLA. To address this, we want to design a solution that provides strong SLO guarantees and manages resource tiers and cost models in an SLO-driven manner. This will require co-design with the engine.
Challenges
There are a few conceptual challenges that neither vLLM nor external systems address today:
Should we use goodput as the primary metric, or rely on simpler single-dimension metrics such as TTFT or TPOT?
Should we define resource classes based on request profiles?
How can we ensure fair allocation across tenants without underutilizing GPU resources?
How do we map SLOs to a token pricing model?
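To make the goodput question concrete, here is a minimal sketch of how goodput differs from raw throughput: a request counts only if it meets every SLO dimension at once. The `RequestMetrics` class and the SLO thresholds are hypothetical illustrations, not existing vLLM APIs.

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_ms: float  # time to first token, in milliseconds
    tpot_ms: float  # mean time per output token, in milliseconds

# Hypothetical SLO thresholds; in practice these would come from a tier config.
TTFT_SLO_MS = 500.0
TPOT_SLO_MS = 50.0

def goodput(metrics: list[RequestMetrics]) -> float:
    """Fraction of requests that meet *all* SLO dimensions.

    Unlike throughput, a request that violates either the TTFT or the
    TPOT target contributes nothing, so optimizing goodput rewards
    meeting SLAs rather than maximizing token counts.
    """
    if not metrics:
        return 0.0
    good = sum(
        1 for m in metrics
        if m.ttft_ms <= TTFT_SLO_MS and m.tpot_ms <= TPOT_SLO_MS
    )
    return good / len(metrics)
```

Under this definition a scheduler can trade a little per-request latency headroom for batching, as long as each admitted request still lands inside its SLO box.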
I will skip the proposal section for now and leave it open for public discussion.
Use Case
Support the multi-tenant use case, where tenants with different SLO requirements share the same engine.
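One way the multi-tenant use case could tie resource classes, SLOs, and pricing together is to make the price a function of the SLO tier rather than of tokens alone. The tier names, thresholds, and prices below are invented for illustration only; nothing here is an existing vLLM interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceClass:
    name: str
    ttft_slo_ms: float          # latency target for the first token
    tpot_slo_ms: float          # latency target per output token
    price_per_1k_tokens: float  # tighter SLOs command a higher price

# Hypothetical tiers: tenants pay for the guarantee, not just for tokens.
TIERS = {
    "gold":   ResourceClass("gold",   200.0,   20.0, 0.50),
    "silver": ResourceClass("silver", 500.0,   50.0, 0.20),
    "bronze": ResourceClass("bronze", 2000.0, 200.0, 0.05),
}

def quote(tier: str, output_tokens: int) -> float:
    """Price a request from its tenant's tier and output token count."""
    rc = TIERS[tier]
    return rc.price_per_1k_tokens * output_tokens / 1000.0
```

A scheduler aware of these classes could then admit or delay requests per tier, which is where the engine co-design mentioned above comes in.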
Proposed Solution
No response