🚀 Feature Description and Motivation
Background
Different requests have widely varying input and output lengths, and therefore very different resource requirements. Today, when a batch of requests is scheduled together, it is hard to guarantee each individual user's SLA. To address this, we want to design a solution that provides strong SLO guarantees and manages resource tiers and cost models in an SLO-driven manner. This will require co-design with the engine.
Challenges
There are a few conceptual challenges that neither vLLM nor external systems address today:
Should we use goodput as the primary metric, or rely on simpler single-dimension metrics such as TTFT or TPOT?
Should we define resource classes based on request profiles?
How can we ensure fair allocation across tenants without underutilizing GPU resources?
How do we map SLOs to a token pricing model?
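To make the goodput question concrete, here is a minimal sketch of how goodput differs from raw throughput: a request counts only if it meets every SLO dimension at once. The `RequestMetrics` class and the SLO thresholds are hypothetical illustrations, not existing vLLM APIs.

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_ms: float  # time to first token, in milliseconds
    tpot_ms: float  # mean time per output token, in milliseconds

# Hypothetical SLO thresholds; in practice these would come from a tier config.
TTFT_SLO_MS = 500.0
TPOT_SLO_MS = 50.0

def goodput(metrics: list[RequestMetrics]) -> float:
    """Fraction of requests that meet *all* SLO dimensions.

    Unlike throughput, a request that violates either the TTFT or the
    TPOT target contributes nothing, so optimizing goodput rewards
    meeting SLAs rather than maximizing token counts.
    """
    if not metrics:
        return 0.0
    good = sum(
        1 for m in metrics
        if m.ttft_ms <= TTFT_SLO_MS and m.tpot_ms <= TPOT_SLO_MS
    )
    return good / len(metrics)
```

Under this definition a scheduler can trade a little per-request latency headroom for batching, as long as each admitted request still lands inside its SLO box.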
I will skip the proposal section for now and leave it open for public discussion.
Use Case
Support the multi-tenant use case, where tenants with different SLO requirements share the same engine.
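One way the multi-tenant use case could tie resource classes, SLOs, and pricing together is to make the price a function of the SLO tier rather than of tokens alone. The tier names, thresholds, and prices below are invented for illustration only; nothing here is an existing vLLM interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceClass:
    name: str
    ttft_slo_ms: float          # latency target for the first token
    tpot_slo_ms: float          # latency target per output token
    price_per_1k_tokens: float  # tighter SLOs command a higher price

# Hypothetical tiers: tenants pay for the guarantee, not just for tokens.
TIERS = {
    "gold":   ResourceClass("gold",   200.0,   20.0, 0.50),
    "silver": ResourceClass("silver", 500.0,   50.0, 0.20),
    "bronze": ResourceClass("bronze", 2000.0, 200.0, 0.05),
}

def quote(tier: str, output_tokens: int) -> float:
    """Price a request from its tenant's tier and output token count."""
    rc = TIERS[tier]
    return rc.price_per_1k_tokens * output_tokens / 1000.0
```

A scheduler aware of these classes could then admit or delay requests per tier, which is where the engine co-design mentioned above comes in.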
Proposed Solution
No response