Closed
Labels
component/sidecar, needs-triage (indicates an issue or PR lacks a `triage/foo` label and requires one)
Description
vLLM has added a feature called Data Parallel (DP). When DP is active, vLLM runs multiple inference engines in the same Kubernetes pod. There are several DP configurations, all of which run multiple vLLM engines in one pod.
In one of these DP configurations, multiple vLLM engines run in one pod, each with its own inference port and metrics port.
When running Disaggregated Prefill/Decode (PD), we run an llm-d-routing-sidecar in front of the vLLM engine in the decode pods. When running PD with DP enabled, we need the llm-d-routing-sidecar to listen on N ports, with each port serving one specific vLLM engine in the pod.
To simplify things:
- We can add a command line argument `--data-parallel-size=N`, where N indicates how many vLLM engines are running in parallel. This is the same command line argument used by vLLM. N must be greater than zero and, for now, less than or equal to eight, with a default of one.
- When `--data-parallel-size` is used with a value greater than one:
  - The sidecar will listen on N ports starting with the port specified by the `--port` command line argument
  - The sidecar will communicate with N vLLM engines, all listening on separate ports starting with the port number specified by the `--vllm-port` command line argument
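
The port mapping proposed above could be sketched roughly as follows. This is only an illustration, not the sidecar's actual code: the `portPair` type and `portPairs` helper are hypothetical names, the flag defaults for `--port` and `--vllm-port` are placeholders, and it assumes "starting with" means consecutive port numbers, so rank i uses `port+i` and `vllm-port+i`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// maxDataParallelSize is the upper bound stated in the proposal ("for now").
const maxDataParallelSize = 8

// portPair (hypothetical type) ties one sidecar listen port to the single
// vLLM engine port it fronts.
type portPair struct {
	Listen int
	Engine int
}

// portPairs (hypothetical helper) returns n pairs.
// Assumption: ports are consecutive, so rank i listens on basePort+i
// and talks to the engine on vllmPort+i.
func portPairs(basePort, vllmPort, n int) ([]portPair, error) {
	if n < 1 || n > maxDataParallelSize {
		return nil, fmt.Errorf("data-parallel-size must be in [1, %d], got %d",
			maxDataParallelSize, n)
	}
	pairs := make([]portPair, 0, n)
	for i := 0; i < n; i++ {
		pairs = append(pairs, portPair{Listen: basePort + i, Engine: vllmPort + i})
	}
	return pairs, nil
}

func main() {
	// Default values for --port and --vllm-port are placeholders for this sketch.
	port := flag.Int("port", 8000, "first port the sidecar listens on")
	vllmPort := flag.Int("vllm-port", 8200, "first vLLM engine port")
	dpSize := flag.Int("data-parallel-size", 1, "number of vLLM engines in the pod")
	flag.Parse()

	pairs, err := portPairs(*port, *vllmPort, *dpSize)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	for rank, p := range pairs {
		fmt.Printf("rank %d: listen :%d -> vLLM :%d\n", rank, p.Listen, p.Engine)
	}
}
```

With the default of one, the mapping collapses to a single pair, which matches the behavior without DP. A real implementation would likely also need to reject configurations where the two port ranges overlap.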