feat: Extend PD support to include when vLLM is configured to run in Data Parallel mode #401

vLLM has added a feature called Data Parallel (DP). When DP is active, vLLM runs multiple inference engines in the same Kubernetes pod. There are several DP configurations, all of which run multiple vLLM engines in one pod.

One of the DP configurations has multiple vLLM engines running in one pod, each with its own inference port and metrics port.

When running Disaggregated Prefill/Decode (PD), we run an llm-d-routing-sidecar in front of the vLLM engine in the decode pods. When running PD with DP enabled, we need the llm-d-routing-sidecar to listen on N ports, with each port serving one specific vLLM engine in the pod.

To keep things simple:

* We can add a command line argument to the sidecar, `--data-parallel-size=N`, where `N` indicates how many vLLM engines are running in parallel. This is the same command line argument used by vLLM. `N` must be greater than zero and, for now, less than or equal to eight, with a default of one.
* When `--data-parallel-size` is used with a value greater than one:
   * The sidecar will listen on `N` ports, starting with the port specified by the `--port` command line argument.
   * The sidecar will communicate with `N` vLLM engines, all listening on separate ports, starting with the port number specified by the `--vllm-port` command line argument.
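
The port mapping described above could be sketched roughly as follows. This is only an illustration, not the sidecar's actual code: `portPair`, `portPairs`, and `maxDataParallelSize` are hypothetical names, the port values in `main` are arbitrary examples, and the sketch assumes that "starting with" means consecutive port numbers, with listener `i` forwarding to engine `i`.

```go
package main

import "fmt"

// maxDataParallelSize is the upper bound proposed in this issue ("for now").
const maxDataParallelSize = 8

// portPair (hypothetical type) ties one sidecar listen port to the
// vLLM engine port it forwards to.
type portPair struct {
	SidecarPort int
	VLLMPort    int
}

// portPairs (hypothetical helper) validates the --data-parallel-size value
// and returns one pair per vLLM engine. It assumes both port ranges are
// consecutive, starting at the --port and --vllm-port values respectively.
func portPairs(sidecarPort, vllmPort, dataParallelSize int) ([]portPair, error) {
	if dataParallelSize < 1 || dataParallelSize > maxDataParallelSize {
		return nil, fmt.Errorf("--data-parallel-size must be between 1 and %d, got %d",
			maxDataParallelSize, dataParallelSize)
	}
	pairs := make([]portPair, 0, dataParallelSize)
	for rank := 0; rank < dataParallelSize; rank++ {
		pairs = append(pairs, portPair{
			SidecarPort: sidecarPort + rank,
			VLLMPort:    vllmPort + rank,
		})
	}
	return pairs, nil
}

func main() {
	// Example values only: --port=8000 --vllm-port=8200 --data-parallel-size=4
	pairs, err := portPairs(8000, 8200, 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for rank, p := range pairs {
		fmt.Printf("rank %d: sidecar :%d -> vLLM :%d\n", rank, p.SidecarPort, p.VLLMPort)
	}
}
```

With the default of one, the helper returns a single pair, which matches the current single-engine behavior. Each pair would then back its own listener in the sidecar.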


