feat: Extend PD support to include when vLLM is configured to run in Data Parallel mode #401

@elevran

Description

vLLM has added a feature called Data Parallel (DP). When DP is active, vLLM runs multiple inference engines in the same Kubernetes pod. There are several different DP configurations, all of which have multiple vLLM engines on one pod.

One of the DP configurations has multiple vLLM engines running on one pod, each with its own inference port and metrics port.

When running Disaggregated Prefill/Decode (PD), we run with an llm-d-routing-sidecar in front of the vLLM engine in the decode pods. When running PD with DP enabled, we need the llm-d-routing-sidecar to listen on N ports, with each port serving one specific vLLM engine in the pod.

To simplify things:

  • We can add a command line argument --data-parallel-size=N, where N indicates how many vLLM engines are running in parallel. This is the same command line argument used by vLLM. N must be greater than zero and, for now, less than or equal to eight, with a default of one.
  • When --data-parallel-size is used with a value greater than one:
    • The sidecar will listen on N ports starting with the port specified by the --port command line argument
    • The sidecar will communicate with N vLLM engines, all listening on separate ports starting with the port number specified by the --vllm-port command line argument
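
The port layout proposed above could be sketched roughly as follows. This is only an illustration, not the sidecar's actual code: the names `PortPair` and `PortPairs` are hypothetical, and it assumes that both port ranges are consecutive and that the i-th listen port maps to the i-th vLLM engine port, which the proposal implies but does not state outright.

```go
package main

import "fmt"

// maxDataParallelSize mirrors the upper bound of eight proposed in this issue.
const maxDataParallelSize = 8

// PortPair (hypothetical name) ties one sidecar listen port to the
// vLLM engine port it forwards to.
type PortPair struct {
	ListenPort int
	VLLMPort   int
}

// PortPairs (hypothetical helper) returns one pair per data-parallel rank.
// Assumption: ports are consecutive, so rank i uses port+i and vllmPort+i.
func PortPairs(port, vllmPort, dataParallelSize int) ([]PortPair, error) {
	// Enforce the proposed bounds: greater than zero, at most eight.
	if dataParallelSize < 1 || dataParallelSize > maxDataParallelSize {
		return nil, fmt.Errorf("data-parallel-size must be between 1 and %d, got %d",
			maxDataParallelSize, dataParallelSize)
	}
	pairs := make([]PortPair, 0, dataParallelSize)
	for rank := 0; rank < dataParallelSize; rank++ {
		pairs = append(pairs, PortPair{
			ListenPort: port + rank,
			VLLMPort:   vllmPort + rank,
		})
	}
	return pairs, nil
}
```

With the default --data-parallel-size of one, this yields a single pair, so the existing single-engine behavior would be unchanged. The sketch does not check whether the two port ranges overlap; a real implementation would probably want to reject that case as well.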

Metadata

Assignees

No one assigned

    Labels

    component/sidecar, needs-triage (indicates an issue or PR lacks a `triage/foo` label and requires one)

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
