<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://blog.vllm.ai/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.vllm.ai/" rel="alternate" type="text/html" /><updated>2025-01-30T11:55:29-08:00</updated><id>https://blog.vllm.ai/feed.xml</id><title type="html">vLLM Blog</title><author><name>© 2025. vLLM Team. All rights reserved.</name></author><entry><title type="html">Introducing vLLM Inference Provider in Llama Stack</title><link href="https://blog.vllm.ai/2025/01/27/intro-to-llama-stack-with-vllm.html" rel="alternate" type="text/html" title="Introducing vLLM Inference Provider in Llama Stack" /><published>2025-01-27T00:00:00-08:00</published><updated>2025-01-27T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/27/intro-to-llama-stack-with-vllm</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/27/intro-to-llama-stack-with-vllm.html"><![CDATA[<p>We are excited to announce that vLLM inference provider is now available in <a href="https://github.com/meta-llama/llama-stack">Llama Stack</a> through the collaboration between the Red Hat AI Engineering team and the Llama Stack team from Meta. This article provides an introduction to this integration and a tutorial to help you get started using it locally or deploying it in a Kubernetes cluster.</p>
<h1 id="what-is-llama-stack">What is Llama Stack?</h1>
<p><img align="right" src="/assets/figures/llama-stack/llama-stack.png" alt="llama-stack-diagram" width="50%" height="50%" /></p>
<p>Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.</p>
<p>Llama Stack focuses on making it easy to build production applications with a variety of models - ranging from the latest Llama 3.3 model to specialized models like Llama Guard for safety and other models. The goal is to provide pre-packaged implementations (aka “distributions”) which can be run in a variety of deployment environments. The Stack can assist you in your entire app development lifecycle - start iterating on local, mobile or desktop and seamlessly transition to on-prem or public cloud deployments. At every point in this transition, the same set of APIs and the same developer experience are available.</p>
<p>Each specific implementation of an API is called a “Provider” in this architecture. Users can swap providers via configuration. vLLM is a prominent example of a high-performance API backing the inference API.</p>
<h1 id="vllm-inference-provider">vLLM Inference Provider</h1>
<p>Llama Stack provides two vLLM inference providers:</p>
<ol>
<li><a href="https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/remote-vllm.html">Remote vLLM inference provider</a> through vLLM’s <a href="https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-completions-api-with-vllm">OpenAI-compatible server</a>;</li>
<li><a href="https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm">Inline vLLM inference provider</a> that runs alongside with Llama Stack server.</li>
</ol>
<p>In this article, we will demonstrate the functionality through the remote vLLM inference provider.</p>
<h1 id="tutorial">Tutorial</h1>
<h2 id="prerequisites">Prerequisites</h2>
<ul>
<li>Linux operating system</li>
<li><a href="https://huggingface.co/docs/huggingface_hub/main/en/guides/cli">Hugging Face CLI</a> if you’d like to download the model via CLI.</li>
<li>OCI-compliant container technologies like <a href="https://podman.io/">Podman</a> or <a href="https://www.docker.com/">Docker</a> (can be specified via the <code class="language-plaintext highlighter-rouge">CONTAINER_BINARY</code> environment variable when running <code class="language-plaintext highlighter-rouge">llama stack</code> CLI commands).</li>
<li><a href="https://kind.sigs.k8s.io/">Kind</a> for Kubernetes deployment.</li>
<li><a href="https://github.com/conda/conda">Conda</a> for managing Python environment.</li>
</ul>
<h2 id="get-started-via-containers">Get Started via Containers</h2>
<h3 id="start-vllm-server">Start vLLM Server</h3>
<p>We first download the “Llama-3.2-1B-Instruct” model using the <a href="https://huggingface.co/docs/huggingface_hub/main/en/guides/cli">Hugging Face CLI</a>. Note that you’ll need to specify your Hugging Face token when logging in.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> /tmp/test-vllm-llama-stack
huggingface-cli login <span class="nt">--token</span> <YOUR-HF-TOKEN>
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct <span class="nt">--local-dir</span> /tmp/test-vllm-llama-stack/.cache/huggingface/hub/models/Llama-3.2-1B-Instruct
</code></pre></div></div>
<p>Next, let’s build the vLLM CPU container image from source. Note that while we use it for demonstration purposes, there are plenty of <a href="https://docs.vllm.ai/en/latest/getting_started/installation/index.html">other images available for different hardware and architectures</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone [email protected]:vllm-project/vllm.git /tmp/test-vllm-llama-stack
cd /tmp/test-vllm-llama-stack/vllm
podman build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
</code></pre></div></div>
<p>We can then start the vLLM container:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>podman run <span class="nt">-it</span> <span class="nt">--network</span><span class="o">=</span>host <span class="se">\</span>
<span class="nt">--group-add</span><span class="o">=</span>video <span class="se">\</span>
<span class="nt">--ipc</span><span class="o">=</span>host <span class="se">\</span>
<span class="nt">--cap-add</span><span class="o">=</span>SYS_PTRACE <span class="se">\</span>
<span class="nt">--security-opt</span> <span class="nv">seccomp</span><span class="o">=</span>unconfined <span class="se">\</span>
<span class="nt">--device</span> /dev/kfd <span class="se">\</span>
<span class="nt">--device</span> /dev/dri <span class="se">\</span>
<span class="nt">-v</span> /tmp/test-vllm-llama-stack/.cache/huggingface/hub/models/Llama-3.2-1B-Instruct:/app/model <span class="se">\</span>
<span class="nt">--entrypoint</span><span class="o">=</span><span class="s1">'["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "/app/model", "--served-model-name", "meta-llama/Llama-3.2-1B-Instruct", "--port", "8000"]'</span> <span class="se">\</span>
vllm-cpu-env
</code></pre></div></div>
<p>We can get a list of models and test a prompt once the model server has started:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
<span class="nt">-d</span> <span class="s1">'{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'</span>
</code></pre></div></div>
<h3 id="start-llama-stack-server">Start Llama Stack Server</h3>
<p>Once we verify that the vLLM server has started successfully and is able to serve requests, we can then build and start the Llama Stack server.</p>
<p>First, we clone the Llama Stack source code and create a Conda environment that includes all the dependencies:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone [email protected]:meta-llama/llama-stack.git /tmp/test-vllm-llama-stack/llama-stack
cd /tmp/test-vllm-llama-stack/llama-stack
conda create -n stack python=3.10
conda activate stack
pip install .
</code></pre></div></div>
<p>Next, we build the container image with <code class="language-plaintext highlighter-rouge">llama stack build</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat > /tmp/test-vllm-llama-stack/vllm-llama-stack-build.yaml << "EOF"
name: vllm
distribution_spec:
description: Like local, but use vLLM for running LLM inference
providers:
inference: remote::vllm
safety: inline::llama-guard
agents: inline::meta-reference
vector_io: inline::faiss
datasetio: inline::localfs
scoring: inline::basic
eval: inline::meta-reference
post_training: inline::torchtune
telemetry: inline::meta-reference
image_type: container
EOF
export CONTAINER_BINARY=podman
LLAMA_STACK_DIR=. PYTHONPATH=. python -m llama_stack.cli.llama stack build --config /tmp/test-vllm-llama-stack/vllm-llama-stack-build.yaml --image-name distribution-myenv
</code></pre></div></div>
<p>Once the container image has been built successfully, we can edit the generated <code class="language-plaintext highlighter-rouge">vllm-run.yaml</code>, save it as <code class="language-plaintext highlighter-rouge">/tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml</code>, and apply the following change to the <code class="language-plaintext highlighter-rouge">models</code> field:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL}
provider_id: vllm
provider_model_id: null
</code></pre></div></div>
<p>Then we can start the Llama Stack Server with the image we built via <code class="language-plaintext highlighter-rouge">llama stack run</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export INFERENCE_ADDR=host.containers.internal
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
export LLAMA_STACK_PORT=5000
LLAMA_STACK_DIR=. PYTHONPATH=. python -m llama_stack.cli.llama stack run \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://$INFERENCE_ADDR:$INFERENCE_PORT/v1 \
--env VLLM_MAX_TOKENS=8192 \
--env VLLM_API_TOKEN=fake \
--env LLAMA_STACK_PORT=$LLAMA_STACK_PORT \
/tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml
</code></pre></div></div>
<p>Alternatively, we can run the following <code class="language-plaintext highlighter-rouge">podman run</code> command instead:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>podman run --security-opt label=disable -it --network host -v /tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml:/app/config.yaml -v /tmp/test-vllm-llama-stack/llama-stack:/app/llama-stack-source \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://$INFERENCE_ADDR:$INFERENCE_PORT/v1 \
--env VLLM_MAX_TOKENS=8192 \
--env VLLM_API_TOKEN=fake \
--env LLAMA_STACK_PORT=$LLAMA_STACK_PORT \
--entrypoint='["python", "-m", "llama_stack.distribution.server.server", "--yaml-config", "/app/config.yaml"]' \
localhost/distribution-myenv:dev
</code></pre></div></div>
<p>Once the Llama Stack server has started successfully, we can test an inference request:</p>
<p>Via Bash:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ChatCompletionResponse(
completion_message=CompletionMessage(
content="Hello! I'm an AI, a conversational AI model. I'm a type of computer program designed to understand and respond to human language. My creators have
trained me on a vast amount of text data, allowing me to generate human-like responses to a wide range of questions and topics. I'm here to help answer any question you
may have, so feel free to ask me anything!",
role='assistant',
stop_reason='end_of_turn',
tool_calls=[]
),
logprobs=None
)
</code></pre></div></div>
<p>Via Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">llama_stack_client</span> <span class="kn">import</span> <span class="n">LlamaStackClient</span>
<span class="n">client</span> <span class="o">=</span> <span class="nc">LlamaStackClient</span><span class="p">(</span><span class="n">base_url</span><span class="o">=</span><span class="sa">f</span><span class="sh">"</span><span class="s">http://localhost:</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="sh">'</span><span class="s">LLAMA_STACK_PORT</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="c1"># List available models
</span><span class="n">models</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="nf">list</span><span class="p">()</span>
<span class="nf">print</span><span class="p">(</span><span class="n">models</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">inference</span><span class="p">.</span><span class="nf">chat_completion</span><span class="p">(</span>
<span class="n">model_id</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="sh">"</span><span class="s">INFERENCE_MODEL</span><span class="sh">"</span><span class="p">],</span>
<span class="n">messages</span><span class="o">=</span><span class="p">[</span>
<span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">system</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">You are a helpful assistant.</span><span class="sh">"</span><span class="p">},</span>
<span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Write a haiku about coding</span><span class="sh">"</span><span class="p">}</span>
<span class="p">]</span>
<span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">completion_message</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Model(identifier='meta-llama/Llama-3.2-1B-Instruct', metadata={}, api_model_type='llm', provider_id='vllm', provider_resource_id='meta-llama/Llama-3.2-1B-Instruct', type='model', model_type='llm')]
Here is a haiku about coding:
Columns of code flow
Logic codes the endless night
Tech's silent dawn rise
</code></pre></div></div>
<h2 id="deployment-on-kubernetes">Deployment on Kubernetes</h2>
<p>Instead of starting the Llama Stack and vLLM servers locally, we can deploy them in a Kubernetes cluster. We’ll use a local Kind cluster for demonstration purposes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
</code></pre></div></div>
<p>Start vLLM server as a Kubernetes Pod and Service (remember to replace <code class="language-plaintext highlighter-rouge"><YOUR-HF-TOKEN></code> with your actual token):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: vllm-models
spec:
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
type: Opaque
data:
token: "<YOUR-HF-TOKEN>"
---
apiVersion: v1
kind: Pod
metadata:
name: vllm-server
labels:
app: vllm
spec:
containers:
- name: llama-stack
image: localhost/vllm-cpu-env:latest
command:
- bash
- -c
- |
MODEL="meta-llama/Llama-3.2-1B-Instruct"
MODEL_PATH=/app/model/$(basename $MODEL)
huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN
huggingface-cli download $MODEL --local-dir $MODEL_PATH --cache-dir $MODEL_PATH
python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --served-model-name $MODEL --port 8000
ports:
- containerPort: 8000
volumeMounts:
- name: llama-storage
mountPath: /app/model
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
volumes:
- name: llama-storage
persistentVolumeClaim:
claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
name: vllm-server
spec:
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
type: NodePort
EOF
</code></pre></div></div>
<p>We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl logs vllm-server
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
</code></pre></div></div>
<p>Then we can modify the previously created <code class="language-plaintext highlighter-rouge">vllm-llama-stack-run.yaml</code> to <code class="language-plaintext highlighter-rouge">/tmp/test-vllm-llama-stack/vllm-llama-stack-run-k8s.yaml</code> with the following inference provider:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>providers:
inference:
- provider_id: vllm
provider_type: remote::vllm
config:
url: http://vllm-server.default.svc.cluster.local:8000/v1
max_tokens: 4096
api_token: fake
</code></pre></div></div>
<p>Once we have defined the run configuration for Llama Stack, we can build an image with that configuration and the server source code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat >/tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s <<EOF
FROM distribution-myenv:dev
RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source
ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml
EOF
podman build -f /tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s /tmp/test-vllm-llama-stack
</code></pre></div></div>
<p>We can then start the Llama Stack server by deploying a Kubernetes Pod and Service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: llama-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
name: llama-stack-pod
labels:
app: llama-stack
spec:
containers:
- name: llama-stack
image: localhost/llama-stack-run-k8s:latest
imagePullPolicy: IfNotPresent
command: ["python", "-m", "llama_stack.distribution.server.server", "--yaml-config", "/app/config.yaml"]
ports:
- containerPort: 5000
volumeMounts:
- name: llama-storage
mountPath: /root/.llama
volumes:
- name: llama-storage
persistentVolumeClaim:
claimName: llama-pvc
---
apiVersion: v1
kind: Service
metadata:
name: llama-stack-service
spec:
selector:
app: llama-stack
ports:
- protocol: TCP
port: 5000
targetPort: 5000
type: ClusterIP
EOF
</code></pre></div></div>
<p>We can check that the Llama Stack server has started:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl logs vllm-server
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: ASGI 'lifespan' protocol appears unsupported.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
</code></pre></div></div>
<p>Now let’s forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl port-forward service/llama-stack-service 5000:5000
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
</code></pre></div></div>
<p>You can learn more about different providers and functionalities of Llama Stack on <a href="https://llama-stack.readthedocs.io">the official documentation</a>.</p>
<h2 id="acknowledgement">Acknowledgement</h2>
<p>We’d like to thank the Red Hat AI Engineering team for the implementation of the vLLM inference providers, contributions to many bug fixes, improvements, and key design discussions. We also want to thank the Llama Stack team from Meta and the vLLM team for their timely PR reviews and bug fixes.</p>]]></content><author><name>Yuan Tang (Red Hat) and Ashwin Bharambe (Meta)</name></author><summary type="html"><![CDATA[We are excited to announce that vLLM inference provider is now available in Llama Stack through the collaboration between the Red Hat AI Engineering team and the Llama Stack team from Meta. This article provides an introduction to this integration and a tutorial to help you get started using it locally or deploying it in a Kubernetes cluster.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.vllm.ai/assets/figures/llama-stack/llama-stack.png" /><media:content medium="image" url="https://blog.vllm.ai/assets/figures/llama-stack/llama-stack.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">vLLM V1: A Major Upgrade to vLLM’s Core Architecture</title><link href="https://blog.vllm.ai/2025/01/27/v1-alpha-release.html" rel="alternate" type="text/html" title="vLLM V1: A Major Upgrade to vLLM’s Core Architecture" /><published>2025-01-27T00:00:00-08:00</published><updated>2025-01-27T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/27/v1-alpha-release</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/27/v1-alpha-release.html"><![CDATA[<p align="center">
<picture>
<img src="/assets/figures/v1/vLLM_V1_Logo.png" width="80%" />
</picture>
</p>
<p>We are thrilled to announce the <strong>alpha release of vLLM V1</strong>, a major upgrade to vLLM’s core architecture. Based on lessons we learned over the past 1.5 years of vLLM development, we revisited key design decisions, consolidated various features, and simplified the codebase to enhance flexibility and scalability. V1 already achieves <strong>state-of-the-art performance</strong> and is set to gain even more optimizations. Best of all, users can enable V1 seamlessly—just set the <code class="language-plaintext highlighter-rouge">VLLM_USE_V1=1</code> environment variable <strong>without any changes to the existing API</strong>. After testing and feedback collection in the coming weeks, we plan to transition V1 into the default engine.</p>
<h1 id="why-vllm-v1">Why vLLM V1?</h1>
<h2 id="learning-from-vllm-v0">Learning from vLLM V0</h2>
<p>Over the past 1.5 years, vLLM has achieved remarkable success in supporting diverse models, features, and hardware backends. However, while our community scaled horizontally, we faced challenges making the systems simple and integrating various optimizations vertically across the stack. Features were often developed independently, making it difficult to combine them effectively and cleanly. Over time, technical debt accumulated, prompting us to revisit our foundational design.</p>
<h2 id="goals-of-v1">Goals of V1</h2>
<p>Based on the above motivation, vLLM V1 is designed to:</p>
<ul>
<li>Provide a <strong>simple, modular, and easy-to-hack codebase</strong>.</li>
<li>Ensure <strong>high performance</strong> with near-zero CPU overhead.</li>
<li><strong>Combine key optimizations</strong> into a unified architecture.</li>
<li>Require <strong>zero configs</strong> by enabling features/optimizations by default.</li>
</ul>
<h2 id="scope-of-v1">Scope of V1</h2>
<p>vLLM V1 introduces a comprehensive re-architecture of its core components, including the scheduler, KV cache manager, worker, sampler, and API server. However, it still shares a lot of code with vLLM V0, such as model implementations, GPU kernels, distributed control plane, and various utility functions. This approach allows V1 to leverage the extensive coverage and stability established by V0 while delivering significant performance enhancements and reducing code complexity.</p>
<h1 id="whats-new-in-vllm-v1">What’s New in vLLM V1?</h1>
<h2 id="1-optimized-execution-loop--api-server">1. Optimized Execution Loop & API Server</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_server_architecture.png" width="60%" />
</picture>
</p>
<p>As a full-fledged continuous batching engine and OpenAI-compatible API server, vLLM’s core execution loop relies on CPU operations to manage request states between model forward passes. As GPUs are getting faster and significantly reducing model execution times, the CPU overhead for tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models like Llama-8B running on NVIDIA H100 GPUs, where execution time on the GPU is as low as ~5ms.</p>
<p>In the <a href="https://blog.vllm.ai/2024/09/05/perf-update.html">v0.6.0 release</a>, vLLM introduced a multiprocessing API server utilizing ZeroMQ for IPC, enabling overlap between the API server and AsyncLLM. vLLM V1 extends this by integrating the multiprocessing architecture deeper into the core of AsyncLLM, creating an isolated <code class="language-plaintext highlighter-rouge">EngineCore</code> execution loop that focuses exclusively on the scheduler and model executor. This design allows for greater overlap of CPU-intensive tasks—such as tokenization, multimodal input processing, de-tokenization, and request streaming—with the core execution loop, thereby maximizing model throughput.</p>
<h2 id="2-simple--flexible-scheduler">2. Simple & Flexible Scheduler</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_scheduling.png" width="60%" />
</picture>
</p>
<p>vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional distinction between “prefill” and “decode” phases by treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions are represented as a simple dictionary, e.g., <code class="language-plaintext highlighter-rouge">{request_id: num_tokens}</code>, which specifies the number of tokens to process for each request at each step. We find that this representation is general enough to support features such as chunked prefills, prefix caching, and speculative decoding. For instance, chunked-prefill scheduling is seamlessly implemented: with a fixed token budget, the scheduler dynamically decides how many tokens to allocate to each request (as shown in the figure above).</p>
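<p>As a toy illustration of this representation (the class and field names below are hypothetical, not vLLM’s real ones), a scheduler with a fixed token budget can produce exactly such a dictionary, chunking a prefill whenever the prompt does not fit in the remaining budget:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class Request:                          # hypothetical stand-in for a request's state
    request_id: str
    num_prompt_tokens: int
    num_computed_tokens: int = 0

def schedule(requests, token_budget):
    """Return {request_id: num_tokens} to process in this step."""
    decisions = {}
    for req in requests:                              # e.g. FCFS order
        if token_budget == 0:
            break
        remaining = req.num_prompt_tokens - req.num_computed_tokens
        num_tokens = min(remaining if remaining > 0 else 1, token_budget)
        decisions[req.request_id] = num_tokens        # prefill chunk, or 1 decode token
        token_budget -= num_tokens
    return decisions

# "B" is decoding (prompt fully computed); "A" still has 12 prompt tokens to prefill.
print(schedule([Request("B", 4, 4), Request("A", 12)], token_budget=8))
# {'B': 1, 'A': 7}   -- A's prefill is chunked; the rest is scheduled in later steps.
</code></pre></div></div>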
<h2 id="3-zero-overhead-prefix-caching">3. Zero-Overhead Prefix Caching</h2>
<p>vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to noticeably decreased performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. With these optimizations, V1’s prefix caching introduces near-zero performance degradation, even when the cache hit rate is 0%.</p>
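<p>To give a flavor of the approach (a toy sketch, not V1’s actual data structures; the block size and class names are illustrative), hash-based prefix caching can be built from chained block hashes plus an ordered map that gives constant-time LRU eviction:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import OrderedDict

BLOCK_SIZE = 16                       # tokens per KV-cache block (illustrative)

def block_hashes(token_ids):
    """Chain the hashes so each block's hash covers its entire prefix."""
    prev, hashes = None, []
    num_full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE
    for i in range(0, num_full, BLOCK_SIZE):
        prev = hash((prev, tuple(token_ids[i:i + BLOCK_SIZE])))
        hashes.append(prev)
    return hashes

class ToyPrefixCache:
    """Hash-to-block mapping with O(1) lookup and O(1) LRU eviction."""
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.cached = OrderedDict()   # block_hash -> block_id, ordered by recency

    def lookup(self, block_hash):
        if block_hash in self.cached:             # cache hit: reuse the KV block
            self.cached.move_to_end(block_hash)
            return self.cached[block_hash]
        return None

    def insert(self, block_hash, block_id):
        if len(self.cached) >= self.num_blocks:   # evict the least recently used
            self.cached.popitem(last=False)
        self.cached[block_hash] = block_id
</code></pre></div></div>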
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_prefix_caching.png" width="90%" />
</picture>
</p>
<p>Here are some benchmark results. In our experiments, we observed that V1’s prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance several times over when the cache hit rate is high. <strong>Thanks to the near-zero overhead, we now enable prefix caching by default in V1.</strong></p>
<h2 id="4-clean-architecture-for-tensor-parallel-inference">4. Clean Architecture for Tensor-Parallel Inference</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_tp_architecture.png" width="60%" />
</picture>
</p>
<p>vLLM V1 introduces a clean and efficient architecture for tensor-parallel inference, effectively addressing the limitations of V0. In V0, the scheduler and Worker 0 are colocated within the same process to reduce the inter-process communication overhead when broadcasting input data to workers. However, this design introduces an asymmetric architecture, increasing complexity. V1 overcomes this by caching request states on the worker side and transmitting only incremental updates (diffs) at each step. This optimization minimizes inter-process communication, allowing the scheduler and Worker 0 to operate in separate processes, resulting in a clean, symmetric architecture. Moreover, V1 abstracts away most distributed logic, enabling workers to operate the same way for both single-GPU and multi-GPU setups.</p>
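<p>A minimal sketch of the “send only diffs” idea (the names and diff format are ours, not vLLM’s): the scheduler ships a request’s full state once, and afterwards only the per-step scheduling decision, while each worker keeps its own cached copy:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class ToyWorkerState:
    """Each worker process keeps cached request states and applies diffs."""
    def __init__(self):
        self.requests = {}                         # request_id -> cached state

    def apply(self, diff):
        for req_id, state in diff.get("new_requests", {}).items():
            self.requests[req_id] = state          # full state is sent only once
        for req_id in diff.get("finished", []):
            self.requests.pop(req_id, None)
        # The recurring per-step payload is just {request_id: num_tokens}.
        return diff["num_scheduled_tokens"]

worker = ToyWorkerState()
worker.apply({"new_requests": {"req-0": {"prompt_len": 9}},
              "finished": [], "num_scheduled_tokens": {"req-0": 9}})
print(worker.apply({"new_requests": {}, "finished": [],
                    "num_scheduled_tokens": {"req-0": 1}}))   # {'req-0': 1}
</code></pre></div></div>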
<h2 id="5-efficient-input-preparation">5. Efficient Input Preparation</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/persistent_batch.png" width="50%" />
</picture>
</p>
<p>In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the <a href="https://github.com/InternLM/lmdeploy">Persistent Batch</a> technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively utilizing Numpy operations instead of Python’s native ones.</p>
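<p>The following toy sketch (buffer sizes and names are illustrative, not vLLM’s) shows the gist: keep the batch buffers alive across steps and write only the newly scheduled tokens with vectorized NumPy operations:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

MAX_REQS, MAX_LEN = 256, 4096                      # illustrative buffer sizes
token_ids = np.zeros((MAX_REQS, MAX_LEN), dtype=np.int32)   # persistent buffer
num_tokens = np.zeros(MAX_REQS, dtype=np.int32)

def append_tokens(row, new_ids):
    """Apply a per-step diff: write only the newly scheduled tokens for one request."""
    start = num_tokens[row]
    token_ids[row, start:start + len(new_ids)] = new_ids     # vectorized write
    num_tokens[row] += len(new_ids)

append_tokens(0, [101, 2009, 2003])    # a prefill chunk for the request in row 0
append_tokens(0, [1037])               # next step: a single decoded token
print(token_ids[0, :num_tokens[0]])    # [ 101 2009 2003 1037]
</code></pre></div></div>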
<h2 id="6-torchcompile-and-piecewise-cuda-graphs">6. torch.compile and Piecewise CUDA Graphs</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/torch_compile_cuda_graph.png" width="70%" />
</picture>
</p>
<p>V1 leverages vLLM’s <code class="language-plaintext highlighter-rouge">torch.compile</code> integration to automatically optimize the model. This allows V1 to efficiently support a wide variety of models while minimizing the need of writing custom kernels. Furthermore, V1 introduces <em>piecewise CUDA graphs</em> to alleviate the limitations of CUDA graphs. We are preparing dedicated blog posts on the torch.compile integration and piecewise CUDA graphs, so <strong>stay tuned for more updates</strong>!</p>
<h2 id="7-enhanced-support-for-multimodal-llms">7. Enhanced Support for Multimodal LLMs</h2>
<p>vLLM V1 treats multimodal large language models (MLLMs) as first-class citizens and introduces several key improvements in their support.</p>
<p>First, V1 optimizes multimodal input preprocessing by moving it to a non-blocking process. For example, image files (e.g., JPG or PNG) must be converted into tensors of pixel values, cropped, and transformed before being fed into the model. This preprocessing can consume significant CPU cycles, possibly leaving the GPU idle. To address this, V1 offloads the preprocessing task to a separate process, preventing it from blocking the GPU worker, and adds a preprocessing cache so that processed inputs can be reused across requests if they share the same multimodal input.</p>
<p>Second, V1 introduces prefix caching for multimodal inputs. In addition to the hash of token IDs, image hashes are used to identify the KV cache for image inputs. This improvement is especially beneficial for multi-turn conversations that include image inputs.</p>
<p>Third, V1 enables chunked-prefill scheduling for MLLMs with the “encoder cache.” In V0, image inputs and text inputs had to be processed in the same step because the LLM decoder’s image tokens depend on the vision embeddings, which are discarded after the step. With the encoder cache, V1 temporarily stores the vision embeddings, allowing the scheduler to split the text inputs into chunks and process them across multiple steps without needing to regenerate the vision embeddings at every step.</p>
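<p>As a rough sketch of the caching ideas above (the helper names are ours, not vLLM’s), processed pixel tensors can be cached by a content hash of the raw image, and that same hash can be folded into the KV-cache block keys for multimodal requests:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

_preprocess_cache = {}                 # image content hash -> processed pixel values

def preprocess_image(image_bytes, transform):
    """Decode/resize/normalize an image at most once per unique content hash."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _preprocess_cache:
        _preprocess_cache[key] = transform(image_bytes)      # the expensive CPU work
    return key, _preprocess_cache[key]

def multimodal_block_hash(prev_hash, token_chunk, image_hash=None):
    """Image hashes join the token IDs in the prefix-cache key for MLLM requests."""
    return hash((prev_hash, tuple(token_chunk), image_hash))
</code></pre></div></div>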
<h2 id="8-flashattention-3">8. FlashAttention 3</h2>
<p>The final piece of the puzzle for vLLM V1 was integrating <a href="https://arxiv.org/abs/2407.08608">FlashAttention 3</a>. Given the high level of dynamism in V1—such as combining prefill and decode within the same batch—a flexible and high-performance attention kernel was essential. FlashAttention 3 effectively addresses this requirement, offering robust support for a wide range of features while maintaining excellent performance across diverse use cases.</p>
<h1 id="performance">Performance</h1>
<p>Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to <strong>1.7x higher throughput</strong> compared to V0 (<em>without multi-step scheduling</em>).
These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1’s enhanced support for VLMs.</p>
<ul>
<li><strong>Text Models: Llama 3.1 8B & Llama 3.3 70B</strong></li>
</ul>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_llama.png" width="100%" />
</picture>
</p>
<p>We measured the performance of vLLM V0 and V1 on Llama 3.1 8B and Llama 3.3 70B models using the ShareGPT dataset.
V1 demonstrated consistently lower latency than V0 especially at high QPS, thanks to the higher throughput it achieves.
Given that the kernels used for V0 and V1 are almost identical, the performance difference is mainly due to the architectural improvements (reduced CPU overheads) in V1.</p>
<ul>
<li><strong>Vision-language Models: Qwen2-VL</strong></li>
</ul>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_qwen2vl.png" width="60%" />
</picture>
</p>
<p>We evaluated the performance on VLMs by testing Qwen2-VL using the <a href="https://arxiv.org/abs/2412.08687">VisionArena</a> dataset.
V1 delivered even larger speedups over V0, thanks to its improved VLM support, driven by two key improvements: offloading input processing to a separate process and implementing more flexible scheduling for multimodal queries.
We would also like to point out that prefix caching is now natively supported for multimodal models in V1, but we will skip the benchmark results here.</p>
<ul>
<li><strong>Looking Forward</strong></li>
</ul>
<p>While these improvements are significant, we view them as just the beginning.
The redesigned architecture provides a solid foundation that will enable rapid development of new features.
We look forward to sharing additional enhancements in the coming weeks.
Stay tuned for more updates!</p>
<h1 id="limitations--future-work">Limitations & Future Work</h1>
<p>While vLLM V1 shows promising results, it is still in its alpha stage and lacks several features from V0. Here is where things currently stand:</p>
<p><strong>Model Support:</strong><br />
V1 supports decoder-only Transformers like Llama, mixture-of-experts (MoE) models like Mixtral, and several VLMs such as Qwen2-VL. All quantization methods are supported. However, V1 currently does not support encoder-decoder architectures like multimodal Llama 3.2, Mamba-based models like Jamba, or embedding models. Please check out <a href="https://docs.vllm.ai/en/latest/models/supported_models.html">our documentation</a> for a more detailed list of the supported models.</p>
<p><strong>Feature Limitations:</strong><br />
V1 currently lacks support for log probs and prompt log probs sampling parameters, pipeline parallelism, structured decoding, speculative decoding, Prometheus metrics, and LoRA. We are actively working to close this feature gap and add brand-new optimizations to the V1 engine.</p>
<p><strong>Hardware Support:</strong><br />
V1 currently supports only Ampere or later NVIDIA GPUs. We are actively working to extend support to other hardware backends such as TPU.</p>
<p>Finally, please note that you can continue using V0 and maintain backward compatibility by not setting <code class="language-plaintext highlighter-rouge">VLLM_USE_V1=1</code>.</p>
<h1 id="how-to-get-started">How to Get Started</h1>
<p>To use vLLM V1:</p>
<ol>
<li>Install the latest version of vLLM with <code class="language-plaintext highlighter-rouge">pip install vllm --upgrade</code>.</li>
<li><strong>Set the environment variable <code class="language-plaintext highlighter-rouge">export VLLM_USE_V1=1</code>.</strong></li>
<li>Use vLLM’s <a href="https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic.py">Python API</a> or OpenAI-compatible server (<code class="language-plaintext highlighter-rouge">vllm serve <model-name></code>). You don’t need any change to the existing API.</li>
</ol>
<p>Please try it out and share your feedback!</p>
<h1 id="acknowledgment">Acknowledgment</h1>
<p>We gratefully acknowledge that the design of vLLM V1 builds upon and enhances several open-source LLM inference engines, including <a href="https://github.com/ModelTC/lightllm">LightLLM</a>, <a href="https://github.com/InternLM/lmdeploy">LMDeploy</a>, <a href="https://github.com/sgl-project/sglang">SGLang</a>, <a href="https://github.com/huggingface/text-generation-inference">TGI</a>, and <a href="https://github.com/NVIDIA/TensorRT-LLM">TRT-LLM</a>. These engines have significantly influenced our work, and we have gained valuable insights from them.</p>
<p>The V1 re-architecture is a continued joint effort across the entire vLLM team and community. Below is an incomplete list of contributors to this milestone:</p>
<ul>
<li>UC Berkeley, Neural Magic (now Red Hat), Anyscale, and Roblox mainly drove the effort together.</li>
<li><a href="https://github.com/WoosukKwon">Woosuk Kwon</a> initiated the project and implemented the scheduler and model runner.</li>
<li><a href="https://github.com/robertgshaw2-redhat">Robert Shaw</a> implemented the optimized execution loop and API server.</li>
<li><a href="https://github.com/comaniac">Cody Yu</a> implemented efficient prefix caching for text and image inputs.</li>
<li><a href="https://github.com/ywang96">Roger Wang</a> led the overall enhanced MLLM support in V1.</li>
<li><a href="https://github.com/youkaichao">Kaichao You</a> led the torch.compile integration and implemented the piecewise CUDA graphs.</li>
<li><a href="https://github.com/tlrmchlsmth">Tyler Michael Smith</a> implemented the tensor parallelism support with Python multiprocessing.</li>
<li><a href="https://github.com/ruisearch42">Rui Qiao</a> implemented the tensor parallelism support with Ray and is implementing pipeline parallelism support.</li>
<li><a href="https://github.com/LucasWilkinson">Lucas Wilkinson</a> added support for FlashAttention 3.</li>
<li><a href="https://github.com/alexm-redhat">Alexander Matveev</a> implemented the optimized preprocessor for multimodal inputs and is implementing TPU support.</li>
<li><a href="https://github.com/sroy745">Sourashis Roy</a> implemented the logit penalties in the sampler.</li>
<li><a href="https://github.com/DarkLight1337">Cyrus Leung</a> led the MLLM input processing refactoring effort and helped its integration to V1.</li>
<li><a href="https://github.com/russellb">Russell Bryant</a> addressed several multiprocess-related issues.</li>
<li><a href="https://github.com/njhill">Nick Hill</a> optimized the engine loop and API server.</li>
<li><a href="https://github.com/rickyyx">Ricky Xu</a> and <a href="https://github.com/heheda12345">Chen Zhang</a> helped refactor the KV cache manager.</li>
<li><a href="https://github.com/jeejeelee">Jie Li</a> and <a href="https://github.com/mgoin">Michael Goin</a> helped with MLLM support and optimization.</li>
<li><a href="https://github.com/aarnphm">Aaron Pham</a> is implementing the structured decoding support.</li>
<li><a href="https://github.com/varun-sundar-rabindranath">Varun Sundar Rabindranath</a> is implementing the multi-LoRA support.</li>
<li><a href="https://github.com/afeldman-nm">Andrew Feldman</a> is implementing the log probs and prompt log probs support.</li>
<li><a href="https://github.com/LiuXiaoxuanPKU">Lily Liu</a> is implementing the speculative decoding support.</li>
<li><a href="https://github.com/KuntaiDu">Kuntai Du</a> is implementing the prefill disaggregation and KV Cache transfer support.</li>
<li><a href="https://github.com/simon-mo">Simon Mo</a> and <a href="https://github.com/zhuohan123">Zhuohan Li</a> contributed to the V1 system design.</li>
</ul>]]></content><author><name>vLLM Team</name></author><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.vllm.ai/assets/figures/v1/vLLM_V1_Logo.png" /><media:content medium="image" url="https://blog.vllm.ai/assets/figures/v1/vLLM_V1_Logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”</title><link href="https://blog.vllm.ai/2025/01/21/stack-release.html" rel="alternate" type="text/html" title="High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”" /><published>2025-01-21T00:00:00-08:00</published><updated>2025-01-21T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/21/stack-release</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/21/stack-release.html"><![CDATA[<p><br /></p>
<h2 id="tldr">TL;DR</h2>
<ul>
<li><strong>vLLM</strong> boasts the largest open-source community, but what does it take to transform vLLM from the best single-node LLM engine to a premier LLM serving system?</li>
<li><strong>Today, we release “vLLM production-stack”</strong>, a vLLM-based full inference stack that introduces two major advantages:
<ul>
<li><strong>10x better performance</strong> (3-10x lower response delay & 2-5x higher throughput) with prefix-aware request routing and KV-cache sharing.</li>
<li><strong>Easy cluster deployment</strong> with built-in support for fault tolerance, autoscaling, and observability.</li>
</ul>
</li>
<li>And the best part? It’s <strong>open-source</strong>—so everyone can get started right away! <a href="https://github.com/vllm-project/production-stack">[<strong>https://github.com/vllm-project/production-stack</strong>]</a></li>
</ul>
<h1 id="the-context">The Context</h1>
<!-- Over the past year, LLM inference has raced to the forefront, powering everything from chatbots to code assistants and beyond. It’s quickly becoming critical infrastructure, much like the cloud was to big data, cellular was to mobile apps, and CDNs were (and still are!) to the broader Internet. -->
<p><em>In the AI arms race, it’s no longer just about who has the best model—it’s about <strong>who has the best LLM serving system</strong>.</em></p>
<p><strong>vLLM</strong> has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on <strong>single-node</strong> deployments.</p>
<p>How do we extend its power into a <strong>full-stack</strong> inference system that any organization can deploy at scale with <em>high reliability</em>, <em>high throughput</em>, and <em>low latency</em>? That’s precisely why the LMCache team and the vLLM team built <strong>vLLM production-stack</strong>.</p>
<div align="center">
<img src="/assets/figures/stack/stack-thumbnail.png" alt="Icon" style="width: 60%; vertical-align:middle;" />
</div>
<h1 id="introducing-vllm-production-stack">Introducing “<em>vLLM Production-Stack</em>”</h1>
<p><strong>vLLM Production-stack</strong> is an open-source <strong>reference implementation</strong> of an <strong>inference stack</strong> built on top of vLLM, designed to run seamlessly on a cluster of GPU nodes. It adds four critical functionalities that complement vLLM’s native strengths:</p>
<ul>
<li><strong>KV cache sharing & storage</strong> to speed up inference when context is reused (powered by the <a href="https://github.com/LMCache/LMCache"><strong>LMCache</strong></a> project).</li>
<li><strong>Prefix-aware routing</strong> that sends queries to the vLLM instance already holding the relevant context KV cache.</li>
<li><strong>Observability</strong> of individual engine status and query-level metrics (TTFT, TBT, throughput).</li>
<li><strong>Autoscaling</strong> to handle dynamics of workloads.</li>
</ul>
<h3 id="comparison-with-alternatives">Comparison with Alternatives:</h3>
<p>Below is a quick snapshot comparing vLLM production-stack with its closest counterparts:</p>
<div align="center">
<img src="/assets/figures/stack/stack-table.png" alt="Icon" style="width: 90%; vertical-align:middle;" />
</div>
<h3 id="the-design">The Design</h3>
<p>The vLLM production-stack architecture builds on top of vLLM’s powerful single-node engine to provide a cluster-wide solution.</p>
<p>At a high level:</p>
<ul>
<li>Applications send LLM inference requests.</li>
<li>Prefix-aware routing checks if the requested context is already cached within the memory pool of one instance. It then forwards the request to the node with the pre-computed cache (a minimal routing sketch is shown after the figure below).</li>
<li>Autoscaling and a cluster manager watch the overall load and spin up new vLLM nodes if needed.</li>
<li>Observability modules gather metrics like TTFT (Time-To-First-Token), TBT (Time-Between-Tokens), and throughput, giving you real-time insights into your system’s health.</li>
</ul>
<div align="center">
<img src="/assets/figures/stack/stack-overview-2.png" alt="Icon" style="width: 90%; vertical-align:middle;" />
</div>
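<p>As a minimal, hypothetical sketch of prefix-aware routing (not the production-stack router itself; the chunk size, endpoint URLs, and round-robin fallback are assumptions), a router can remember which prompt-prefix hashes each instance has served and prefer the instance with the longest match:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

CHUNK = 512                            # characters per prefix chunk (illustrative)

class PrefixRouter:
    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.seen = {ep: set() for ep in endpoints}   # endpoint -> served prefix hashes
        self.rr = 0

    def _prefix_hashes(self, prompt):
        return [hashlib.sha256(prompt[:i + CHUNK].encode()).hexdigest()
                for i in range(0, len(prompt), CHUNK)]

    def route(self, prompt):
        hashes = self._prefix_hashes(prompt)
        # Prefer the instance that already holds the longest matching prefix cache.
        best = max(self.endpoints,
                   key=lambda ep: sum(h in self.seen[ep] for h in hashes))
        if not any(h in self.seen[best] for h in hashes):    # no hit: fall back to RR
            best = self.endpoints[self.rr % len(self.endpoints)]
            self.rr += 1
        self.seen[best].update(hashes)
        return best

router = PrefixRouter(["http://vllm-0:8000", "http://vllm-1:8000"])
print(router.route("You are a helpful assistant. " * 40))    # first request: round robin
</code></pre></div></div>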
<h1 id="advantage-1-easy-deployment">Advantage #1: Easy Deployment</h1>
<p>Use the Helm chart to deploy the vLLM production-stack to your k8s cluster by running a single command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo helm repo add llmstack-repo https://lmcache.github.io/helm/ &&\
sudo helm install llmstack llmstack-repo/vllm-stack
</code></pre></div></div>
<p>For more details, please refer to the detailed README in the <a href="https://github.com/vllm-project/production-stack">vLLM production-stack repo</a>. <a href="https://github.com/LMCache/LMStack/tree/main/tutorials">Tutorials</a> on setting up a k8s cluster and customizing Helm charts are also available.</p>
<h1 id="advantage-2-better-performance">Advantage #2: Better Performance</h1>
<p>We conducted a benchmark with a multi-round Q&A workload on the vLLM production-stack and other setups, including vLLM + KServe and a commercial endpoint service.
The results show that the vLLM production-stack outperforms the other setups across key metrics (time to first token and inter-token latency).</p>
<div align="center">
<img src="/assets/figures/stack/stack-ttft.png" alt="Icon" style="width: 60%; vertical-align:middle;" />
</div>
<div align="center">
<img src="/assets/figures/stack/stack-itl.png" alt="Icon" style="width: 60%; vertical-align:middle;" />
</div>
<h1 id="advantage-3-effortless-monitoring">Advantage #3: Effortless Monitoring</h1>
<p>Track your LLM inference cluster in real time with key metrics, including latency distributions, number of requests over time, and KV cache hit rate.</p>
<div align="center">
<img src="/assets/figures/stack/stack-panel.png" alt="Icon" style="width: 70%; vertical-align:middle;" />
</div>
<h2 id="conclusion">Conclusion</h2>
<p>We’re thrilled to unveil <strong>vLLM Production Stack</strong>—the next step in transforming vLLM from a best-in-class single-node engine into a full-scale LLM serving system.
We believe the vLLM production-stack will open new doors for organizations seeking to build, test, and deploy LLM applications at scale without sacrificing performance or simplicity.</p>
<p>If you’re as excited as we are, don’t wait!</p>
<ul>
<li><strong>Clone the repo: <a href="https://github.com/vllm-project/production-stack">https://github.com/vllm-project/production-stack</a></strong></li>
<li><strong>Kick the tires</strong></li>
<li><strong>Let us know what you think!</strong></li>
<li><strong><a href="https://forms.gle/mQfQDUXbKfp2St1z7">Interest Form</a></strong></li>
</ul>
<p>Join us to build a future where every application can harness the power of LLM inference—reliably, at scale, and without breaking a sweat.
<em>Happy deploying!</em></p>
<p>Contacts:</p>
<ul>
<li><strong>vLLM <a href="https://slack.vllm.ai/">slack</a></strong></li>
<li><strong>LMCache <a href="https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ">slack</a></strong></li>
</ul>]]></content><author><name>LMCache Team</name></author><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.vllm.ai/assets/figures/stack/stack-thumbnail.png" /><media:content medium="image" url="https://blog.vllm.ai/assets/figures/stack/stack-thumbnail.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Structured Decoding in vLLM: a gentle introduction</title><link href="https://blog.vllm.ai/2025/01/14/struct-decode-intro.html" rel="alternate" type="text/html" title="Structured Decoding in vLLM: a gentle introduction" /><published>2025-01-14T00:00:00-08:00</published><updated>2025-01-14T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/14/struct-decode-intro</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/14/struct-decode-intro.html"><![CDATA[<p><strong>TL/DR</strong>:</p>
<ul>
<li>Structured decoding allows precise control over LLM output formats</li>
<li>vLLM now supports both <a href="https://github.com/dottxt-ai/outlines">outlines</a> and <a href="https://github.com/mlc-ai/xgrammar">XGrammar</a> backends for structured decoding</li>
<li>Recent XGrammar integration brings up to 5x improvement in time per output token (TPOT) under load</li>
<li>Upcoming v1 release focuses on enhanced performance and schedule-level mask broadcasting for mixed-requests batch support</li>
</ul>
<p><em><a href="https://blog.vllm.ai/2023/06/20/vllm.html">vLLM</a> is the high-throughput and efficient inference engine for running <strong>large-language models</strong> (LLMs). In this post, we will explore the annotated history of language models, describe the current state of structured decoding in vLLM, as well as the recent integration with <a href="https://github.com/vllm-project/vllm/pull/10785">XGrammar</a>, and <a href="https://github.com/vllm-project/vllm/issues/8779">share our tentative roadmap for future improvements</a>.</em></p>
<blockquote>
<p>We would also invite readers to approach this blog post from a philosophical perspective; in the process, we try to posit that structured decoding represents a fundamental shift in how we think about LLM outputs. It also plays an important role in building complex agentic systems.</p>
</blockquote>
<p>For more information about vLLM, please check out our <a href="https://docs.vllm.ai/en/latest/">documentation</a>.</p>
<h2 id="language-models-a-brief-historical-context">Language models: A brief historical context</h2>
<p>In 1950, Alan Turing proposed that a high-speed digital computer, programmed with rules, could exhibit emergent behaviour of intelligence (Turing, 1950). This led to two main approaches in AI development:</p>
<ol>
<li>
<p>Good Old-Fashioned AI (GOFAI): A paradigm quickly emerged among researchers in the 1950s, where expert systems were designed to replicate the decision-making capabilities of a human specialist<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> (or symbolic reasoning system), referred to by Haugeland as Good Old-Fashioned AI (GOFAI) (Haugeland, 1997). However, it quickly ran into funding problems because its semantic representations could not scale up to generalised tasks (also known as the “AI Winter” (Hendler, 2008)).</p>
</li>
<li>
<p>New-Fangled AI (NFAI): Concurrently, Donald Norman’s Parallel Distributed Processing group (Rumelhart et al., 1986) investigated variations of Rosenblatt’s perceptron (Rosenblatt, 1958), proposing <em>hidden layers</em> within the network, alongside inputs and outputs, to extrapolate appropriate responses based on what the network had learned during the training process. These connectionist networks were often built on top of statistical methods<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Given the abundance of data, and with Moore’s Law<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> resulting in an unprecedented amount of available compute, we now see the complete dominance of connectionist networks in both research and production use cases, most notably variants of <em>decoder-only</em> transformers<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> for <em>text-generation</em> tasks. As such, most modern transformer variants are considered <strong>NFAI</strong> systems.</p>
</li>
</ol>
<p>In summary:</p>
<ul>
<li>GOFAI systems are <em>deterministic</em> and rule-based, with their intentionality injected through explicit programming</li>
<li>NFAI systems are often considered “black-box” models (input in, output out) and are data-driven, given the complex, networked nature of their internal representations</li>
</ul>
<h2 id="why-do-we-need-structured-decoding">Why do we need structured decoding?</h2>
<figure>
<img src="/assets/figures/struct-decode-intro/shogoth-gpt.png" />
<figcaption>
Shoggoth as GPTs. In a sense, RLHF, or any post-training method, is an injection of rules (a GOFAI system) into a large compound AI system
</figcaption>
</figure>
<p>LLMs excel at the following heuristic: given a blob of text, the model will generate a contiguous piece of text that it predicts as the most probable tokens. For example, if you give it a Wikipedia article, the model should produce text consistent with the remainder of said article.</p>
<p>These models work well given the following assumption: the input prompt must be coherent and well-structured around the problem the user wants to solve. In other words, LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON: without guidance, it might produce valid text that nonetheless breaks the JSON specification<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>.</p>
<p>This is where structured decoding comes in. It enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.</p>
<p>Companies like OpenAI have recognized this need, implementing features like <a href="https://platform.openai.com/docs/guides/structured-outputs#json-mode">JSON mode</a> to constrain<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> the output format. If you have built with these capabilities before (such as agentic workflows, function calling, or coding assistants), chances are you are already using structured decoding under the hood.</p>
<blockquote>
<p>Guided decoding is to LLMs what <strong>validation</strong> is to APIs - it acts as a guarantee that what comes out matches what you expect. Guided decoding ensures structural integrity, allowing developers to integrate LLMs into their applications with ease!</p>
</blockquote>
<h2 id="structured-decoding-and-vllm">Structured decoding and vLLM</h2>
<p>In simple terms, structured decoding gives LLMs a “template” to follow. Users provide a schema that “influences” the model’s output, ensuring compliance with the desired structure:</p>
<p><img src="/assets/figures/struct-decode-intro/mermaid-intro.svg" alt="top level view of structure decoding" /></p>
<p>From a technical perspective, an inference engine can modify the probability distribution of next tokens by applying a bias (often via logit masks) over all tokens for any given schema. To apply these biases, <a href="https://github.com/dottxt-ai/outlines">outlines</a> proposed guided generation via finite-state machines (FSM) for any given schema (Willard & Louf, 2023). This allows us to track the current state during decoding and filter out invalid tokens by applying a logit bias to the output.</p>
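<p>To make the mechanism concrete, below is a minimal, self-contained sketch of token-level FSM masking. The toy vocabulary, transition table, and helper names are purely illustrative assumptions for this post; they are not the actual outlines or vLLM implementation, which operates on real tokenizer vocabularies and schemas compiled into much larger automata:</p>
<pre><code class="language-python"># Minimal sketch of FSM-guided decoding (illustrative only).
import math

# Toy vocabulary and a toy FSM that only accepts: '{' '"a"' ':' '1' '}'
VOCAB = ['{', '}', '"a"', ':', '1', 'hello']
FSM = {  # state -> {allowed token -> next state}; state 5 is accepting
    0: {'{': 1},
    1: {'"a"': 2},
    2: {':': 3},
    3: {'1': 4},
    4: {'}': 5},
}

def mask_logits(logits, state):
    """Set the logit of every token that is invalid in `state` to -inf."""
    allowed = FSM.get(state, {})
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]

def greedy_decode(raw_logits_per_step):
    state, output = 0, []
    for logits in raw_logits_per_step:
        masked = mask_logits(logits, state)
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        output.append(VOCAB[best])
        state = FSM[state][VOCAB[best]]  # advance the FSM one token per step
    return output

# A fake "model" that would prefer 'hello' everywhere without the mask.
steps = [[0.1, 0.1, 0.2, 0.1, 0.3, 5.0]] * 5
print(greedy_decode(steps))  # ['{', '"a"', ':', '1', '}']
</code></pre>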
<figure>
<img src="/assets/figures/struct-decode-intro/constrained-json-fsm.webp" />
<figcaption>
courtesy of <a href="https://lmsys.org/blog/2024-02-05-compressed-fsm/" target="_blank">LMSys, 2024</a>.
</figcaption>
</figure>
<p><em>In vLLM, you can use this by passing a JSON schema to the sampling params (either through the Python SDK or HTTP requests), as in the sketch below.</em></p>
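<p>As a hedged illustration of the Python SDK path (class and argument names such as <code class="language-plaintext highlighter-rouge">GuidedDecodingParams</code> follow the vLLM structured-outputs documentation at the time of writing and may differ across versions; the model name is only an example):</p>
<pre><code class="language-python"># Sketch: guided JSON decoding through vLLM's offline Python API.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

json_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # any supported model works here
params = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(json=json_schema),
)
outputs = llm.generate("Describe a user named Alice as JSON.", params)
print(outputs[0].outputs[0].text)  # output is constrained to match the schema
</code></pre>
<p>The OpenAI-compatible server exposes the same capability over HTTP through request fields such as <code class="language-plaintext highlighter-rouge">guided_json</code>.</p>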
<blockquote>
<p>Note: in some cases, it can even <a href="https://blog.dottxt.co/coalescence.html">improve</a> the native decoding performance for LLM!</p>
</blockquote>
<h3 id="previous-limitations-in-vllm">Previous limitations in vLLM</h3>
<p>There are a few limitations in vLLM’s current support of the Outlines backend:</p>
<ol>
<li><strong>Slow decoding</strong>: The FSM has to be constructed at the token level, meaning it can only transition the state one token per step. Therefore, the engine can only decode <em>one</em> token at a time, resulting in slow decoding.</li>
<li><strong>Batch processing bottlenecks</strong>: The implementation in <a href="https://github.com/vllm-project/vllm/blob/80c751e7f68ade3d4c6391a0f3fce9ce970ddad0/vllm/model_executor/guided_decoding/outlines_logits_processors.py">vLLM</a> relies heavily on a logit processor<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. As such, it sits on the critical path of the sampling process. In batching use cases, compiling an FSM per request and computing the mask synchronously means that <strong>all requests</strong> in a given batch get blocked, resulting in high time-to-first-token (TTFT) and lower throughput.
<ul>
<li>We found that compiling the FSM is a relatively expensive task, making it a significant contributor to the increased TTFT.</li>
</ul>
</li>
<li><strong>Performance issues with CFG mode</strong>: With the Outlines integration, while JSON mode is relatively fast, CFG mode runs significantly slower and can occasionally <a href="https://github.com/vllm-project/vllm/issues/10081">crash</a> the engine.</li>
<li><strong>Limited advanced feature support</strong>: Techniques like <a href="https://lmsys.org/blog/2024-02-05-compressed-fsm/">jump-forward decoding</a> are currently not possible with the logit-processor approach, because they require prefilling a set of k next tokens, whereas logit processors can only deal with the next token.</li>
</ol>
<h3 id="integration-with-xgrammar">Integration with XGrammar</h3>
<p><a href="https://github.com/mlc-ai/xgrammar">XGrammar</a> introduces a new technique for batched constrained decoding via a pushdown automaton (PDA). You can think of a PDA as a “collection of FSMs, where each FSM represents a context-free grammar (CFG).” One significant advantage of the PDA is its recursive nature, allowing us to execute multiple state transitions. XGrammar also includes additional <a href="https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar">optimisations</a> (for those who are interested) to reduce grammar compilation overhead.</p>
<p>This advancement addresses <strong>limitation (1)</strong> by moving grammar compilation out of Python into C, utilising <code class="language-plaintext highlighter-rouge">pthread</code>. Additionally, XGrammar lays the groundwork for addressing <strong>limitation (4)</strong> in future releases. Below are performance comparisons between the XGrammar and Outlines backends:</p>
<figure>
<img src="/assets/figures/struct-decode-intro/vllm-new-xgrammar.png" />
<img src="/assets/figures/struct-decode-intro/vllm-xgrammar-decode-time-per-output-token.png" />
<figcaption>
courtesy of Michael Goin (Red Hat).
</figcaption>
</figure>
<p>In vLLM’s v0 architecture, we’ve implemented XGrammar as a <a href="https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/guided_decoding/xgrammar_decoding.py">logit processor</a>, optimizing it with caching for tokenizer data. While the performance improvements are encouraging, we believe there’s still significant room for optimization.</p>
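<p>As a brief, hedged sketch, the backend can be selected explicitly through an engine argument (the name <code class="language-plaintext highlighter-rouge">guided_decoding_backend</code> follows the v0 documentation at the time of writing; check your installed version, as defaults and argument names may change):</p>
<pre><code class="language-python"># Sketch: opting into the XGrammar backend for structured decoding.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # any supported model works here
    guided_decoding_backend="xgrammar",   # select the structured-decoding backend
)
params = SamplingParams(
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]),
)
print(llm.generate("Sentiment of 'vLLM is fast':", params)[0].outputs[0].text)
</code></pre>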
<p>There are still a few usability concerns in the XGrammar v0 integration before it reaches feature parity with all use cases:</p>
<ul>
<li>It does not yet support grammars other than the GBNF format (PR on vLLM: <a href="https://github.com/vllm-project/vllm/pull/10870">github</a>)</li>
<li>It does not yet support regex</li>
<li>It does not yet support complex JSON that uses regex patterns or numeric ranges
<ul>
<li>There are a few PRs trying to cover this usage: one <a href="https://github.com/vllm-project/vllm/pull/10899">bugfix PR on vLLM</a> and one <a href="https://github.com/mlc-ai/xgrammar/pull/106">upstream</a></li>
</ul>
</li>
</ul>
<blockquote>
<p>vLLM now has basic support for XGrammar by default. In cases where we know XGrammar is insufficient to serve the request, we fall back to Outlines.</p>
<p>Note that vLLM also includes support for lm-format-enforcer. However, from our testing we found that in some long-context test cases, lm-format-enforcer fails to enforce correct outputs, and its performance is not up to par with Outlines.</p>
</blockquote>
<h2 id="tentative-plans-for-v1">Tentative plans for v1</h2>
<p>With the release of <a href="https://github.com/vllm-project/vllm/issues/8779">v1</a> on the horizon, we’re working on a tentative plan for structured decoding:</p>
<ol>
<li>Moving guided decoding towards the scheduler level:
<ul>
<li>Reason: At the scheduler level, we have more context about which requests use structured decoding, so structured decoding shouldn’t block other requests within the batch (tentatively addressing <strong>limitation (2)</strong>). In a sense, this moves guided decoding off the critical path.</li>
<li>This would also allow for more natural vertical integration with jump-forward decoding (addressing <strong>limitation (4)</strong>).</li>
</ul>
</li>
<li>Allowing bit-mask calculation in one process instead of on each GPU worker
<ul>
<li>Reason: We can compute this bit-mask once and broadcast it to each GPU worker instead of repeating the computation on every worker (see the sketch after this list).</li>
<li>We will carefully analyze the bandwidth implications of broadcasting masks for every sample of every request that uses guided decoding.</li>
</ul>
</li>
<li>Good baseline for speculative decoding and tool-use
<ul>
<li>Reason: XGrammar includes plans to support tool-use, such that we can move away from Python’s <a href="https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/openai/tool_parsers">tool parser</a>.</li>
<li>Tree scoring in speculative decoding can then use the same API as jump-forward decoding (which depends on the integration of guided decoding at the scheduler level).</li>
</ul>
</li>
</ol>
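<p>As a rough illustration of idea (2) above (this is not vLLM’s implementation; the helper below is hypothetical and assumes an already-initialized <code class="language-plaintext highlighter-rouge">torch.distributed</code> process group), the bit-mask would be computed once and then broadcast to the tensor-parallel workers:</p>
<pre><code class="language-python"># Sketch: compute the token bit-mask once on rank 0, broadcast to all workers.
import torch
import torch.distributed as dist

def get_allowed_token_mask(vocab_size: int) -> torch.Tensor:
    # Hypothetical placeholder for grammar/FSM evaluation on the driver process.
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[:16] = True  # pretend only the first 16 tokens are currently valid
    return mask

def broadcast_bitmask(vocab_size: int) -> torch.Tensor:
    if dist.get_rank() == 0:
        mask = get_allowed_token_mask(vocab_size)   # computed once
    else:
        mask = torch.empty(vocab_size, dtype=torch.bool)
    dist.broadcast(mask, src=0)                     # received by every worker
    return mask

def apply_mask(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Each worker applies the same mask before sampling.
    return logits.masked_fill(~mask, float("-inf"))
</code></pre>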
<p><em>NOTE: if you have any more suggestions, we are more than happy to take them into consideration. Consider joining <a href="https://www.notion.so/bentoml/slack.vllm.ai">vLLM slack</a> via <code class="language-plaintext highlighter-rouge">#feat-structured-output</code>.</em></p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>We want to thank the vLLM team, XGrammar team, <a href="https://github.com/aarnphm">Aaron Pham (BentoML)</a>, <a href="https://github.com/mgoin">Michael Goin (Red Hat)</a>, <a href="https://github.com/xuechendi">Chendi Xue (Intel)</a>, and <a href="https://github.com/russellb">Russell Bryant (Red Hat)</a> for their valuable feedback and collaboration on bringing XGrammar to vLLM and the continuous effort to improve structured decoding in vLLM.</p>
<h2 id="references">References</h2>
<ul>
<li>Bahdanau, D., Cho, K., & Bengio, Y. (2016). <em>Neural Machine Translation by Jointly Learning to Align and Translate</em>. arXiv preprint arXiv:1409.0473</li>
<li>Haugeland, J. (1997). <em>Mind Design II: Philosophy, Psychology, and Artificial Intelligence</em>. The MIT Press. <a href="https://doi.org/10.7551/mitpress/4626.001.0001">https://doi.org/10.7551/mitpress/4626.001.0001</a></li>
<li>Hendler, J. (2008). Avoiding Another AI Winter. <em>IEEE Intelligent Systems</em>, <em>23</em>(2), 2–4. <a href="https://doi.org/10.1109/MIS.2008.20">https://doi.org/10.1109/MIS.2008.20</a></li>
<li>Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. <em>Neural Computation</em>.</li>
<li>Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). <em>Scaling Laws for Neural Language Models</em>. arXiv preprint arXiv:2001.08361</li>
<li>Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). <em>Efficient Estimation of Word Representations in Vector Space</em>. arXiv preprint arXiv:1301.3781</li>
<li>Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. <em>Psychological Review</em>, <em>65</em>(6), 386–408. <a href="https://doi.org/10.1037/h0042519">https://doi.org/10.1037/h0042519</a></li>
<li>Rumelhart, D. E., McClelland, J. L., & the PDP Research Group. (1986). <em>Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations</em>. The MIT Press. <a href="https://doi.org/10.7551/mitpress/5236.001.0001">https://doi.org/10.7551/mitpress/5236.001.0001</a></li>
<li>Shortliffe, E. H. (1974). <em>MYCIN: A Rule-Based Computer Program for Advising Physicians Regarding Antimicrobial Therapy Selection</em> (Technical Report STAN-CS-74-465). Stanford University.</li>
<li>Statistical Machine Translation. (n.d.). <em>IBM Models</em>. Statistical Machine Translation Survey. <a href="http://www2.statmt.org/survey/Topic/IBMModels">http://www2.statmt.org/survey/Topic/IBMModels</a></li>
<li>Turing, A. M. (1950). Computing Machinery and Intelligence. <em>Mind</em>, <em>LIX</em>(236), 433–460. <a href="https://doi.org/10.1093/mind/LIX.236.433">https://doi.org/10.1093/mind/LIX.236.433</a></li>
<li>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2023). <em>Attention Is All You Need</em>. arXiv preprint arXiv:1706.03762</li>
<li>Willard, B. T., & Louf, R. (2023). <em>Efficient Guided Generation for Large Language Models</em>. arXiv preprint arXiv:2307.09702</li>
</ul>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Allen Newell and Herbert Simon’s work at RAND initially showed that computers can simulate important aspects of intelligence.</p>
<p>Another notable application was found in the medical domain (Haugeland, 1997). MYCIN, developed at Stanford University in the 1970s, diagnosed and recommended treatments for blood infections (Shortliffe, 1974). MYCIN’s developers recognized the importance of justifying recommendations, implementing what were known as “rule traces” to explain the system’s reasoning in human-understandable terms. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>In the 1990s, IBM released a sequence of complex statistical models trained to perform machine translation <a href="https://en.wikipedia.org/wiki/IBM_alignment_models">tasks</a> (Statistical Machine Translation, n.d.) (see also: this <a href="https://www.cs.cornell.edu/courses/cs5740/2017sp/lectures/08-alignments.pdf">lecture</a> from Cornell).</p>
<p>In 2001, bag-of-words (BoW) model variants were trained on 0.3B tokens and were considered SOTA at the time (Mikolov et al., 2013). These earlier works proved to the research community that statistical modelling triumphs over its symbolic counterpart for language processing, given that it can capture general patterns in large corpora of text. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>In 2017, the landmark paper “Attention Is All You Need” introduced the Transformer architecture (Vaswani et al., 2023) for neural machine translation tasks, which is based on the attention mechanism first proposed by Bahdanau et al. (2016).</p>
<p>OpenAI then introduced the scaling laws for neural language models (Kaplan et al., 2020), which set off the race towards building these systems based on foundational language models. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Prior to attention-based transformers, seq-to-seq models used RNNs given their support for longer context lengths and better memory. However, RNNs are more susceptible to vanishing/exploding gradients compared to feed-forward networks, and so LSTMs (Hochreiter & Schmidhuber, 1997) were proposed to solve this problem. Yet, one of the main problems with LSTMs is that they tend to have poor memory recall for data they have seen many steps earlier.</p>
<p>The Attention paper addresses this problem by encoding additional positional data into the inputs. The paper also proposed an encoder-decoder architecture for translation tasks; however, most text-generation models nowadays are decoder-only, given their superior performance on zero-shot tasks.</p>
<p>One of the many reasons why attention-based transformers work better than LSTMs is that transformers are very scalable and hardware-aware (you can’t just arbitrarily add more LSTM blocks and hope for better long-term retention). For more information, please refer back to the original paper. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>One might argue that we can reliably achieve this through few-shot prompting, i.e. “Give me a JSON that yields the address of users. Example output can be …”. However, there is no guarantee that the generated output is valid JSON, because these models are probabilistic systems: they “sample” the next result based on the distribution of the data they were trained on.</p>
<p>One might also argue that one should use models specifically fine-tuned for JSON output in such cases. However, fine-tuning often requires extensive training and a lot more labor to curate data, monitor progress, and perform evaluation, which is a huge resource cost that not everyone can afford. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Note that the phrases “structured decoding”, “constrained decoding”, and “guided decoding” are used interchangeably; they all refer to the same mechanism of “using a format for the model to structurally sample outputs.” <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>See this <a href="https://huggingface.co/blog/logits-processor-zoo">blog post</a> from HuggingFace for using logit processors to control the generation process. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>]]></content><author><name>Guest Post by BentoML and Red Hat</name></author><summary type="html"><![CDATA[TL/DR:]]></summary></entry><entry><title type="html">vLLM 2024 Retrospective and 2025 Vision</title><link href="https://blog.vllm.ai/2025/01/10/vllm-2024-wrapped-2025-vision.html" rel="alternate" type="text/html" title="vLLM 2024 Retrospective and 2025 Vision" /><published>2025-01-10T00:00:00-08:00</published><updated>2025-01-10T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/10/vllm-2024-wrapped-2025-vision</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/10/vllm-2024-wrapped-2025-vision.html"><![CDATA[<p>The vLLM community achieved remarkable growth in 2024, evolving from a specialized inference engine to become the de facto serving solution for the open-source AI ecosystem. This transformation is reflected in our growth metrics:</p>
<ul>
<li>GitHub stars grew from 14,000 to 32,600 (2.3x)</li>
<li>Contributors expanded from 190 to 740 (3.8x)</li>
<li>Monthly downloads surged from 6,000 to 27,000 (4.5x)</li>
<li>GPU hours increased approximately 10x over the last six months</li>
<li>Explore more usage data at <a href="https://2024.vllm.ai">https://2024.vllm.ai</a></li>
</ul>
<p>vLLM has established itself as the leading open-source LLM serving and inference engine, with widespread adoption in production applications (e.g., powering Amazon Rufus and LinkedIn AI features). Our bi-monthly meetups have become strategic gatherings for partnerships with industry leaders like IBM, AWS, and NVIDIA, marking our progress toward becoming the universal serving solution for the open-source AI ecosystem. Read on for more details about vLLM’s 2024 achievements and 2025 roadmap!</p>
<p><em>This blog is based on the 16th session of the bi-weekly <a href="https://hubs.li/Q02TFDTT0">vLLM Office Hours</a>. Watch the recording <a href="https://www.youtube.com/watch?v=xmz8lHsrbGM">here</a>.</em></p>
<hr />
<h2 id="2024-achievements-scaling-models-hardware-and-features">2024 Achievements: Scaling Models, Hardware, and Features</h2>
<h3 id="community-contributions-and-growth">Community Contributions and Growth</h3>
<figure>
<img src="/assets/figures/vllm-2024-wrapped-2025-roadmap/vllm-contributor-groups.png" />
<figcaption>
vLLM Main Contributor Groups (by Commits)
</figcaption>
</figure>
<p>2024 was an exceptional year for vLLM! Our contribution community has expanded dramatically to include:</p>
<ul>
<li>15+ full-time contributors across 6+ organizations</li>
<li>20+ active organizations as key stakeholders and sponsors</li>
<li>Contributions from top institutions including UC Berkeley, Neural Magic, Anyscale, Roblox, IBM, AMD, Intel, and NVIDIA, as well as individual developers worldwide</li>
<li>A thriving ecosystem connecting model creators, hardware vendors, and optimization developers</li>
<li>Well-attended bi-weekly office hours facilitating transparency, community growth, and strategic partnerships</li>
</ul>
<p>These numbers reflect more than growth—they demonstrate vLLM’s role as critical infrastructure in the AI ecosystem, supporting everything from research prototypes to production systems serving millions of users.</p>
<h3 id="expanding-model-support">Expanding Model Support</h3>
<figure>
<img src="/assets/figures/vllm-2024-wrapped-2025-roadmap/model-architecture-serving-usage.png" />
<figcaption>
Usage by Model Architecture in Serving
</figcaption>
</figure>
<p>At the beginning of 2024, vLLM supported only a handful of models. By year’s end, the project had evolved to support performant inference for almost <a href="https://docs.vllm.ai/en/latest/models/supported_models.html"><strong>100 model architectures</strong></a>, spanning nearly every prominent open-source large language model (LLM), as well as multimodal (image, audio, video), encoder-decoder, speculative decoding, classification, embedding, and reward models. Notably, vLLM introduced production support for state-space language models, exploring the future of non-transformer language models.</p>
<h3 id="broadening-hardware-compatibility">Broadening Hardware Compatibility</h3>
<figure>
<img src="/assets/figures/vllm-2024-wrapped-2025-roadmap/gpu-hours-by-vendor.png" />
<figcaption>
GPU Hours Breakdown by Hardware Vendor
</figcaption>
</figure>
<p>From the initial hardware target of NVIDIA A100 GPUs, vLLM has expanded to support:</p>
<ul>
<li><strong>NVIDIA GPUs:</strong> First-class optimizations for H100, with support for every NVIDIA GPU from V100 and newer.</li>
<li><strong>AMD GPUs:</strong> Support for MI200, MI300, and Radeon RX 7900 series - with rapidly growing adoption for MI300X.</li>
<li><strong>Google TPUs:</strong> Support for TPU v4, v5p, v5e, and the latest v6e.</li>
<li><strong>AWS Inferentia and Trainium:</strong> Support for trn1/inf2 instances.</li>
<li><strong>Intel Gaudi (HPU) and GPU (XPU):</strong> Leveraging Intel GPU and Gaudi architectures for AI workloads.</li>
<li><strong>CPUs:</strong> Featuring support for a growing list of ISAs - x86, ARM, and PowerPC.</li>
</ul>
<p>vLLM’s hardware compatibility has broadened to address diverse user requirements while incorporating performance improvements. Importantly, vLLM is on the path to ensuring that all models work on all hardware platforms, with all optimizations enabled.</p>
<h3 id="delivering-key-features">Delivering Key Features</h3>
<figure>
<img src="/assets/figures/vllm-2024-wrapped-2025-roadmap/quantization-deployment-percentage.png" />
<figcaption>
Increasing Percentage of vLLM Deployments with Quantization
</figcaption>
</figure>
<p>vLLM’s 2024 development roadmap emphasized performance, scalability, and usability:</p>
<ul>
<li><strong>Weight and Activation Quantization:</strong> Added support for diverse quantization methods and kernels, enabling efficient inference across hardware platforms. Notable integrations include activation quantization for FP8+INT8, Marlin+Machete kernels for GPTQ/AWQ/wNa16, FP8 KV Cache, AQLM, QQQ, HQQ, bitsandbytes, and GGUF. Over 20% of vLLM deployments now use quantization.</li>
<li><strong>Automatic Prefix Caching:</strong> Reduced costs and improved latency for context-heavy applications.</li>
<li><strong>Chunked Prefill:</strong> Enhanced stability of inter-token latency for interactive applications.</li>
<li><strong>Speculative Decoding:</strong> Accelerated token generation through simultaneous token prediction and validation, supporting draft models, n-gram matching in prompts, and MLP speculators like Medusa or EAGLE.</li>
<li><strong>Structured Outputs:</strong> Provided high-performance capabilities for applications requiring specific formats like JSON or pydantic schemas.</li>
<li><strong>Tool Calling:</strong> Enabled models with supported chat templates to generate tool calls autonomously, facilitating data processing and agentic flows.</li>
<li><strong>Distributed Inference:</strong> Introduced pipeline parallelism and disaggregated prefill to effectively scale workloads across GPUs and nodes.</li>
</ul>
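<p>Most of the features above are exposed as opt-in engine arguments. As a hedged sketch (argument names follow the vLLM documentation at the time of writing and occasionally change between releases; the model name is only an example):</p>
<pre><code class="language-python"># Sketch: enabling several 2024 features on a single engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    quantization="fp8",              # weight/activation quantization
    enable_prefix_caching=True,      # automatic prefix caching
    enable_chunked_prefill=True,     # chunked prefill for stable inter-token latency
    tensor_parallel_size=2,          # distributed inference across 2 GPUs
)
print(llm.generate("Hello!", SamplingParams(max_tokens=32))[0].outputs[0].text)
</code></pre>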