<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://blog.vllm.ai/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.vllm.ai/" rel="alternate" type="text/html" /><updated>2025-01-30T11:55:29-08:00</updated><id>https://blog.vllm.ai/feed.xml</id><title type="html">vLLM Blog</title><author><name>© 2025. vLLM Team. All rights reserved.</name></author><entry><title type="html">Introducing vLLM Inference Provider in Llama Stack</title><link href="https://blog.vllm.ai/2025/01/27/intro-to-llama-stack-with-vllm.html" rel="alternate" type="text/html" title="Introducing vLLM Inference Provider in Llama Stack" /><published>2025-01-27T00:00:00-08:00</published><updated>2025-01-27T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/27/intro-to-llama-stack-with-vllm</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/27/intro-to-llama-stack-with-vllm.html"><![CDATA[<p>We are excited to announce that vLLM inference provider is now available in <a href="https://github.com/meta-llama/llama-stack">Llama Stack</a> through the collaboration between the Red Hat AI Engineering team and the Llama Stack team from Meta. This article provides an introduction to this integration and a tutorial to help you get started using it locally or deploying it in a Kubernetes cluster.</p>
<h1 id="what-is-llama-stack">What is Llama Stack?</h1>
<p><img align="right" src="/assets/figures/llama-stack/llama-stack.png" alt="llama-stack-diagram" width="50%" height="50%" /></p>
<p>Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.</p>
<p>Llama Stack focuses on making it easy to build production applications with a variety of models - ranging from the latest Llama 3.3 model to specialized models like Llama Guard for safety and other models. The goal is to provide pre-packaged implementations (aka “distributions”) which can be run in a variety of deployment environments. The Stack can assist you in your entire app development lifecycle - start iterating on local, mobile or desktop and seamlessly transition to on-prem or public cloud deployments. At every point in this transition, the same set of APIs and the same developer experience are available.</p>
<p>Each specific implementation of an API is called a “Provider” in this architecture. Users can swap providers via configuration. vLLM is a prominent example of a high-performance API backing the inference API.</p>
<h1 id="vllm-inference-provider">vLLM Inference Provider</h1>
<p>Llama Stack provides two vLLM inference providers:</p>
<ol>
<li><a href="https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/remote-vllm.html">Remote vLLM inference provider</a> through vLLM’s <a href="https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-completions-api-with-vllm">OpenAI-compatible server</a>;</li>
<li><a href="https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm">Inline vLLM inference provider</a> that runs alongside with Llama Stack server.</li>
</ol>
<p>In this article, we will demonstrate the functionality through the remote vLLM inference provider.</p>
<h1 id="tutorial">Tutorial</h1>
<h2 id="prerequisites">Prerequisites</h2>
<ul>
<li>Linux operating system</li>
<li><a href="https://huggingface.co/docs/huggingface_hub/main/en/guides/cli">Hugging Face CLI</a> if you’d like to download the model via CLI.</li>
<li>OCI-compliant container technologies like <a href="https://podman.io/">Podman</a> or <a href="https://www.docker.com/">Docker</a> (can be specified via the <code class="language-plaintext highlighter-rouge">CONTAINER_BINARY</code> environment variable when running <code class="language-plaintext highlighter-rouge">llama stack</code> CLI commands).</li>
<li><a href="https://kind.sigs.k8s.io/">Kind</a> for Kubernetes deployment.</li>
<li><a href="https://github.com/conda/conda">Conda</a> for managing Python environment.</li>
</ul>
<h2 id="get-started-via-containers">Get Started via Containers</h2>
<h3 id="start-vllm-server">Start vLLM Server</h3>
<p>We first download the “Llama-3.2-1B-Instruct” model using the <a href="https://huggingface.co/docs/huggingface_hub/main/en/guides/cli">Hugging Face CLI</a>. Note that you’ll need to specify your Hugging Face token when logging in.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> /tmp/test-vllm-llama-stack
huggingface-cli login <span class="nt">--token</span> <YOUR-HF-TOKEN>
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct <span class="nt">--local-dir</span> /tmp/test-vllm-llama-stack/.cache/huggingface/hub/models/Llama-3.2-1B-Instruct
</code></pre></div></div>
<p>Next, let’s build the vLLM CPU container image from source. Note that while we use it for demonstration purposes, there are plenty of <a href="https://docs.vllm.ai/en/latest/getting_started/installation/index.html">other images available for different hardware and architectures</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone [email protected]:vllm-project/vllm.git /tmp/test-vllm-llama-stack
cd /tmp/test-vllm-llama-stack/vllm
podman build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
</code></pre></div></div>
<p>We can then start the vLLM container:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>podman run <span class="nt">-it</span> <span class="nt">--network</span><span class="o">=</span>host <span class="se">\</span>
<span class="nt">--group-add</span><span class="o">=</span>video <span class="se">\</span>
<span class="nt">--ipc</span><span class="o">=</span>host <span class="se">\</span>
<span class="nt">--cap-add</span><span class="o">=</span>SYS_PTRACE <span class="se">\</span>
<span class="nt">--security-opt</span> <span class="nv">seccomp</span><span class="o">=</span>unconfined <span class="se">\</span>
<span class="nt">--device</span> /dev/kfd <span class="se">\</span>
<span class="nt">--device</span> /dev/dri <span class="se">\</span>
<span class="nt">-v</span> /tmp/test-vllm-llama-stack/.cache/huggingface/hub/models/Llama-3.2-1B-Instruct:/app/model <span class="se">\</span>
<span class="nt">--entrypoint</span><span class="o">=</span><span class="s1">'["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "/app/model", "--served-model-name", "meta-llama/Llama-3.2-1B-Instruct", "--port", "8000"]'</span> <span class="se">\</span>
vllm-cpu-env
</code></pre></div></div>
<p>We can get a list of models and test a prompt once the model server has started:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
<span class="nt">-d</span> <span class="s1">'{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'</span>
</code></pre></div></div>
<h3 id="start-llama-stack-server">Start Llama Stack Server</h3>
<p>Once we verify that the vLLM server has started successfully and is able to serve requests, we can then build and start the Llama Stack server.</p>
<p>First, we clone the Llama Stack source code and create a Conda environment that includes all the dependencies:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone [email protected]:meta-llama/llama-stack.git /tmp/test-vllm-llama-stack/llama-stack
cd /tmp/test-vllm-llama-stack/llama-stack
conda create -n stack python=3.10
conda activate stack
pip install .
</code></pre></div></div>
<p>Next, we build the container image with <code class="language-plaintext highlighter-rouge">llama stack build</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat > /tmp/test-vllm-llama-stack/vllm-llama-stack-build.yaml << "EOF"
name: vllm
distribution_spec:
description: Like local, but use vLLM for running LLM inference
providers:
inference: remote::vllm
safety: inline::llama-guard
agents: inline::meta-reference
vector_io: inline::faiss
datasetio: inline::localfs
scoring: inline::basic
eval: inline::meta-reference
post_training: inline::torchtune
telemetry: inline::meta-reference
image_type: container
EOF
export CONTAINER_BINARY=podman
LLAMA_STACK_DIR=. PYTHONPATH=. python -m llama_stack.cli.llama stack build --config /tmp/test-vllm-llama-stack/vllm-llama-stack-build.yaml --image-name distribution-myenv
</code></pre></div></div>
<p>Once the container image has been built successfully, we can edit the generated <code class="language-plaintext highlighter-rouge">vllm-run.yaml</code>, save it as <code class="language-plaintext highlighter-rouge">/tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml</code>, and apply the following change to the <code class="language-plaintext highlighter-rouge">models</code> field:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL}
provider_id: vllm
provider_model_id: null
</code></pre></div></div>
<p>Then we can start the Llama Stack Server with the image we built via <code class="language-plaintext highlighter-rouge">llama stack run</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export INFERENCE_ADDR=host.containers.internal
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
export LLAMA_STACK_PORT=5000
LLAMA_STACK_DIR=. PYTHONPATH=. python -m llama_stack.cli.llama stack run \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://$INFERENCE_ADDR:$INFERENCE_PORT/v1 \
--env VLLM_MAX_TOKENS=8192 \
--env VLLM_API_TOKEN=fake \
--env LLAMA_STACK_PORT=$LLAMA_STACK_PORT \
/tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml
</code></pre></div></div>
<p>Alternatively, we can run the following <code class="language-plaintext highlighter-rouge">podman run</code> command instead:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>podman run --security-opt label=disable -it --network host -v /tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml:/app/config.yaml -v /tmp/test-vllm-llama-stack/llama-stack:/app/llama-stack-source \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://$INFERENCE_ADDR:$INFERENCE_PORT/v1 \
--env VLLM_MAX_TOKENS=8192 \
--env VLLM_API_TOKEN=fake \
--env LLAMA_STACK_PORT=$LLAMA_STACK_PORT \
--entrypoint='["python", "-m", "llama_stack.distribution.server.server", "--yaml-config", "/app/config.yaml"]' \
localhost/distribution-myenv:dev
</code></pre></div></div>
<p>Once the Llama Stack server has started successfully, we can test an inference request:</p>
<p>Via Bash:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ChatCompletionResponse(
completion_message=CompletionMessage(
content="Hello! I'm an AI, a conversational AI model. I'm a type of computer program designed to understand and respond to human language. My creators have
trained me on a vast amount of text data, allowing me to generate human-like responses to a wide range of questions and topics. I'm here to help answer any question you
may have, so feel free to ask me anything!",
role='assistant',
stop_reason='end_of_turn',
tool_calls=[]
),
logprobs=None
)
</code></pre></div></div>
<p>Via Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">llama_stack_client</span> <span class="kn">import</span> <span class="n">LlamaStackClient</span>
<span class="n">client</span> <span class="o">=</span> <span class="nc">LlamaStackClient</span><span class="p">(</span><span class="n">base_url</span><span class="o">=</span><span class="sa">f</span><span class="sh">"</span><span class="s">http://localhost:</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="sh">'</span><span class="s">LLAMA_STACK_PORT</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="c1"># List available models
</span><span class="n">models</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="nf">list</span><span class="p">()</span>
<span class="nf">print</span><span class="p">(</span><span class="n">models</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">inference</span><span class="p">.</span><span class="nf">chat_completion</span><span class="p">(</span>
<span class="n">model_id</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="sh">"</span><span class="s">INFERENCE_MODEL</span><span class="sh">"</span><span class="p">],</span>
<span class="n">messages</span><span class="o">=</span><span class="p">[</span>
<span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">system</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">You are a helpful assistant.</span><span class="sh">"</span><span class="p">},</span>
<span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Write a haiku about coding</span><span class="sh">"</span><span class="p">}</span>
<span class="p">]</span>
<span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">completion_message</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Model(identifier='meta-llama/Llama-3.2-1B-Instruct', metadata={}, api_model_type='llm', provider_id='vllm', provider_resource_id='meta-llama/Llama-3.2-1B-Instruct', type='model', model_type='llm')]
Here is a haiku about coding:
Columns of code flow
Logic codes the endless night
Tech's silent dawn rise
</code></pre></div></div>
<h2 id="deployment-on-kubernetes">Deployment on Kubernetes</h2>
<p>Instead of starting the Llama Stack and vLLM servers locally, we can deploy them in a Kubernetes cluster. We’ll use a local Kind cluster for demonstration purposes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
</code></pre></div></div>
<p>Start vLLM server as a Kubernetes Pod and Service (remember to replace <code class="language-plaintext highlighter-rouge"><YOUR-HF-TOKEN></code> with your actual token):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: vllm-models
spec:
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
type: Opaque
data:
token: "<YOUR-HF-TOKEN>"
---
apiVersion: v1
kind: Pod
metadata:
name: vllm-server
labels:
app: vllm
spec:
containers:
- name: llama-stack
image: localhost/vllm-cpu-env:latest
command:
- bash
- -c
- |
MODEL="meta-llama/Llama-3.2-1B-Instruct"
MODEL_PATH=/app/model/$(basename $MODEL)
huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN
huggingface-cli download $MODEL --local-dir $MODEL_PATH --cache-dir $MODEL_PATH
python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --served-model-name $MODEL --port 8000
ports:
- containerPort: 8000
volumeMounts:
- name: llama-storage
mountPath: /app/model
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
volumes:
- name: llama-storage
persistentVolumeClaim:
claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
name: vllm-server
spec:
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
type: NodePort
EOF
</code></pre></div></div>
<p>We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl logs vllm-server
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
</code></pre></div></div>
<p>Then we can modify the previously created <code class="language-plaintext highlighter-rouge">vllm-llama-stack-run.yaml</code> to <code class="language-plaintext highlighter-rouge">/tmp/test-vllm-llama-stack/vllm-llama-stack-run-k8s.yaml</code> with the following inference provider:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>providers:
inference:
- provider_id: vllm
provider_type: remote::vllm
config:
url: http://vllm-server.default.svc.cluster.local:8000/v1
max_tokens: 4096
api_token: fake
</code></pre></div></div>
<p>Once we have defined the run configuration for Llama Stack, we can build an image with that configuration and the server source code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat >/tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s <<EOF
FROM distribution-myenv:dev
RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source
ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml
EOF
podman build -f /tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s /tmp/test-vllm-llama-stack
</code></pre></div></div>
<p>We can then start the Llama Stack server by deploying a Kubernetes Pod and Service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: llama-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
name: llama-stack-pod
labels:
app: llama-stack
spec:
containers:
- name: llama-stack
image: localhost/llama-stack-run-k8s:latest
imagePullPolicy: IfNotPresent
command: ["python", "-m", "llama_stack.distribution.server.server", "--yaml-config", "/app/config.yaml"]
ports:
- containerPort: 5000
volumeMounts:
- name: llama-storage
mountPath: /root/.llama
volumes:
- name: llama-storage
persistentVolumeClaim:
claimName: llama-pvc
---
apiVersion: v1
kind: Service
metadata:
name: llama-stack-service
spec:
selector:
app: llama-stack
ports:
- protocol: TCP
port: 5000
targetPort: 5000
type: ClusterIP
EOF
</code></pre></div></div>
<p>We can check that the Llama Stack server has started:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl logs vllm-server
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: ASGI 'lifespan' protocol appears unsupported.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
</code></pre></div></div>
<p>Now let’s forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl port-forward service/llama-stack-service 5000:5000
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
</code></pre></div></div>
<p>You can learn more about different providers and functionalities of Llama Stack on <a href="https://llama-stack.readthedocs.io">the official documentation</a>.</p>
<h2 id="acknowledgement">Acknowledgement</h2>
<p>We’d like to thank the Red Hat AI Engineering team for the implementation of the vLLM inference providers, contributions to many bug fixes, improvements, and key design discussions. We also want to thank the Llama Stack team from Meta and the vLLM team for their timely PR reviews and bug fixes.</p>]]></content><author><name>Yuan Tang (Red Hat) and Ashwin Bharambe (Meta)</name></author><summary type="html"><![CDATA[We are excited to announce that vLLM inference provider is now available in Llama Stack through the collaboration between the Red Hat AI Engineering team and the Llama Stack team from Meta. This article provides an introduction to this integration and a tutorial to help you get started using it locally or deploying it in a Kubernetes cluster.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.vllm.ai/assets/figures/llama-stack/llama-stack.png" /><media:content medium="image" url="https://blog.vllm.ai/assets/figures/llama-stack/llama-stack.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">vLLM V1: A Major Upgrade to vLLM’s Core Architecture</title><link href="https://blog.vllm.ai/2025/01/27/v1-alpha-release.html" rel="alternate" type="text/html" title="vLLM V1: A Major Upgrade to vLLM’s Core Architecture" /><published>2025-01-27T00:00:00-08:00</published><updated>2025-01-27T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/27/v1-alpha-release</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/27/v1-alpha-release.html"><![CDATA[<p align="center">
<picture>
<img src="/assets/figures/v1/vLLM_V1_Logo.png" width="80%" />
</picture>
</p>
<p>We are thrilled to announce the <strong>alpha release of vLLM V1</strong>, a major upgrade to vLLM’s core architecture. Based on lessons we learned over the past 1.5 years of vLLM development, we revisited key design decisions, consolidated various features, and simplified the codebase to enhance flexibility and scalability. V1 already achieves <strong>state-of-the-art performance</strong> and is set to gain even more optimizations. Best of all, users can enable V1 seamlessly—just set the <code class="language-plaintext highlighter-rouge">VLLM_USE_V1=1</code> environment variable <strong>without any changes to the existing API</strong>. After testing and feedback collection in the coming weeks, we plan to transition V1 into the default engine.</p>
<h1 id="why-vllm-v1">Why vLLM V1?</h1>
<h2 id="learning-from-vllm-v0">Learning from vLLM V0</h2>
<p>Over the past 1.5 years, vLLM has achieved remarkable success in supporting diverse models, features, and hardware backends. However, while our community scaled horizontally, we faced challenges making the systems simple and integrating various optimizations vertically across the stack. Features were often developed independently, making it difficult to combine them effectively and cleanly. Over time, technical debt accumulated, prompting us to revisit our foundational design.</p>
<h2 id="goals-of-v1">Goals of V1</h2>
<p>Based on the above motivation, vLLM V1 is designed to:</p>
<ul>
<li>Provide a <strong>simple, modular, and easy-to-hack codebase</strong>.</li>
<li>Ensure <strong>high performance</strong> with near-zero CPU overhead.</li>
<li><strong>Combine key optimizations</strong> into a unified architecture.</li>
<li>Require <strong>zero configs</strong> by enabling features/optimizations by default.</li>
</ul>
<h2 id="scope-of-v1">Scope of V1</h2>
<p>vLLM V1 introduces a comprehensive re-architecture of its core components, including the scheduler, KV cache manager, worker, sampler, and API server. However, it still shares a lot of code with vLLM V0, such as model implementations, GPU kernels, distributed control plane, and various utility functions. This approach allows V1 to leverage the extensive coverage and stability established by V0 while delivering significant performance enhancements and reducing code complexity.</p>
<h1 id="whats-new-in-vllm-v1">What’s New in vLLM V1?</h1>
<h2 id="1-optimized-execution-loop--api-server">1. Optimized Execution Loop & API Server</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_server_architecture.png" width="60%" />
</picture>
</p>
<p>As a full-fledged continuous batching engine and OpenAI-compatible API server, vLLM’s core execution loop relies on CPU operations to manage request states between model forward passes. As GPUs are getting faster and significantly reducing model execution times, the CPU overhead for tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models like Llama-8B running on NVIDIA H100 GPUs, where execution time on the GPU is as low as ~5ms.</p>
<p>In the <a href="https://blog.vllm.ai/2024/09/05/perf-update.html">v0.6.0 release</a>, vLLM introduced a multiprocessing API server utilizing ZeroMQ for IPC, enabling overlap between the API server and AsyncLLM. vLLM V1 extends this by integrating the multiprocessing architecture deeper into the core of AsyncLLM, creating an isolated <code class="language-plaintext highlighter-rouge">EngineCore</code> execution loop that focuses exclusively on the scheduler and model executor. This design allows for greater overlap of CPU-intensive tasks—such as tokenization, multimodal input processing, de-tokenization, and request streaming—with the core execution loop, thereby maximizing model throughput.</p>
<h2 id="2-simple--flexible-scheduler">2. Simple & Flexible Scheduler</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_scheduling.png" width="60%" />
</picture>
</p>
<p>vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional distinction between “prefill” and “decode” phases by treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions are represented as a simple dictionary, e.g., <code class="language-plaintext highlighter-rouge">{request_id: num_tokens}</code>, which specifies the number of tokens to process for each request at each step. We find that this representation is general enough to support features such as chunked prefills, prefix caching, and speculative decoding. For instance, chunked-prefill scheduling is seamlessly implemented: with a fixed token budget, the scheduler dynamically decides how many tokens to allocate to each request (as shown in the figure above).</p>
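<p>As a toy illustration of this representation (the class and field names below are hypothetical, not vLLM’s real ones), a scheduler with a fixed token budget can produce exactly such a dictionary, chunking a prefill whenever the prompt does not fit in the remaining budget:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class Request:                          # hypothetical stand-in for a request's state
    request_id: str
    num_prompt_tokens: int
    num_computed_tokens: int = 0

def schedule(requests, token_budget):
    """Return {request_id: num_tokens} to process in this step."""
    decisions = {}
    for req in requests:                              # e.g. FCFS order
        if token_budget == 0:
            break
        remaining = req.num_prompt_tokens - req.num_computed_tokens
        num_tokens = min(remaining if remaining > 0 else 1, token_budget)
        decisions[req.request_id] = num_tokens        # prefill chunk, or 1 decode token
        token_budget -= num_tokens
    return decisions

# "B" is decoding (prompt fully computed); "A" still has 12 prompt tokens to prefill.
print(schedule([Request("B", 4, 4), Request("A", 12)], token_budget=8))
# {'B': 1, 'A': 7}   -- A's prefill is chunked; the rest is scheduled in later steps.
</code></pre></div></div>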
<h2 id="3-zero-overhead-prefix-caching">3. Zero-Overhead Prefix Caching</h2>
<p>vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to noticeably decreased performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. With these optimizations, V1’s prefix caching introduces near-zero performance degradation, even when the cache hit rate is 0%.</p>
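<p>To give a flavor of the approach (a toy sketch, not V1’s actual data structures; the block size and class names are illustrative), hash-based prefix caching can be built from chained block hashes plus an ordered map that gives constant-time LRU eviction:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import OrderedDict

BLOCK_SIZE = 16                       # tokens per KV-cache block (illustrative)

def block_hashes(token_ids):
    """Chain the hashes so each block's hash covers its entire prefix."""
    prev, hashes = None, []
    num_full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE
    for i in range(0, num_full, BLOCK_SIZE):
        prev = hash((prev, tuple(token_ids[i:i + BLOCK_SIZE])))
        hashes.append(prev)
    return hashes

class ToyPrefixCache:
    """Hash-to-block mapping with O(1) lookup and O(1) LRU eviction."""
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.cached = OrderedDict()   # block_hash -> block_id, ordered by recency

    def lookup(self, block_hash):
        if block_hash in self.cached:             # cache hit: reuse the KV block
            self.cached.move_to_end(block_hash)
            return self.cached[block_hash]
        return None

    def insert(self, block_hash, block_id):
        if len(self.cached) >= self.num_blocks:   # evict the least recently used
            self.cached.popitem(last=False)
        self.cached[block_hash] = block_id
</code></pre></div></div>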
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_prefix_caching.png" width="90%" />
</picture>
</p>
<p>Here are some benchmark results. In our experiments, we observed that V1’s prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance several times over when the cache hit rate is high. <strong>Thanks to the near-zero overhead, we now enable prefix caching by default in V1.</strong></p>
<h2 id="4-clean-architecture-for-tensor-parallel-inference">4. Clean Architecture for Tensor-Parallel Inference</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_tp_architecture.png" width="60%" />
</picture>
</p>
<p>vLLM V1 introduces a clean and efficient architecture for tensor-parallel inference, effectively addressing the limitations of V0. In V0, the scheduler and Worker 0 are colocated within the same process to reduce the inter-process communication overhead when broadcasting input data to workers. However, this design introduces an asymmetric architecture, increasing complexity. V1 overcomes this by caching request states on the worker side and transmitting only incremental updates (diffs) at each step. This optimization minimizes inter-process communication, allowing the scheduler and Worker 0 to operate in separate processes, resulting in a clean, symmetric architecture. Moreover, V1 abstracts away most distributed logic, enabling workers to operate the same way for both single-GPU and multi-GPU setups.</p>
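<p>A minimal sketch of the “send only diffs” idea (the names and diff format are ours, not vLLM’s): the scheduler ships a request’s full state once, and afterwards only the per-step scheduling decision, while each worker keeps its own cached copy:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class ToyWorkerState:
    """Each worker process keeps cached request states and applies diffs."""
    def __init__(self):
        self.requests = {}                         # request_id -> cached state

    def apply(self, diff):
        for req_id, state in diff.get("new_requests", {}).items():
            self.requests[req_id] = state          # full state is sent only once
        for req_id in diff.get("finished", []):
            self.requests.pop(req_id, None)
        # The recurring per-step payload is just {request_id: num_tokens}.
        return diff["num_scheduled_tokens"]

worker = ToyWorkerState()
worker.apply({"new_requests": {"req-0": {"prompt_len": 9}},
              "finished": [], "num_scheduled_tokens": {"req-0": 9}})
print(worker.apply({"new_requests": {}, "finished": [],
                    "num_scheduled_tokens": {"req-0": 1}}))   # {'req-0': 1}
</code></pre></div></div>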
<h2 id="5-efficient-input-preparation">5. Efficient Input Preparation</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/persistent_batch.png" width="50%" />
</picture>
</p>
<p>In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the <a href="https://github.com/InternLM/lmdeploy">Persistent Batch</a> technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively utilizing Numpy operations instead of Python’s native ones.</p>
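<p>The following toy sketch (buffer sizes and names are illustrative, not vLLM’s) shows the gist: keep the batch buffers alive across steps and write only the newly scheduled tokens with vectorized NumPy operations:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

MAX_REQS, MAX_LEN = 256, 4096                      # illustrative buffer sizes
token_ids = np.zeros((MAX_REQS, MAX_LEN), dtype=np.int32)   # persistent buffer
num_tokens = np.zeros(MAX_REQS, dtype=np.int32)

def append_tokens(row, new_ids):
    """Apply a per-step diff: write only the newly scheduled tokens for one request."""
    start = num_tokens[row]
    token_ids[row, start:start + len(new_ids)] = new_ids     # vectorized write
    num_tokens[row] += len(new_ids)

append_tokens(0, [101, 2009, 2003])    # a prefill chunk for the request in row 0
append_tokens(0, [1037])               # next step: a single decoded token
print(token_ids[0, :num_tokens[0]])    # [ 101 2009 2003 1037]
</code></pre></div></div>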
<h2 id="6-torchcompile-and-piecewise-cuda-graphs">6. torch.compile and Piecewise CUDA Graphs</h2>
<p align="center">
<picture>
<img src="/assets/figures/v1/torch_compile_cuda_graph.png" width="70%" />
</picture>
</p>
<p>V1 leverages vLLM’s <code class="language-plaintext highlighter-rouge">torch.compile</code> integration to automatically optimize the model. This allows V1 to efficiently support a wide variety of models while minimizing the need of writing custom kernels. Furthermore, V1 introduces <em>piecewise CUDA graphs</em> to alleviate the limitations of CUDA graphs. We are preparing dedicated blog posts on the torch.compile integration and piecewise CUDA graphs, so <strong>stay tuned for more updates</strong>!</p>
<h2 id="7-enhanced-support-for-multimodal-llms">7. Enhanced Support for Multimodal LLMs</h2>
<p>vLLM V1 treats multimodal large language models (MLLMs) as first-class citizens and introduces several key improvements in their support.</p>
<p>First, V1 optimizes multimodal input preprocessing by moving it to a non-blocking process. For example, image files (e.g., JPG or PNG) must be converted into tensors of pixel values, cropped, and transformed before being fed into the model. This preprocessing can consume significant CPU cycles, possibly leaving the GPU idle. To address this, V1 offloads the preprocessing task to a separate process, preventing it from blocking the GPU worker, and adds a preprocessing cache so that processed inputs can be reused across requests if they share the same multimodal input.</p>
<p>Second, V1 introduces prefix caching for multimodal inputs. In addition to the hash of token IDs, image hashes are used to identify the KV cache for image inputs. This improvement is especially beneficial for multi-turn conversations that include image inputs.</p>
<p>Third, V1 enables chunked-prefill scheduling for MLLMs with the “encoder cache.” In V0, image inputs and text inputs had to be processed in the same step because the LLM decoder’s image tokens depend on the vision embeddings, which are discarded after the step. With the encoder cache, V1 temporarily stores the vision embeddings, allowing the scheduler to split the text inputs into chunks and process them across multiple steps without needing to regenerate the vision embeddings at every step.</p>
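<p>As a rough sketch of the caching ideas above (the helper names are ours, not vLLM’s), processed pixel tensors can be cached by a content hash of the raw image, and that same hash can be folded into the KV-cache block keys for multimodal requests:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

_preprocess_cache = {}                 # image content hash -> processed pixel values

def preprocess_image(image_bytes, transform):
    """Decode/resize/normalize an image at most once per unique content hash."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _preprocess_cache:
        _preprocess_cache[key] = transform(image_bytes)      # the expensive CPU work
    return key, _preprocess_cache[key]

def multimodal_block_hash(prev_hash, token_chunk, image_hash=None):
    """Image hashes join the token IDs in the prefix-cache key for MLLM requests."""
    return hash((prev_hash, tuple(token_chunk), image_hash))
</code></pre></div></div>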
<h2 id="8-flashattention-3">8. FlashAttention 3</h2>
<p>The final piece of the puzzle for vLLM V1 was integrating <a href="https://arxiv.org/abs/2407.08608">FlashAttention 3</a>. Given the high level of dynamism in V1—such as combining prefill and decode within the same batch—a flexible and high-performance attention kernel was essential. FlashAttention 3 effectively addresses this requirement, offering robust support for a wide range of features while maintaining excellent performance across diverse use cases.</p>
<h1 id="performance">Performance</h1>
<p>Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to <strong>1.7x higher throughput</strong> compared to V0 (<em>without multi-step scheduling</em>).
These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1’s enhanced support for VLMs.</p>
<ul>
<li><strong>Text Models: Llama 3.1 8B & Llama 3.3 70B</strong></li>
</ul>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_llama.png" width="100%" />
</picture>
</p>
<p>We measured the performance of vLLM V0 and V1 on Llama 3.1 8B and Llama 3.3 70B models using the ShareGPT dataset.
V1 demonstrated consistently lower latency than V0 especially at high QPS, thanks to the higher throughput it achieves.
Given that the kernels used for V0 and V1 are almost identical, the performance difference is mainly due to the architectural improvements (reduced CPU overheads) in V1.</p>
<ul>
<li><strong>Vision-language Models: Qwen2-VL</strong></li>
</ul>
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_qwen2vl.png" width="60%" />
</picture>
</p>
<p>We evaluated the performance on VLMs by testing Qwen2-VL using the <a href="https://arxiv.org/abs/2412.08687">VisionArena</a> dataset.
V1 delivered even larger speedups over V0, thanks to its improved VLM support, driven by two key improvements: offloading input processing to a separate process and implementing more flexible scheduling for multimodal queries.
We would also like to point out that prefix caching is now natively supported for multimodal models in V1, but we will skip the benchmark results here.</p>
<ul>
<li><strong>Looking Forward</strong></li>
</ul>
<p>While these improvements are significant, we view them as just the beginning.
The redesigned architecture provides a solid foundation that will enable rapid development of new features.
We look forward to sharing additional enhancements in the coming weeks.
Stay tuned for more updates!</p>
<h1 id="limitations--future-work">Limitations & Future Work</h1>
<p>While vLLM V1 shows promising results, it is still in its alpha stage and lacks several features from V0. Here is where things currently stand:</p>
<p><strong>Model Support:</strong><br />
V1 supports decoder-only Transformers like Llama, mixture-of-experts (MoE) models like Mixtral, and several VLMs such as Qwen2-VL. All quantization methods are supported. However, V1 currently does not support encoder-decoder architectures like multimodal Llama 3.2, Mamba-based models like Jamba, or embedding models. Please check out <a href="https://docs.vllm.ai/en/latest/models/supported_models.html">our documentation</a> for a more detailed list of the supported models.</p>
<p><strong>Feature Limitations:</strong><br />
V1 currently lacks support for log probs and prompt log probs sampling parameters, pipeline parallelism, structured decoding, speculative decoding, Prometheus metrics, and LoRA. We are actively working to close this feature gap and add brand-new optimizations to the V1 engine.</p>
<p><strong>Hardware Support:</strong><br />
V1 currently supports only Ampere or later NVIDIA GPUs. We are actively working to extend support to other hardware backends such as TPU.</p>
<p>Finally, please note that you can continue using V0 and maintain backward compatibility by not setting <code class="language-plaintext highlighter-rouge">VLLM_USE_V1=1</code>.</p>
<h1 id="how-to-get-started">How to Get Started</h1>
<p>To use vLLM V1:</p>
<ol>
<li>Install the latest version of vLLM with <code class="language-plaintext highlighter-rouge">pip install vllm --upgrade</code>.</li>
<li><strong>Set the environment variable <code class="language-plaintext highlighter-rouge">export VLLM_USE_V1=1</code>.</strong></li>
<li>Use vLLM’s <a href="https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic.py">Python API</a> or OpenAI-compatible server (<code class="language-plaintext highlighter-rouge">vllm serve <model-name></code>). You don’t need any change to the existing API.</li>
</ol>
<p>Please try it out and share your feedback!</p>
<h1 id="acknowledgment">Acknowledgment</h1>
<p>We gratefully acknowledge that the design of vLLM V1 builds upon and enhances several open-source LLM inference engines, including <a href="https://github.com/ModelTC/lightllm">LightLLM</a>, <a href="https://github.com/InternLM/lmdeploy">LMDeploy</a>, <a href="https://github.com/sgl-project/sglang">SGLang</a>, <a href="https://github.com/huggingface/text-generation-inference">TGI</a>, and <a href="https://github.com/NVIDIA/TensorRT-LLM">TRT-LLM</a>. These engines have significantly influenced our work, and we have gained valuable insights from them.</p>
<p>The V1 re-architecture is a continued joint effort across the entire vLLM team and community. Below is an incomplete list of contributors to this milestone:</p>
<ul>
<li>UC Berkeley, Neural Magic (now Red Hat), Anyscale, and Roblox mainly drove the effort together.</li>
<li><a href="https://github.com/WoosukKwon">Woosuk Kwon</a> initiated the project and implemented the scheduler and model runner.</li>
<li><a href="https://github.com/robertgshaw2-redhat">Robert Shaw</a> implemented the optimized execution loop and API server.</li>
<li><a href="https://github.com/comaniac">Cody Yu</a> implemented efficient prefix caching for text and image inputs.</li>
<li><a href="https://github.com/ywang96">Roger Wang</a> led the overall enhanced MLLM support in V1.</li>
<li><a href="https://github.com/youkaichao">Kaichao You</a> led the torch.compile integration and implemented the piecewise CUDA graphs.</li>
<li><a href="https://github.com/tlrmchlsmth">Tyler Michael Smith</a> implemented the tensor parallelism support with Python multiprocessing.</li>
<li><a href="https://github.com/ruisearch42">Rui Qiao</a> implemented the tensor parallelism support with Ray and is implementing pipeline parallelism support.</li>
<li><a href="https://github.com/LucasWilkinson">Lucas Wilkinson</a> added support for FlashAttention 3.</li>
<li><a href="https://github.com/alexm-redhat">Alexander Matveev</a> implemented the optimized preprocessor for multimodal inputs and is implementing TPU support.</li>
<li><a href="https://github.com/sroy745">Sourashis Roy</a> implemented the logit penalties in the sampler.</li>
<li><a href="https://github.com/DarkLight1337">Cyrus Leung</a> led the MLLM input processing refactoring effort and helped its integration to V1.</li>
<li><a href="https://github.com/russellb">Russell Bryant</a> addressed several multiprocess-related issues.</li>
<li><a href="https://github.com/njhill">Nick Hill</a> optimized the engine loop and API server.</li>
<li><a href="https://github.com/rickyyx">Ricky Xu</a> and <a href="https://github.com/heheda12345">Chen Zhang</a> helped refactor the KV cache manager.</li>
<li><a href="https://github.com/jeejeelee">Jie Li</a> and <a href="https://github.com/mgoin">Michael Goin</a> helped with MLLM support and optimization.</li>
<li><a href="https://github.com/aarnphm">Aaron Pham</a> is implementing the structured decoding support.</li>
<li><a href="https://github.com/varun-sundar-rabindranath">Varun Sundar Rabindranath</a> is implementing the multi-LoRA support.</li>
<li><a href="https://github.com/afeldman-nm">Andrew Feldman</a> is implementing the log probs and prompt log probs support.</li>
<li><a href="https://github.com/LiuXiaoxuanPKU">Lily Liu</a> is implementing the speculative decoding support.</li>
<li><a href="https://github.com/KuntaiDu">Kuntai Du</a> is implementing the prefill disaggregation and KV Cache transfer support.</li>
<li><a href="https://github.com/simon-mo">Simon Mo</a> and <a href="https://github.com/zhuohan123">Zhuohan Li</a> contributed to the V1 system design.</li>
</ul>]]></content><author><name>vLLM Team</name></author><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.vllm.ai/assets/figures/v1/vLLM_V1_Logo.png" /><media:content medium="image" url="https://blog.vllm.ai/assets/figures/v1/vLLM_V1_Logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”</title><link href="https://blog.vllm.ai/2025/01/21/stack-release.html" rel="alternate" type="text/html" title="High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”" /><published>2025-01-21T00:00:00-08:00</published><updated>2025-01-21T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/21/stack-release</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/21/stack-release.html"><![CDATA[<p><br /></p>
<h2 id="tldr">TL;DR</h2>
<ul>
<li><strong>vLLM</strong> boasts the largest open-source community, but what does it take to transform vLLM from the best single-node LLM engine to a premier LLM serving system?</li>
<li><strong>Today, we release “vLLM production-stack”</strong>, a vLLM-based full inference stack that introduces two major advantages:
<ul>
<li><strong>10x better performance</strong> (3-10x lower response delay & 2-5x higher throughput) with prefix-aware request routing and KV-cache sharing.</li>
<li><strong>Easy cluster deployment</strong> with built-in support for fault tolerance, autoscaling, and observability.</li>
</ul>
</li>
<li>And the best part? It’s <strong>open-source</strong>—so everyone can get started right away! <a href="https://github.com/vllm-project/production-stack">[<strong>https://github.com/vllm-project/production-stack</strong>]</a></li>
</ul>
<h1 id="the-context">The Context</h1>
<!-- Over the past year, LLM inference has raced to the forefront, powering everything from chatbots to code assistants and beyond. It’s quickly becoming critical infrastructure, much like the cloud was to big data, cellular was to mobile apps, and CDNs were (and still are!) to the broader Internet. -->
<p><em>In the AI arms race, it’s no longer just about who has the best model—it’s about <strong>who has the best LLM serving system</strong>.</em></p>
<p><strong>vLLM</strong> has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on <strong>single-node</strong> deployments.</p>
<p>How do we extend its power into a <strong>full-stack</strong> inference system that any organization can deploy at scale with <em>high reliability</em>, <em>high throughput</em>, and <em>low latency</em>? That’s precisely why the LMCache team and the vLLM team built <strong>vLLM production-stack</strong>.</p>
<div align="center">
<img src="/assets/figures/stack/stack-thumbnail.png" alt="Icon" style="width: 60%; vertical-align:middle;" />
</div>
<h1 id="introducing-vllm-production-stack">Introducing “<em>vLLM Production-Stack</em>”</h1>
<p><strong>vLLM Production-stack</strong> is an open-source <strong>reference implementation</strong> of an <strong>inference stack</strong> built on top of vLLM, designed to run seamlessly on a cluster of GPU nodes. It adds four critical functionalities that complement vLLM’s native strengths:</p>
<ul>
<li><strong>KV cache sharing & storage</strong> to speed up inference when context is reused (powered by the <a href="https://github.com/LMCache/LMCache"><strong>LMCache</strong></a> project).</li>
<li><strong>Prefix-aware routing</strong> that sends queries to the vLLM instance already holding the relevant context KV cache.</li>
<li><strong>Observability</strong> of individual engine status and query-level metrics (TTFT, TBT, throughput).</li>
<li><strong>Autoscaling</strong> to handle dynamics of workloads.</li>
</ul>
<h3 id="comparison-with-alternatives">Comparison with Alternatives:</h3>
<p>Below is a quick snapshot comparing vLLM production-stack with its closest counterparts:</p>
<div align="center">
<img src="/assets/figures/stack/stack-table.png" alt="Icon" style="width: 90%; vertical-align:middle;" />
</div>
<h3 id="the-design">The Design</h3>
<p>The vLLM production-stack architecture builds on top of vLLM’s powerful single-node engine to provide a cluster-wide solution.</p>
<p>At a high level:</p>
<ul>
<li>Applications send LLM inference requests.</li>
<li>Prefix-aware routing checks if the requested context is already cached within the memory pool of one instance. It then forwards the request to the node with the pre-computed cache (a minimal routing sketch is shown after the figure below).</li>
<li>Autoscaling and a cluster manager watch the overall load and spin up new vLLM nodes if needed.</li>
<li>Observability modules gather metrics like TTFT (Time-To-First-Token), TBT (Time-Between-Tokens), and throughput, giving you real-time insights into your system’s health.</li>
</ul>
<div align="center">
<img src="/assets/figures/stack/stack-overview-2.png" alt="Icon" style="width: 90%; vertical-align:middle;" />
</div>
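<p>As a minimal, hypothetical sketch of prefix-aware routing (not the production-stack router itself; the chunk size, endpoint URLs, and round-robin fallback are assumptions), a router can remember which prompt-prefix hashes each instance has served and prefer the instance with the longest match:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

CHUNK = 512                            # characters per prefix chunk (illustrative)

class PrefixRouter:
    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.seen = {ep: set() for ep in endpoints}   # endpoint -> served prefix hashes
        self.rr = 0

    def _prefix_hashes(self, prompt):
        return [hashlib.sha256(prompt[:i + CHUNK].encode()).hexdigest()
                for i in range(0, len(prompt), CHUNK)]

    def route(self, prompt):
        hashes = self._prefix_hashes(prompt)
        # Prefer the instance that already holds the longest matching prefix cache.
        best = max(self.endpoints,
                   key=lambda ep: sum(h in self.seen[ep] for h in hashes))
        if not any(h in self.seen[best] for h in hashes):    # no hit: fall back to RR
            best = self.endpoints[self.rr % len(self.endpoints)]
            self.rr += 1
        self.seen[best].update(hashes)
        return best

router = PrefixRouter(["http://vllm-0:8000", "http://vllm-1:8000"])
print(router.route("You are a helpful assistant. " * 40))    # first request: round robin
</code></pre></div></div>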
<h1 id="advantage-1-easy-deployment">Advantage #1: Easy Deployment</h1>
<p>Use the Helm chart to deploy the vLLM production-stack to your k8s cluster by running a single command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo helm repo add llmstack-repo https://lmcache.github.io/helm/ &&\
sudo helm install llmstack llmstack-repo/vllm-stack
</code></pre></div></div>
<p>For more details, please refer to the detailed README in the <a href="https://github.com/vllm-project/production-stack">vLLM production-stack repo</a>. <a href="https://github.com/LMCache/LMStack/tree/main/tutorials">Tutorials</a> on setting up a k8s cluster and customizing Helm charts are also available.</p>
<h1 id="advantage-2-better-performance">Advantage #2: Better Performance</h1>
<p>We conducted a benchmark with a multi-round Q&A workload on the vLLM production-stack and other setups, including vLLM + KServe and a commercial endpoint service.
The results show that the vLLM production-stack outperforms the other setups across key metrics (time to first token and inter-token latency).</p>
<div align="center">
<img src="/assets/figures/stack/stack-ttft.png" alt="Icon" style="width: 60%; vertical-align:middle;" />
</div>
<div align="center">
<img src="/assets/figures/stack/stack-itl.png" alt="Icon" style="width: 60%; vertical-align:middle;" />
</div>
<h1 id="advantage-3-effortless-monitoring">Advantage #3: Effortless Monitoring</h1>
<p>Track your LLM inference cluster in real time with key metrics, including latency distributions, number of requests over time, and KV cache hit rate.</p>
<div align="center">
<img src="/assets/figures/stack/stack-panel.png" alt="Icon" style="width: 70%; vertical-align:middle;" />
</div>
<h2 id="conclusion">Conclusion</h2>
<p>We’re thrilled to unveil <strong>vLLM Production Stack</strong>—the next step in transforming vLLM from a best-in-class single-node engine into a full-scale LLM serving system.
We believe the vLLM production-stack will open new doors for organizations seeking to build, test, and deploy LLM applications at scale without sacrificing performance or simplicity.</p>
<p>If you’re as excited as we are, don’t wait!</p>
<ul>
<li><strong>Clone the repo: <a href="https://github.com/vllm-project/production-stack">https://github.com/vllm-project/production-stack</a></strong></li>
<li><strong>Kick the tires</strong></li>
<li><strong>Let us know what you think!</strong></li>
<li><strong><a href="https://forms.gle/mQfQDUXbKfp2St1z7">Interest Form</a></strong></li>
</ul>
<p>Join us to build a future where every application can harness the power of LLM inference—reliably, at scale, and without breaking a sweat.
<em>Happy deploying!</em></p>
<p>Contacts:</p>
<ul>
<li><strong>vLLM <a href="https://slack.vllm.ai/">slack</a></strong></li>
<li><strong>LMCache <a href="https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ">slack</a></strong></li>
</ul>]]></content><author><name>LMCache Team</name></author><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.vllm.ai/assets/figures/stack/stack-thumbnail.png" /><media:content medium="image" url="https://blog.vllm.ai/assets/figures/stack/stack-thumbnail.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Structured Decoding in vLLM: a gentle introduction</title><link href="https://blog.vllm.ai/2025/01/14/struct-decode-intro.html" rel="alternate" type="text/html" title="Structured Decoding in vLLM: a gentle introduction" /><published>2025-01-14T00:00:00-08:00</published><updated>2025-01-14T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/14/struct-decode-intro</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/14/struct-decode-intro.html"><![CDATA[<p><strong>TL/DR</strong>:</p>
<ul>
<li>Structured decoding allows precise control over LLM output formats</li>
<li>vLLM now supports both <a href="https://github.com/dottxt-ai/outlines">outlines</a> and <a href="https://github.com/mlc-ai/xgrammar">XGrammar</a> backends for structured decoding</li>
<li>Recent XGrammar integration brings up to 5x improvement in time per output token (TPOT) under load</li>
<li>Upcoming v1 release focuses on enhanced performance and schedule-level mask broadcasting for mixed-requests batch support</li>
</ul>
<p><em><a href="https://blog.vllm.ai/2023/06/20/vllm.html">vLLM</a> is the high-throughput and efficient inference engine for running <strong>large-language models</strong> (LLMs). In this post, we will explore the annotated history of language models, describe the current state of structured decoding in vLLM, as well as the recent integration with <a href="https://github.com/vllm-project/vllm/pull/10785">XGrammar</a>, and <a href="https://github.com/vllm-project/vllm/issues/8779">share our tentative roadmap for future improvements</a>.</em></p>
<blockquote>
<p>We would also invite readers to approach this blog post from a philosophical perspective; in the process, we try to posit that structured decoding represents a fundamental shift in how we think about LLM outputs. It also plays an important role in building complex agentic systems.</p>
</blockquote>
<p>For more information about vLLM, please check out our <a href="https://docs.vllm.ai/en/latest/">documentation</a>.</p>
<h2 id="language-models-a-brief-historical-context">Language models: A brief historical context</h2>
<p>In 1950, Alan Turing proposed that a high-speed digital computer, programmed with rules, could exhibit emergent behaviour of intelligence (Turing, 1950). This led to two main approaches in AI development:</p>
<ol>
<li>
<p>Good Old-Fashioned AI (GOFAI): A paradigm quickly emerged among researchers in the 1950s, where expert systems were designed to replicate the decision-making capabilities of a human specialist<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> (or symbolic reasoning system), referred to by Haugeland as Good Old-Fashioned AI (GOFAI) (Haugeland, 1997). However, it quickly ran into funding problems because its semantic representations could not scale up to generalised tasks (also known as the “AI Winter” (Hendler, 2008)).</p>
</li>
<li>
<p>New-Fangled AI (NFAI): Concurrently, Donald Norman’s Parallel Distributed Processing group (Rumelhart et al., 1986) investigated variations of Rosenblatt’s perceptron (Rosenblatt, 1958), proposing <em>hidden layers</em> within the network, alongside inputs and outputs, to extrapolate appropriate responses based on what the network had learned during the training process. These connectionist networks were often built on top of statistical methods<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Given the abundance of data, and with Moore’s Law<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> resulting in an unprecedented amount of available compute, we now see the complete dominance of connectionist networks in both research and production use cases, most notably variants of <em>decoder-only</em> transformers<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> for <em>text-generation</em> tasks. As such, most modern transformer variants are considered <strong>NFAI</strong> systems.</p>
</li>
</ol>
<p>In summary:</p>
<ul>
<li>GOFAI systems are <em>deterministic</em> and rule-based, with their intentionality injected through explicit programming</li>
<li>NFAI systems are often considered “black-box” models (input in, output out) and are data-driven, given the complex, networked nature of their internal representations</li>
</ul>
<h2 id="why-do-we-need-structured-decoding">Why do we need structured decoding?</h2>
<figure>
<img src="/assets/figures/struct-decode-intro/shogoth-gpt.png" />
<figcaption>
Shoggoth as GPTs. In a sense, RLHF, or any post-training method, is an injection of rules (a GOFAI system) into a large compound AI system
</figcaption>
</figure>
<p>LLMs excel at the following heuristic: given a blob of text, the model will generate a contiguous piece of text that it predicts as the most probable tokens. For example, if you give it a Wikipedia article, the model should produce text consistent with the remainder of said article.</p>
<p>These models work well given the following assumption: the input prompt must be coherent and well-structured around the problem the user wants to solve. In other words, LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON: without guidance, it might produce valid text that nonetheless breaks the JSON specification<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>.</p>
<p>This is where structured decoding comes in. It enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.</p>
<p>Companies like OpenAI have recognized this need, implementing features like <a href="https://platform.openai.com/docs/guides/structured-outputs#json-mode">JSON mode</a> to constrain<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> the output format. If you have built with these capabilities before (such as agentic workflows, function calling, or coding assistants), chances are you are already using structured decoding under the hood.</p>
<blockquote>
<p>Guided decoding is to LLMs what <strong>validation</strong> is to APIs - it acts as a guarantee that what comes out matches what you expect. Guided decoding ensures structural integrity, allowing developers to integrate LLMs into their applications with ease!</p>
</blockquote>
<h2 id="structured-decoding-and-vllm">Structured decoding and vLLM</h2>
<p>In simple terms, structured decoding gives LLMs a “template” to follow. Users provide a schema that “influences” the model’s output, ensuring compliance with the desired structure:</p>
<p><img src="/assets/figures/struct-decode-intro/mermaid-intro.svg" alt="top level view of structure decoding" /></p>
<p>From a technical perspective, an inference engine can modify the probability distribution of next tokens by applying a bias (often via logit masks) over all tokens for any given schema. To apply these biases, <a href="https://github.com/dottxt-ai/outlines">outlines</a> proposed guided generation via finite-state machines (FSM) for any given schema (Willard & Louf, 2023). This allows us to track the current state during decoding and filter out invalid tokens by applying a logit bias to the output.</p>
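<p>To make the mechanism concrete, below is a minimal, self-contained sketch of token-level FSM masking. The toy vocabulary, transition table, and helper names are purely illustrative assumptions for this post; they are not the actual outlines or vLLM implementation, which operates on real tokenizer vocabularies and schemas compiled into much larger automata:</p>
<pre><code class="language-python"># Minimal sketch of FSM-guided decoding (illustrative only).
import math

# Toy vocabulary and a toy FSM that only accepts: '{' '"a"' ':' '1' '}'
VOCAB = ['{', '}', '"a"', ':', '1', 'hello']
FSM = {  # state -> {allowed token -> next state}; state 5 is accepting
    0: {'{': 1},
    1: {'"a"': 2},
    2: {':': 3},
    3: {'1': 4},
    4: {'}': 5},
}

def mask_logits(logits, state):
    """Set the logit of every token that is invalid in `state` to -inf."""
    allowed = FSM.get(state, {})
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]

def greedy_decode(raw_logits_per_step):
    state, output = 0, []
    for logits in raw_logits_per_step:
        masked = mask_logits(logits, state)
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        output.append(VOCAB[best])
        state = FSM[state][VOCAB[best]]  # advance the FSM one token per step
    return output

# A fake "model" that would prefer 'hello' everywhere without the mask.
steps = [[0.1, 0.1, 0.2, 0.1, 0.3, 5.0]] * 5
print(greedy_decode(steps))  # ['{', '"a"', ':', '1', '}']
</code></pre>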
<figure>
<img src="/assets/figures/struct-decode-intro/constrained-json-fsm.webp" />
<figcaption>
courtesy of <a href="https://lmsys.org/blog/2024-02-05-compressed-fsm/" target="_blank">LMSys, 2024</a>.
</figcaption>
</figure>
<p><em>In vLLM, you can use this by passing a JSON schema to the sampling params (either through the Python SDK or HTTP requests), as in the sketch below.</em></p>
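<p>As a hedged illustration of the Python SDK path (class and argument names such as <code class="language-plaintext highlighter-rouge">GuidedDecodingParams</code> follow the vLLM structured-outputs documentation at the time of writing and may differ across versions; the model name is only an example):</p>
<pre><code class="language-python"># Sketch: guided JSON decoding through vLLM's offline Python API.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

json_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # any supported model works here
params = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(json=json_schema),
)
outputs = llm.generate("Describe a user named Alice as JSON.", params)
print(outputs[0].outputs[0].text)  # output is constrained to match the schema
</code></pre>
<p>The OpenAI-compatible server exposes the same capability over HTTP through request fields such as <code class="language-plaintext highlighter-rouge">guided_json</code>.</p>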
<blockquote>
<p>Note: in some cases, it can even <a href="https://blog.dottxt.co/coalescence.html">improve</a> the native decoding performance for LLM!</p>
</blockquote>
<h3 id="previous-limitations-in-vllm">Previous limitations in vLLM</h3>
<p>There are a few limitations in vLLM’s current support of the Outlines backend:</p>
<ol>
<li><strong>Slow decoding</strong>: The FSM has to be constructed at the token level, meaning it can only transition the state one token per step. Therefore, the engine can only decode <em>one</em> token at a time, resulting in slow decoding.</li>
<li><strong>Batch processing bottlenecks</strong>: The implementation in <a href="https://github.com/vllm-project/vllm/blob/80c751e7f68ade3d4c6391a0f3fce9ce970ddad0/vllm/model_executor/guided_decoding/outlines_logits_processors.py">vLLM</a> relies heavily on a logit processor<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. As such, it sits on the critical path of the sampling process. In batching use cases, compiling an FSM per request and computing the mask synchronously means that <strong>all requests</strong> in a given batch get blocked, resulting in high time-to-first-token (TTFT) and lower throughput.
<ul>
<li>We found that compiling the FSM is a relatively expensive task, making it a significant contributor to the increased TTFT.</li>
</ul>
</li>
<li><strong>Performance issues with CFG mode</strong>: With the Outlines integration, while JSON mode is relatively fast, CFG mode runs significantly slower and can occasionally <a href="https://github.com/vllm-project/vllm/issues/10081">crash</a> the engine.</li>
<li><strong>Limited advanced feature support</strong>: Techniques like <a href="https://lmsys.org/blog/2024-02-05-compressed-fsm/">jump-forward decoding</a> are currently not possible with the logit-processor approach, because they require prefilling a set of k next tokens, whereas logit processors can only deal with the next token.</li>
</ol>
<h3 id="integration-with-xgrammar">Integration with XGrammar</h3>
<p><a href="https://github.com/mlc-ai/xgrammar">XGrammar</a> introduces a new technique for batched constrained decoding via a pushdown automaton (PDA). You can think of a PDA as a “collection of FSMs, where each FSM represents a context-free grammar (CFG).” One significant advantage of the PDA is its recursive nature, allowing us to execute multiple state transitions. XGrammar also includes additional <a href="https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar">optimisations</a> (for those who are interested) to reduce grammar compilation overhead.</p>
<p>This advancement addresses <strong>limitation (1)</strong> by moving grammar compilation out of Python into C, utilising <code class="language-plaintext highlighter-rouge">pthread</code>. Additionally, XGrammar lays the groundwork for addressing <strong>limitation (4)</strong> in future releases. Below are performance comparisons between the XGrammar and Outlines backends:</p>
<figure>
<img src="/assets/figures/struct-decode-intro/vllm-new-xgrammar.png" />
<img src="/assets/figures/struct-decode-intro/vllm-xgrammar-decode-time-per-output-token.png" />
<figcaption>
courtesy of Michael Goin (Red Hat).
</figcaption>
</figure>
<p>In vLLM’s v0 architecture, we’ve implemented XGrammar as a <a href="https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/guided_decoding/xgrammar_decoding.py">logit processor</a>, optimizing it with caching for tokenizer data. While the performance improvements are encouraging, we believe there’s still significant room for optimization.</p>
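<p>As a brief, hedged sketch, the backend can be selected explicitly through an engine argument (the name <code class="language-plaintext highlighter-rouge">guided_decoding_backend</code> follows the v0 documentation at the time of writing; check your installed version, as defaults and argument names may change):</p>
<pre><code class="language-python"># Sketch: opting into the XGrammar backend for structured decoding.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # any supported model works here
    guided_decoding_backend="xgrammar",   # select the structured-decoding backend
)
params = SamplingParams(
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]),
)
print(llm.generate("Sentiment of 'vLLM is fast':", params)[0].outputs[0].text)
</code></pre>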
<p>There are still a few usability concerns in the XGrammar v0 integration before it reaches feature parity with all use cases:</p>
<ul>
<li>It does not yet support grammars other than the GBNF format (PR on vLLM: <a href="https://github.com/vllm-project/vllm/pull/10870">github</a>)</li>
<li>It does not yet support regex</li>
<li>It does not yet support complex JSON that uses regex patterns or numeric ranges
<ul>
<li>There are a few PRs trying to cover this usage: one <a href="https://github.com/vllm-project/vllm/pull/10899">bugfix PR on vLLM</a> and one <a href="https://github.com/mlc-ai/xgrammar/pull/106">upstream</a></li>
</ul>
</li>
</ul>
<blockquote>
<p>vLLM now has basic support for XGrammar by default. In cases where we know XGrammar is insufficient to serve the request, we fall back to Outlines.</p>
<p>Note that vLLM also includes support for lm-format-enforcer. However, from our testing we found that in some long-context test cases, lm-format-enforcer fails to enforce correct outputs, and its performance is not up to par with Outlines.</p>
</blockquote>
<h2 id="tentative-plans-for-v1">Tentative plans for v1</h2>
<p>With the release of <a href="https://github.com/vllm-project/vllm/issues/8779">v1</a> on the horizon, we’re working on a tentative plan for structured decoding:</p>
<ol>
<li>Moving guided decoding towards the scheduler level:
<ul>
<li>Reason: At the scheduler level, we have more context about which requests use structured decoding, so structured decoding shouldn’t block other requests within the batch (tentatively addressing <strong>limitation (2)</strong>). In a sense, this moves guided decoding off the critical path.</li>
<li>This would also allow for more natural vertical integration with jump-forward decoding (addressing <strong>limitation (4)</strong>).</li>
</ul>
</li>
<li>Allowing bit-mask calculation in one process instead of on each GPU worker
<ul>
<li>Reason: We can compute this bit-mask once and broadcast it to each GPU worker instead of repeating the computation on every worker (see the sketch after this list).</li>
<li>We will carefully analyze the bandwidth implications of broadcasting masks for every sample of every request that uses guided decoding.</li>
</ul>
</li>
<li>Good baseline for speculative decoding and tool-use
<ul>
<li>Reason: XGrammar includes plans to support tool-use, such that we can move away from Python’s <a href="https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/openai/tool_parsers">tool parser</a>.</li>
<li>Tree scoring in speculative decoding can then use the same API as jump-forward decoding (which depends on the integration of guided decoding at the scheduler level).</li>
</ul>
</li>
</ol>
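<p>As a rough illustration of idea (2) above (this is not vLLM’s implementation; the helper below is hypothetical and assumes an already-initialized <code class="language-plaintext highlighter-rouge">torch.distributed</code> process group), the bit-mask would be computed once and then broadcast to the tensor-parallel workers:</p>
<pre><code class="language-python"># Sketch: compute the token bit-mask once on rank 0, broadcast to all workers.
import torch
import torch.distributed as dist

def get_allowed_token_mask(vocab_size: int) -> torch.Tensor:
    # Hypothetical placeholder for grammar/FSM evaluation on the driver process.
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[:16] = True  # pretend only the first 16 tokens are currently valid
    return mask

def broadcast_bitmask(vocab_size: int) -> torch.Tensor:
    if dist.get_rank() == 0:
        mask = get_allowed_token_mask(vocab_size)   # computed once
    else:
        mask = torch.empty(vocab_size, dtype=torch.bool)
    dist.broadcast(mask, src=0)                     # received by every worker
    return mask

def apply_mask(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Each worker applies the same mask before sampling.
    return logits.masked_fill(~mask, float("-inf"))
</code></pre>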
<p><em>NOTE: if you have any more suggestions, we are more than happy to take them into consideration. Consider joining <a href="https://www.notion.so/bentoml/slack.vllm.ai">vLLM slack</a> via <code class="language-plaintext highlighter-rouge">#feat-structured-output</code>.</em></p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>We want to thank the vLLM team, XGrammar team, <a href="https://github.com/aarnphm">Aaron Pham (BentoML)</a>, <a href="https://github.com/mgoin">Michael Goin (Red Hat)</a>, <a href="https://github.com/xuechendi">Chendi Xue (Intel)</a>, and <a href="https://github.com/russellb">Russell Bryant (Red Hat)</a> for their valuable feedback and collaboration on bringing XGrammar to vLLM and the continuous effort to improve structured decoding in vLLM.</p>
<h2 id="references">References</h2>
<ul>
<li>Bahdanau, D., Cho, K., & Bengio, Y. (2016). <em>Neural Machine Translation by Jointly Learning to Align and Translate</em>. arXiv preprint arXiv:1409.0473</li>
<li>Haugeland, J. (1997). <em>Mind Design II: Philosophy, Psychology, and Artificial Intelligence</em>. The MIT Press. <a href="https://doi.org/10.7551/mitpress/4626.001.0001">https://doi.org/10.7551/mitpress/4626.001.0001</a></li>
<li>Hendler, J. (2008). Avoiding Another AI Winter. <em>IEEE Intelligent Systems</em>, <em>23</em>(2), 2–4. <a href="https://doi.org/10.1109/MIS.2008.20">https://doi.org/10.1109/MIS.2008.20</a></li>
<li>Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. <em>Neural Computation</em>.</li>
<li>Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). <em>Scaling Laws for Neural Language Models</em>. arXiv preprint arXiv:2001.08361</li>
<li>Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). <em>Efficient Estimation of Word Representations in Vector Space</em>. arXiv preprint arXiv:1301.3781</li>
<li>Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. <em>Psychological Review</em>, <em>65</em>(6), 386–408. <a href="https://doi.org/10.1037/h0042519">https://doi.org/10.1037/h0042519</a></li>
<li>Rumelhart, D. E., McClelland, J. L., & the PDP Research Group. (1986). <em>Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations</em>. The MIT Press. <a href="https://doi.org/10.7551/mitpress/5236.001.0001">https://doi.org/10.7551/mitpress/5236.001.0001</a></li>
<li>Shortliffe, E. H. (1974). <em>MYCIN: A Rule-Based Computer Program for Advising Physicians Regarding Antimicrobial Therapy Selection</em> (Technical Report STAN-CS-74-465). Stanford University.</li>
<li>Statistical Machine Translation. (n.d.). <em>IBM Models</em>. Statistical Machine Translation Survey. <a href="http://www2.statmt.org/survey/Topic/IBMModels">http://www2.statmt.org/survey/Topic/IBMModels</a></li>
<li>Turing, A. M. (1950). Computing Machinery and Intelligence. <em>Mind</em>, <em>LIX</em>(236), 433–460. <a href="https://doi.org/10.1093/mind/LIX.236.433">https://doi.org/10.1093/mind/LIX.236.433</a></li>
<li>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2023). <em>Attention Is All You Need</em>. arXiv preprint arXiv:1706.03762</li>
<li>Willard, B. T., & Louf, R. (2023). <em>Efficient Guided Generation for Large Language Models</em>. arXiv preprint arXiv:2307.09702</li>
</ul>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Allen Newell and Herbert Simon’s work at RAND initially showed that computers can simulate important aspects of intelligence.</p>
<p>Another notable application was found in the medical domain (Haugeland, 1997). MYCIN, developed at Stanford University in the 1970s, diagnosed and recommended treatments for blood infections (Shortliffe, 1974). MYCIN’s developers recognized the importance of justifying recommendations, implementing what were known as “rule traces” to explain the system’s reasoning in human-understandable terms. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>In the 1990s, IBM released a sequence of complex statistical models trained to perform machine translation <a href="https://en.wikipedia.org/wiki/IBM_alignment_models">tasks</a> (Statistical Machine Translation, n.d.) (see also: this <a href="https://www.cs.cornell.edu/courses/cs5740/2017sp/lectures/08-alignments.pdf">lecture</a> from Cornell).</p>
<p>In 2001, bag-of-words (BoW) model variants were trained on 0.3B tokens and were considered SOTA at the time (Mikolov et al., 2013). These earlier works proved to the research community that statistical modelling triumphs over its symbolic counterpart for language processing, given that it can capture general patterns in large corpora of text. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>In 2017, the landmark paper “Attention Is All You Need” introduced the Transformer architecture (Vaswani et al., 2023) for neural machine translation tasks, which is based on the attention mechanism first proposed by Bahdanau et al. (2016).</p>
<p>OpenAI then introduced the scaling laws for neural language models (Kaplan et al., 2020), which set off the race towards building these systems based on foundational language models. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Prior to attention-based transformers, seq-to-seq models used RNNs given their support for longer context lengths and better memory. However, RNNs are more susceptible to vanishing/exploding gradients compared to feed-forward networks, and so LSTMs (Hochreiter & Schmidhuber, 1997) were proposed to solve this problem. Yet, one of the main problems with LSTMs is that they tend to have poor memory recall for data they have seen many steps earlier.</p>
<p>The Attention paper addresses this problem by encoding additional positional data into the inputs. The paper also proposed an encoder-decoder architecture for translation tasks; however, most text-generation models nowadays are decoder-only, given their superior performance on zero-shot tasks.</p>
<p>One of the many reasons why attention-based transformers work better than LSTMs is that transformers are very scalable and hardware-aware (you can’t just arbitrarily add more LSTM blocks and hope for better long-term retention). For more information, please refer back to the original paper. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>One might argue that we can reliably achieve this through few-shot prompting, i.e. “Give me a JSON that yields the address of users. Example output can be …”. However, there is no guarantee that the generated output is valid JSON, because these models are probabilistic systems: they “sample” the next result based on the distribution of the data they were trained on.</p>
<p>One might also argue that one should use models specifically fine-tuned for JSON output in such cases. However, fine-tuning often requires extensive training and a lot more labor to curate data, monitor progress, and perform evaluation, which is a huge resource cost that not everyone can afford. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Note that the phrases “structured decoding”, “constrained decoding”, and “guided decoding” are used interchangeably; they all refer to the same mechanism of “using a format for the model to structurally sample outputs.” <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>See this <a href="https://huggingface.co/blog/logits-processor-zoo">blog post</a> from HuggingFace for using logit processors to control the generation process. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>]]></content><author><name>Guest Post by BentoML and Red Hat</name></author><summary type="html"><![CDATA[TL/DR:]]></summary></entry><entry><title type="html">vLLM 2024 Retrospective and 2025 Vision</title><link href="https://blog.vllm.ai/2025/01/10/vllm-2024-wrapped-2025-vision.html" rel="alternate" type="text/html" title="vLLM 2024 Retrospective and 2025 Vision" /><published>2025-01-10T00:00:00-08:00</published><updated>2025-01-10T00:00:00-08:00</updated><id>https://blog.vllm.ai/2025/01/10/vllm-2024-wrapped-2025-vision</id><content type="html" xml:base="https://blog.vllm.ai/2025/01/10/vllm-2024-wrapped-2025-vision.html"><![CDATA[<p>The vLLM community achieved remarkable growth in 2024, evolving from a specialized inference engine to become the de facto serving solution for the open-source AI ecosystem. This transformation is reflected in our growth metrics:</p>
<ul>
<li>GitHub stars grew from 14,000 to 32,600 (2.3x)</li>
<li>Contributors expanded from 190 to 740 (3.8x)</li>
<li>Monthly downloads surged from 6,000 to 27,000 (4.5x)</li>
<li>GPU hours increased approximately 10x over the last six months</li>
<li>Explore more usage data at <a href="https://2024.vllm.ai">https://2024.vllm.ai</a></li>
</ul>
<p>vLLM has established itself as the leading open-source LLM serving and inference engine, with widespread adoption in production applications (e.g., powering Amazon Rufus and LinkedIn AI features). Our bi-monthly meetups have become strategic gatherings for partnerships with industry leaders like IBM, AWS, and NVIDIA, marking our progress toward becoming the universal serving solution for the open-source AI ecosystem. Read on for more details about vLLM’s 2024 achievements and 2025 roadmap!</p>
<p><em>This blog is based on the 16th session of the bi-weekly <a href="https://hubs.li/Q02TFDTT0">vLLM Office Hours</a>. Watch the recording <a href="https://www.youtube.com/watch?v=xmz8lHsrbGM">here</a>.</em></p>
<hr />
<h2 id="2024-achievements-scaling-models-hardware-and-features">2024 Achievements: Scaling Models, Hardware, and Features</h2>
<h3 id="community-contributions-and-growth">Community Contributions and Growth</h3>
<figure>
<img src="/assets/figures/vllm-2024-wrapped-2025-roadmap/vllm-contributor-groups.png" />
<figcaption>
vLLM Main Contributor Groups (by Commits)
</figcaption>
</figure>
<p>2024 was an exceptional year for vLLM! Our contribution community has expanded dramatically to include:</p>
<ul>
<li>15+ full-time contributors across 6+ organizations</li>
<li>20+ active organizations as key stakeholders and sponsors</li>
<li>Contributions from top institutions including UC Berkeley, Neural Magic, Anyscale, Roblox, IBM, AMD, Intel, and NVIDIA, as well as individual developers worldwide</li>
<li>A thriving ecosystem connecting model creators, hardware vendors, and optimization developers</li>
<li>Well-attended bi-weekly office hours facilitating transparency, community growth, and strategic partnerships</li>
</ul>
<p>These numbers reflect more than growth—they demonstrate vLLM’s role as critical infrastructure in the AI ecosystem, supporting everything from research prototypes to production systems serving millions of users.</p>
<h3 id="expanding-model-support">Expanding Model Support</h3>
<figure>
<img src="/assets/figures/vllm-2024-wrapped-2025-roadmap/model-architecture-serving-usage.png" />
<figcaption>
Usage by Model Architecture in Serving
</figcaption>
</figure>
<p>At the beginning of 2024, vLLM supported only a handful of models. By year’s end, the project had evolved to support performant inference for almost <a href="https://docs.vllm.ai/en/latest/models/supported_models.html"><strong>100 model architectures</strong></a>, spanning nearly every prominent open-source large language model (LLM), as well as multimodal (image, audio, video), encoder-decoder, speculative decoding, classification, embedding, and reward models. Notably, vLLM introduced production support for state-space language models, exploring the future of non-transformer language models.</p>
<h3 id="broadening-hardware-compatibility">Broadening Hardware Compatibility</h3>
<figure>
<img src="/assets/figures/vllm-2024-wrapped-2025-roadmap/gpu-hours-by-vendor.png" />
<figcaption>
GPU Hours Breakdown by Hardware Vendor
</figcaption>
</figure>
<p>From the initial hardware target of NVIDIA A100 GPUs, vLLM has expanded to support:</p>
<ul>
<li><strong>NVIDIA GPUs:</strong> First-class optimizations for H100, with support for every NVIDIA GPU from V100 and newer.</li>
<li><strong>AMD GPUs:</strong> Support for MI200, MI300, and Radeon RX 7900 series - with rapidly growing adoption for MI300X.</li>
<li><strong>Google TPUs:</strong> Support for TPU v4, v5p, v5e, and the latest v6e.</li>
<li><strong>AWS Inferentia and Trainium:</strong> Support for trn1/inf2 instances.</li>
<li><strong>Intel Gaudi (HPU) and GPU (XPU):</strong> Leveraging Intel GPU and Gaudi architectures for AI workloads.</li>
<li><strong>CPUs:</strong> Featuring support for a growing list of ISAs - x86, ARM, and PowerPC.</li>
</ul>
<p>vLLM’s hardware compatibility has broadened to address diverse user requirements while incorporating performance improvements. Importantly, vLLM is on the path to ensuring that all models work on all hardware platforms, with all optimizations enabled.</p>
<h3 id="delivering-key-features">Delivering Key Features</h3>
<figure>
<img src="/assets/figures/vllm-2024-wrapped-2025-roadmap/quantization-deployment-percentage.png" />
<figcaption>
Increasing Percentage of vLLM Deployments with Quantization
</figcaption>
</figure>
<p>vLLM’s 2024 development roadmap emphasized performance, scalability, and usability:</p>
<ul>
<li><strong>Weight and Activation Quantization:</strong> Added support for diverse quantization methods and kernels, enabling efficient inference across hardware platforms. Notable integrations include activation quantization for FP8+INT8, Marlin+Machete kernels for GPTQ/AWQ/wNa16, FP8 KV Cache, AQLM, QQQ, HQQ, bitsandbytes, and GGUF. Over 20% of vLLM deployments now use quantization.</li>
<li><strong>Automatic Prefix Caching:</strong> Reduced costs and improved latency for context-heavy applications.</li>
<li><strong>Chunked Prefill:</strong> Enhanced stability of inter-token latency for interactive applications.</li>
<li><strong>Speculative Decoding:</strong> Accelerated token generation through simultaneous token prediction and validation, supporting draft models, n-gram matching in prompts, and MLP speculators like Medusa or EAGLE.</li>
<li><strong>Structured Outputs:</strong> Provided high-performance capabilities for applications requiring specific formats like JSON or pydantic schemas.</li>
<li><strong>Tool Calling:</strong> Enabled models with supported chat templates to generate tool calls autonomously, facilitating data processing and agentic flows.</li>
<li><strong>Distributed Inference:</strong> Introduced pipeline parallelism and disaggregated prefill to effectively scale workloads across GPUs and nodes.</li>
</ul>
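<p>Most of the features above are exposed as opt-in engine arguments. As a hedged sketch (argument names follow the vLLM documentation at the time of writing and occasionally change between releases; the model name is only an example):</p>
<pre><code class="language-python"># Sketch: enabling several 2024 features on a single engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    quantization="fp8",              # weight/activation quantization
    enable_prefix_caching=True,      # automatic prefix caching
    enable_chunked_prefill=True,     # chunked prefill for stable inter-token latency
    tensor_parallel_size=2,          # distributed inference across 2 GPUs
)
print(llm.generate("Hello!", SamplingParams(max_tokens=32))[0].outputs[0].text)
</code></pre>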