Nixl optimization for llama4 local attention #87

mgoin · 2025-05-15T14:35:25Z

No description provided.

ekagra-ranjan and others added 30 commits May 14, 2025 12:31

[V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model (vllm-project#17326)

418d2f8

Co-authored-by: root <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]>

Modularize fused experts and integrate PPLX kernels (vllm-project#15956)

f9c069c

[CI] Disable Failing Tests (vllm-project#18165)

8568650

Local attention optimization for NIXL

7e55a34

Signed-off-by: mgoin <[email protected]>

Clean up a lot!

8ea467d

Signed-off-by: mgoin <[email protected]>

Small opt

73a8272

Signed-off-by: mgoin <[email protected]>

[Frontend] decrease import time of vllm.multimodal (vllm-project#18031)

749f792

Co-authored-by: Aaron Pham <[email protected]>

[Kernel] Have rotary embeddings support tensors (vllm-project#18046)

d93c976

Signed-off-by: Lucas Wilkinson <[email protected]>

[V1] Structured Outputs + Thinking compatibility (vllm-project#16577)

2fc9075

Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Russell Bryant <[email protected]>

Fix mypy

17cc4c9

Signed-off-by: mgoin <[email protected]>

Add support for loading torchao models with AOPerModuleConfig (vllm-project#17826)

7974736

Signed-off-by: Jerry Zhang <[email protected]>

[CI] Fix race condition in test_kv_cache_events test (vllm-project#18169)

78aa341

Signed-off-by: Russell Bryant <[email protected]>

[V1] Support multiple kv connectors (vllm-project#17564)

2142035

Signed-off-by: mgoin <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]>

Upload vllm index for the rc builds (vllm-project#18173)

09f106a

[Bugfix]: make most of test_openai_schema.py pass (vllm-project#17664)

f25e0d1

[v1] Support multiple KV cache groups in GPU model runner (vllm-project#17945)

e60f550

Signed-off-by: Chen Zhang <[email protected]>

[V1][Metrics] Remove unused code (vllm-project#18158)

65334ef

Signed-off-by: Mark McLoughlin <[email protected]>

[Chore] astral's ty (vllm-project#18116)

afe3236

Signed-off-by: Aaron Pham <[email protected]>

[Misc] add lobe-chat support (vllm-project#18177)

2dff093

Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>

[Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm (vllm-project#18154)

83f74c6

Signed-off-by: Luka Govedič <[email protected]>

Update deprecated type hinting in models (vllm-project#18132)

26d0419

Signed-off-by: Harry Mellor <[email protected]>

[Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 (vllm-project#18013)

e6b8e65

Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]>

Support custom implementations of VideoLoader backends. (vllm-project#18091)

4f07a64

[UT] Add ut for none hash (vllm-project#17892)

420caf7

Signed-off-by: Andy Xie <[email protected]>

[Model] Allow the use of sliding window in Qwen2 (vllm-project#17772)

dd2a945

Signed-off-by: inkcherry <[email protected]>

[Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends (vllm-project#18178)

70f8b96

Signed-off-by: Mengqing Cao <[email protected]>

[CI] don't skip fixed test_kv_cache_events() (vllm-project#18183)

de71fec

Signed-off-by: David Xia <[email protected]>

[V1] Update zmq socket creation in nixl connector (vllm-project#18148)

a8f5aec

Signed-off-by: Russell Bryant <[email protected]>

fix: typos (vllm-project#18151)

a9944aa

Signed-off-by: omahs <[email protected]>

Update deprecated type hinting in model_loader (vllm-project#18130)

07ad271

Signed-off-by: Harry Mellor <[email protected]>

schoennenbeck and others added 25 commits May 15, 2025 09:00

[Frontend] Fix chat template content format detection (vllm-project#18190)

2aa5470

Signed-off-by: Sebastian Schönnenbeck <[email protected]>

[Bugfix]Change the exception thrown by call_hf_processor from RuntimeError to ValueError (vllm-project#18181)

fadb8d5

Signed-off-by: Abatom <[email protected]>

[Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 (vllm-project#18205)

9254052

Signed-off-by: tjtanaa <[email protected]>

[Misc] Avoid cuda graph log when sizes still match (vllm-project#18202)

e3f3aee

Signed-off-by: NickLucche <[email protected]>

Adding "AMD: Tensorizer Test" to amdproduction. (vllm-project#18216)

0b34593

[Bugfix] Fix test_eagle test (vllm-project#18223)

8795eb9

Signed-off-by: Lucia Fang <[email protected]>

[Build] Allow shipping PTX on a per-file basis (vllm-project#18155)

c7852a6

Signed-off-by: Lucas Wilkinson <[email protected]>

[Bugfix] fix rotary embedding test for _get_padded_tensor_shape (vllm-project#18229)

4e1c6a0

Signed-off-by: Lucas Wilkinson <[email protected]>

[Bugfix][ROCm] Use chunked_prefill_paged_decode as fallback for V1 attention on ROCm (vllm-project#18093)

ee659e3

Signed-off-by: kf <[email protected]>

[Model] vLLM v1 supports Medusa (vllm-project#17956)

f4937a5

Signed-off-by: lisiqi23 <[email protected]> Signed-off-by: skylee-01 <[email protected]> Co-authored-by: lisiqi23 <[email protected]>

Allow users to pass arbitrary JSON keys from CLI (vllm-project#18208)

b18201f

Signed-off-by: Harry Mellor <[email protected]>

Throw better error for when running into k8s service discovery issue (vllm-project#18209)

6b31c84

Signed-off-by: Will Eaton <[email protected]>

[Feature] Support Pipeline Parallism in torchrun SPMD offline inference for V1 (vllm-project#17827)

3d2779c

Signed-off-by: Lucia Fang <[email protected]>

[doc] fix multimodal example script (vllm-project#18089)

5c04bb8

Signed-off-by: David Xia <[email protected]>

[PERF] Speed up Qwen2.5-VL model by speed up rotary position embedding (vllm-project#17973)

67da572

Signed-off-by: Vadim Gimpelson <[email protected]>

[Misc] Add Ray Prometheus logger to V1 (vllm-project#17925)

5418176

Signed-off-by: Seiji Eicher <[email protected]>

[Misc] Consolidate Audio tests into multimodal common generation tests (vllm-project#18214)

390ec88

Signed-off-by: Isotr0py <[email protected]>

use ceil_div in cutlass block scaling shape check (vllm-project#17918)

e23564c

[Fix] Fix typo in resolve_hf_chat_template (vllm-project#18259)

a5f8c11

Signed-off-by: Felix Marty <[email protected]>

[Model] Use autoweightloader for dbrx (vllm-project#18251)

87d8714

Signed-off-by: learner0810 <[email protected]>

[Misc][MacOS] fix bfloat16 error (vllm-project#18249)

d3d91b6

Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>

[BugFix] Fix multi async save in MultiConnector (vllm-project#18246)

1db4f47

Signed-off-by: Nick Hill <[email protected]>

Add TODO to remove

af2f264

Signed-off-by: mgoin <[email protected]>

Merge branch 'main' into nixl-l4-opt

e8fd2f1

Bug fixes

7f0ef82

Signed-off-by: mgoin <[email protected]>

mgoin requested review from youkaichao, njhill, robertgshaw2-redhat, tlrmchlsmth and alexm-redhat as code owners May 16, 2025 17:44
