forked from vllm-project/vllm
[DO NOT MERGE] Refactor/aiter integration #76
Draft
vllmellm wants to merge 581 commits into refactor-fp8-linear from refactor/aiter_integration
Commits (581):
6965a39
Fix: Resolve circular import in model_loader/utils.py (#29189)
nandan2003 2d4978a
fix: clean up function never use in setup.py (#29061)
yihong0618 5f7209a
[tiny] Remove unsupported TRITON_MLA backend from batch invariance (#…
bwasti 066209a
[Attention] Refactor FA `block_size` limitations to hybrid models onl…
NickLucche d44a63c
[BugFix] Fix returned logprobs with spec decode + prefill chunking (#…
njhill ae66818
[Misc] Fix pre-commit (#29238)
DarkLight1337 d84d8f4
Fix EVS crash when using `video_embeds` inputs in Qwen2.5-VL (#29232)
skyloevil f55c76c
chore: add RTX_PRO_6000 GLM4.6-FP8 kernel tuning (#29240)
coval3nte 730bd35
[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs wit…
fadara01 d1cf821
[Bugfix] Use HF config fields as fallback when loading Mistral config…
DarkLight1337 eb5352a
[CI/build] Removes source compilation from runtime image (#26966)
bbartels 7df331c
[BugFix] Fix chunked prompt logprobs + preemption (#29071)
njhill df78aee
Refactor: Move CUDA graph dispatch logic earlier (#27382)
yiz-liu 472fdee
[Chore] Update batch invariant code owner (#29246)
yewentao256 4587063
Patch DeepEP when building docker image with CUDA 13 (#29154)
soodoshll 5f96c00
[Fix] Add SM check to flashinfer MOE backend (#29144)
jiahanc 3ed767e
docs: fixes distributed executor backend config for multi-node vllm (…
michaelact 389aa1b
[Doc] Update more docs with respect to V1 (#29188)
DarkLight1337 20ee418
[Model Runner V2] Minor fix for cudagraph_utils (#29256)
WoosukKwon 71362ff
[CI/Build][AMD] Skip test_multi_shared_storage_connector_consistency …
rasmith 3999442
[CI/Build][AMD] Add check for flash_att_varlen_func to test_tree_atte…
rasmith 55c21c8
[ROCm][CI] Fix "Cannot re-initialize CUDA in forked subprocess" in te…
micah-wil 6fb0215
[Bugfix] Use lazy string reference for DeepseekV3Config in config reg…
yongming-qin 7f12c82
[Model Runner V2] Change bookkeeping logic in preparation for spec de…
WoosukKwon b004c00
[Model Runner V2] Support spec decoding [1/N] (#29274)
WoosukKwon 62d54ba
[Model Runner V2] Optimize CUDA graph capture time (#29275)
WoosukKwon 3e1ad40
[Model Runner V2] Add apply_temperature option to gumbel_sample (#29276)
WoosukKwon c309bb5
[Bugfix] Update Gradio OpenAI Chatbot Webserver example to new Gradio…
joshiemoore 1073ba6
[LoRA] Optimize 3D MoE logic (#29222)
jeejeelee 3085478
[Model] Add OpenCUA-7B support (#29068)
lim4349 5253f42
[ROCm] Support for Whisper v1 with Aiter Unified Attention and Aiter …
apinge 0ff7082
[Core] Deprecate `xformers` (#29262)
ywang96 ed40d85
[BugFix] Fix R-VL model loading error (#29299)
faaany 68dfe28
[Feature][Benchmark] add --link-vars can filter when serve_param equa…
lengrongfu 8005e60
[Bugfix][Rocm] Fix shared expert weight loading failure in DeepSeek-M…
zhyajie eca7a8f
[Doc]: fix typos in various files (#29230)
didier-durand 4de8786
[CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x…
R3hankhan123 2601f18
[EPLB] Optimize EPLB for Async Rearrange Experts (#22179)
david6666666 f716a15
Update KServe guide link in documentation (#29258)
terrytangyuan 7a228b5
Add option to use unbacked, and backed size obl dynamic shapes for mo…
laithsakka e48b2e6
[Bugfix] [ROCm] [UX] Reorganize ROCm Backend Selection Logic (#26980)
vllmellm 656516c
[Bugfix] properly handle nested json with llama3 tool parser (#27701)
Aydin-ab e924bbb
[Build/CI][DP/EP] Add QWen/Qwen3-30B-A3B-FP8 + EPLB tests to Nightly …
varun-sundar-rabindranath 26a4655
[NIXL] Use config to enable telemetry + NIXL version bump (#29305)
NickLucche cc313cb
[Model Runner V2] Implement Single-step Eagle 1 (#29300)
WoosukKwon cec418b
[Model Runner V2] Change Numba AoT to JIT (#29328)
WoosukKwon 8f06614
[MoE][Refactor] Make select_experts a non-static method (#29067)
bnellnm 839c6b7
[Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inp…
huachenheli 97588c4
[Model Runner V2] Add minor clarification comments for Eagle (#29332)
WoosukKwon 4d6afca
[CI/Build] Moves to cuda-base runtime image while retaining minimal J…
bbartels 3cfa63a
[XPU]fix Kimi-VL-A3B-thinking on xpu (#29309)
yma11 f32c7d6
[Model Runner V2] Simplify Eagle bookkeeping with num_rejected (#29347)
WoosukKwon 84371da
[Tests] Verify gpt_oss package is installed in harmony tests (#29336)
njhill 4dd42db
Remove VLLM_SKIP_WARMUP tip (#29331)
tlrmchlsmth 71df2a5
[Hybrid Allocator] Better layer padding strategy for gpt-oss eagle (#…
heheda12345 c17610e
[Bugfix] Only use triton_kernels for MXFP4 on SM90 and SM100 (#29339)
mgoin 699bca7
[UX] Raise error for attn backend of batch invariant (#29348)
yewentao256 5f9679a
[Spec Decode] Add support for EAGLE3 heads that do not use_aux_hidden…
hjjq b8328b4
[XPU] upgrade torch & ipex 2.9 on XPU platform (#29307)
jikunshang a178a0b
[BugFix] Fix duplicate id tool-call race condition (#29355)
njhill a4ad43a
Scheduled removal of `ParallelConfig`'s direct child EPLB fields (#29…
hmellor 6f1355a
[Perf] Disable DeepGEMM MoE by default when TP=8 is used (#29346)
mgoin 77e10c9
[Perf][Deepseek] optimize gather_and_maybe_dequant_cache kernel's per…
ganyi1996ppo cb7214d
[ROCm][MLA] enable fp8 MLA decode on ROCm (#28032)
gbyu-amd 22b42b5
[CI][ROCm] Install arctic-inference on ROCm tests (#29344)
divakar-amd 7012d8b
[Docker] Optimize Dockerfile: consolidate apt-get and reduce image si…
princepride 9cf4eda
[Metrics] Scheduled removal of deprecated metrics (#29330)
markmc 87185c8
[Bugfix] Make deprecated `--task embedding` consistent with `--runner…
maryamtahhan 92effb0
[Model] Add HunyuanOCR support (#29327)
Isotr0py 81db702
[Attention] add `_cudagraph_support` for linear attention (#28934)
ZJY0516 2d9ee28
[CI/Test Fix] Fix CP tests on Blackwell (#29338)
LucasWilkinson 316c849
Scheduled removal of `guided_*` config fields (#29326)
hmellor a21256c
Add TP CLI argument to multimodal inference examples (#29301)
faaany ce58fdc
Fix PoolingParams.skip_reading_prefix_cache type (#29364)
kflu 40a6f53
Display warning only when ROCm version is less than Pytorch required …
Inokinoki 7992324
[BugFix] Use unique ids for different transcription prompts (#29372)
njhill 64deead
[Bugfix] [ROCm] [UX]: revert Flex attention backend (#29371)
vllmellm 98caead
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up we…
fadara01 fe3a4f5
[CI/Build] Pin torchgeo dependency for AMD (#29353)
rjrock 888152b
Allow oot custom compiler extension via CompilerInterface (#28623)
wxsIcey f242cfc
[Perf] use cpu all reduce to avoid sync when async_scheduling & dp > …
izhuhaoran 12c007e
EAGLE Support DP>1 (#26086)
Flechman ef1f703
[ROCm][CI] Fix test_cudagraph_mode failure in AMD CI (#29367)
micah-wil 6330f94
[Bugfix] Fix GPT-OSS AR+NORM fusion (#28841)
elvischenv 67fc16c
[Bugfix] If chunked_prefill is disabled, end the scheduling early. (#…
noooop db29061
[Misc] Streamline unique id generation (#29375)
njhill 32c40b9
[BugFix] bad_words filtering ineffective when n > 1 (#29313)
GOavi101 a685b47
[responsesAPI] refactor construct_input_messages (#29359)
qandrew e1dd706
[Frontend] Respect Chat Completion parallel_tool_calls param (#26233)
bbrowning 7a80b01
[CI] Resettle pooling entrypoints tests. (#29370)
noooop de68899
[Misc] Suppress log outputs when constructing the default vllm config…
noooop 798e87d
[Core] Generalize Encoder-Decoder `seq_lens` computation to avoid Whi…
NickLucche c2c661a
[Bugfix] Fix overallocation in MM profiling (#29386)
ywang96 bf0c75c
Make Transformers Nightly tests soft-fail and enable all tests (#29401)
hmellor 51fc9e0
Scheduled removal of `CompilationConfig.use_inductor` (#29323)
hmellor 516c3f7
[Bugfix] Fix logic for choosing default prefix caching setting (#29393)
tdoublep 0231ce8
Revert back to torch.equal over torch.allclose from #28819 (#29086)
eldarkurtic 794029f
[Feature]: Improve GGUF loading from HuggingFace user experience like…
sts07142 dbc3d99
[UX] Put CUDA attention backend selection log into one line (#29337)
mgoin e502098
[Kernel] Add NVFP4 MoE CUTLASS support for SM120 (#29242)
mgoin 48ddb02
[Hybrid Allocator] Support KV cache groups with different block_size …
ivanium a1f2676
Scheduled removal of `override_pooler_config` and `disable_log_reques…
hmellor 0353d2e
Fix RoPE related failures in Transformers nightly tests (#29333)
hmellor b07555d
[responsesAPI][2] parse ResponseFunctionToolCallOutputItem (#29383)
qandrew c32a18c
Attempt to fix GPU OOM in a spec-decoding test (#29419)
eldarkurtic e7d7762
[Compile] Refactor. Move PostGradPassManager out of Compilation confi…
ilmarkov 4e57c65
[Core] Support logprobs with spec decode + async scheduling (#29223)
njhill 0abc794
[caching] Add enable_prompt_embeds and cpu_offload_gb to compile hash…
zhxchen17 7df0289
Change warning logs to debug for unimplemented MXFP4 Linear/Attention…
mgoin de75b0b
[BugFix] Fix initialization of draft model. (#29319)
halyavin d8819c8
fix assertion for single world use case (uni) (#29429)
luccafong 12866af
dummy run corner case (#29433)
xieyangxu 56531b7
[Misc] Add backup hash algorithm for FIPS constrained environments (#…
geodavic 8d6a89d
[UX] Suppress gloo log spam (#29250)
mgoin c5ee430
Bump actions/checkout from 4 to 6 (#29293)
dependabot[bot] 53d7f1f
[Kernel] Use pre-allocated output buffer for triton kernel fused_expe…
xyang16 d9d342d
[Performance][MLA][ROCm] Remove redundant D2D copy in deepseek (#27457)
ganyi1996ppo 452a7c9
[Misc] Allow LM only loading for Pixtral (#29451)
ywang96 e30859d
[Bugfix] Fix handling of image embeds in models (#29480)
DarkLight1337 bb706d6
Fix TeleChatForCausalLM not register issue (#29473)
Yejing-Lai 3650a74
Optimize the wording of the document and unify the terminology and th…
Adityayxt 70d5953
Revert "[Bugfix] Fix GPT-OSS AR+NORM fusion (#28841)" (#29483)
hl475 0b0aa87
[Perf] Optimize batch invariant BMM, 18.1% Throughput improvement, 10…
yewentao256 e603129
[refactor] CTConfig methods to static/class methods (#28870)
HDCharles c4c0354
[CI/Build] allow user modify pplx and deepep ref by ENV or command li…
alec-flowers 430dd4d
[Attention] Remove imports from `vllm/attention/__init__.py` (#29342)
MatthewBonanni 56539cd
[Core] Refactor padding logic and pad for CUDA graphs before attentio…
LucasWilkinson ba1fcd8
[TPU] add tpu_inference (#27277)
jcyang43 df01eda
[Bugfix] Make compressed-tensors MoEs respect ignored layers (#28878)
HDCharles 7774019
[Attention][Async] Eliminate `seq_lens_cpu` in FlashAttention metadat…
MatthewBonanni a67dec7
[Bugfix] fix IMA issue in certain cases of the moe marlin kernel (#28…
jinzhen-lin 9bb33c8
add xpu supported model and model id for cpu (#29380)
louie-tsai 0aeb698
[Model Runner V2] Minor code cleanup (#29570)
WoosukKwon ee80aee
[Model Runner V2] Minor cleanup for build_attn_metadata (#29576)
WoosukKwon da8e1a1
[DOC] Add vLLM Bangkok Meetup info (#29561)
tjtanaa ecb1952
[cpu][fix] Fix Arm CI tests (#29552)
fadara01 11ea5ec
[Model Runner V2] Refactor CudaGraphManager (#29583)
WoosukKwon c069086
[Bugfix] Fix getting device for MoE LoRA (#29475)
jeejeelee 3ecabd0
Fix tpu-inference platform path (#29554)
jcyang43 43c5792
[ROCm][CI] Fix test_cpu_offloading for ROCm (#29548)
micah-wil da3222f
[Model Runner V2] Implement multi-step Eagle with CUDA graph (#29559)
WoosukKwon 00d3310
[Bugfix] Update Ultravox compatibility (#29588)
DarkLight1337 0838b52
[Frontend][torch.compile] CompilationConfig Overhaul (#20283): Set up…
morrison-turnansky 51906c8
[Docs] Improve `priority` parameter documentation (#29572)
maang-h e6d4f3c
[Bugfix] Fix pre-commit (#29601)
DarkLight1337 a5abd1d
[CI] Auto label CPU related issues (#29602)
bigPYJ1151 cf348c8
[Bugfix] Fix HunyuanVL XD-RoPE (#29593)
ywang96 2f5f9ac
[LoRA] Continue optimizing MoE LoRA weight loading (#29322)
jeejeelee 882851d
[CI/Build][Bugfix] Fix auto label issues for CPU (#29610)
bigPYJ1151 bab438f
[CI/Build] Skip ray tests on ROCm (#29556)
rjrock 66d3d54
[Doc]: fixing typos in diverse files (#29492)
didier-durand cd007a5
[bugfix] avoid NIXL_ERR_REMOTE_DISCONNECT in nixl_connector when Pref…
hasB4K fc1d8be
[Attention] Update attention imports (#29540)
MatthewBonanni e1f2623
Update Transformers pin in CI to 4.57.3 (#29418)
hmellor 0840abd
[BugFix] Optional tokenizer argument when loading GGUF models (#29582)
sts07142 ee9841d
[Bugfix] Fix doc build on main (#29619)
DarkLight1337 d45269b
add skip_reading_prefix_cache in repr for PoolingParams (#29620)
guodongxiaren ea228b4
[Misc] Remove unused code from `protocol.py` (#29616)
DarkLight1337 a24ea54
[Deprecation] Advance deprecation status (#29617)
DarkLight1337 38658ec
[Bugfix][MM encoder] Fix ViT attention backend resolving for Turing G…
Isotr0py e5a621b
[CI] Add batched audios Whisper test (#29308)
NickLucche a5345bf
[BugFix] Fix `plan` API Mismatch when using latest FlashInfer (#29426)
askliar ae0ce1b
[Model Runner V2][BugFix] Keep reference to GPU tensors in AsyncOutpu…
WoosukKwon be493e0
[BugFix] Fix new nightly failures (#29578)
LucasWilkinson 35657bc
[CPU]Update CPU PyTorch to 2.9.0 (#29589)
scydas 745a3ba
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 (#28971)
xyang16 18523b8
[Docs] Update supported models for Olmo 3 in tool calling documentati…
wilsonwu c7ba1f6
[BugFix] Fix ValueError in NewRequestData repr methods (#29392)
maang-h 37b15e9
[Multimodal][Speculative Decoding]Eagle3 mm support, enablement on qw…
EanWang211123 f4b7605
Improve enable chunked_prefill & prefix_caching logic. (#26623)
noooop b34e877
Revert "[CPU]Update CPU PyTorch to 2.9.0 (#29589)" (#29647)
DarkLight1337 4805989
[Feature][Bench] Add pareto visualization (#29477)
lengrongfu cc0f2a0
[Doc] Improve abnormal information string (#29655)
maang-h b2c1d29
[BUGFIX] MistralTokenizer._call__ adds an invalid EOS token (#29607)
juliendenize 5f5521b
Fix parameter order in GPT-OSS weight loading function for non-MXFP4 …
qGentry ccbdf51
[Doc] Reorganize benchmark docs (#29658)
DarkLight1337 3cb32e5
[Rocm] Set VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS default is disab…
zhyajie 5c2b5cb
[Docs] Add SPLADE and Ultravox models to supported models documentati…
wilsonwu 33b06a6
[Misc] Remove redundant attention var constants (#29650)
DarkLight1337 953d9c8
[mypy] Pass type checking for `vllm/utils` and `vllm/v1/pool` (#29666)
DarkLight1337 8e7a891
[BugFix] Fix spec decoding max_tokens scheduling perf issue (#29542)
njhill 1168768
[Optimization] Early return for `_apply_matches` and `_iter_placehold…
DarkLight1337 f8151b6
Revert "Supress verbose logs from model_hosting_container_standards (…
HappyAmazonian e2f56c3
[CPU] Update torch 2.9.1 for CPU backend (#29664)
bigPYJ1151 460d8bb
Remove upstream fa checks (#29471)
Victor49152 0808eb8
[Misc] Remove `yapf` directives (#29675)
DarkLight1337 9eec282
Guard FlashInfer sampler using the same check as FlashInfer attention…
hmellor 9e6bcda
[mypy] Enable type checking for more directories (#29674)
DarkLight1337 3bcbb30
add add_truncate_prompt_tokens in repr for PoolingParams (#29683)
guodongxiaren fae6943
[Doc]: fixing typos in multiple files. (#29685)
didier-durand 6f9d81d
[V0 deprecation] Clean up legacy paged attention helper functions (#2…
Isotr0py f946a8d
[Chore]: Reorganize model repo operating functions in `transformers_u…
Isotr0py 4332955
[Docs] Add CLI reference doc for `vllm bench sweep plot_pareto` (#29689)
hmellor d40c854
[CI/Build] Rework CPU multimodal processor test (#29684)
Isotr0py 8d9338f
[Chore] Rename `Processor` to `InputProcessor` (#29682)
DarkLight1337 fecae12
Remove `all_special_tokens_extended` from tokenizer code (#29686)
hmellor 3461e7e
[Frontend] Remap -O to -cc commandline flag (#29557)
gmagogsfm 1986de1
[Perf] Optimize EAGLE prepare_inputs_padded with triton kernels (#28597)
benchislett 7c1ed45
[CI/Build]: make it possible to build with a free-threaded interprete…
rgommers 7675ba3
[Misc] Remove redundant `ClassRegistry` (#29681)
DarkLight1337 a51f418
[Bugfix] fix dots.llm1.inst (#29687)
ZJY0516 3fd1fb0
Revert "[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 (#28971)…
hl475 9726e64
bugfix: correct attn output with base 2 or e (#28840)
staugust 6173682
[compile] Include `enable_sleep_mode` into caching factors. (#29696)
zhxchen17 c625d7b
[Bugfix] Fix O(n²) multimodal string prompt processing (#29667)
mertunsall ea3370b
[ROCm][Bugfix] Patch for the `Multi-Modal Processor Test` group (#29702)
AndreasKaratzas 1dcafb3
[Model Runner V2] Support penalties using bin counts (#29703)
WoosukKwon b2c50ed
[Bugfix] Fix wrong mock attribute (#29704)
DarkLight1337 762a4a6
[Frontend] Perform offline path replacement to `tokenizer` (#29706)
a4lg ca1b1e7
[Model Runner V2] Refactor prefill token preparation (#29712)
WoosukKwon e23f665
[BugFix] Fix DBO failing with TypeError: 'NoneType' object is not ite…
LucasWilkinson 4b17ce6
Add gpu memory wait before test_async_tp (#28893)
angelayi 4a80ad0
[Model Runner V2] Don't use UVA buffer for prefill_len (#29713)
WoosukKwon 39e63de
[LoRA] Cleanup LoRA unused code (#29611)
jeejeelee 6afc0ff
[Model Runner V2] Add sample/ directory and reorganize files (#29719)
WoosukKwon 04a797c
[Doc]: fixing typos in various files. (#29717)
didier-durand f223ed4
[Model Runner V2] Fuse penalties and temperature into single kernel (…
WoosukKwon 34a9842
[Misc] Refactor tokenizer interface (#29693)
DarkLight1337 f4341f4
[Doc]: fix code block rendering (#29728)
dublc ad7f714
hfrunner.classify should return list[list[float]] not list[str] (#29671)
nwaughachukwuma fe3398f
[Chore] Enable passing `tokenizer=None` into MM processor (#29724)
DarkLight1337 fa59fe4
[Chore] Move `detokenizer_utils` to `vllm/tokenizers` (#29727)
DarkLight1337 1656ad3
[Kernel][Quantization] add w4a8 support for marlin kernel (#24722)
jinzhen-lin b9d0504
[Bugfix] Revert test_tokenization.py (#29729)
jeejeelee a491b09
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 (#29708)
xyang16 e1464c3
[Quantization] Enable compressed-tensors AWQ for Turing GPU (#29732)
Isotr0py 82c795d
Fix AttributeError about _use_fi_prefill (#29734)
hl475 66b5840
[Bugfix][sleepmode][fp8 kv cache]: Fix FP8 KV cache + sleep(level=2) …
Flink-ddd 9381b5c
[Doc]: Fix typo in fused_moe layer (#29731)
BowTen 2afcec4
[Misc] Update `TokenizerLike` interface and move `get_cached_tokenize…
DarkLight1337 47539cf
[Bugfix] Fix mismatched nvfp4 gemm output shape (#29742)
Isotr0py 64bc09b
[Core] Enable `inputs_embeds_size` separate from `hidden_size` (#29741)
DarkLight1337 8c363ed
[ROCm][Attention] Sliding window support for `AiterFlashAttentionBack…
ganyi1996ppo cd719de
Fix RoPE failures in Transformers nightly (#29700)
hmellor 39d2810
[Feat] Support non-gated activations in NVFP4 modelopt path (#29004)
omera-nv 21c2627
[Misc]Remove redundant hidden_size property in ModelConfig (#29749)
charlotte12l ec38a73
[Model Runner V2] Use packed mask for prompt bin counts (#29756)
WoosukKwon f72a817
[MoE] CuteDSL MoE with Nvfp4 DeepEP dispatch (#27141)
wenscarl 1ab8fc8
Make PyTorch profiler gzip and CUDA time dump configurable (#29568)
zhangruoxu 83805a6
[CI] Skip paddleocr_vl for transformer 4.57.3 (#29758)
hl475 62de4f4
[Frontend] Resettle pooling entrypoints (#29634)
noooop 014ece9
[Frontend] Add tool filtering support to ToolServer (#29224)
daniel-salib 86e178f
[crashfix] Eagle + multimodal can crash on mm cache miss (#29750)
mickaelseznec f0a28bf
[Misc] Unify tokenizer registration (#29767)
DarkLight1337 f37e893
[XPU] Fix AWQ skipped layer detection in IPEX quantization (#29774)
faaany ad9d656
[multimodal][test] Reduce memory utilization for test_siglip to avoid…
zhxchen17 b95db24
[v1] Add real sliding window calculation to FlexAttention direct Bloc…
Isotr0py 5cfa967
[Bugfix] TypeError: 'NoneType' object is not callable (#29414)
mostrowskix 36db0a3
[CI] Renovation of nightly wheel build & generation (#29690)
Harry-Chen 30624ea
sync upstream
vllmellm
The diff contains two hunks. The first (`@@ -2,14 +2,24 @@`) reworks the module imports — the Int8-only import from `.ScaledMMLinearKernel` is replaced by a combined FP8/Int8 import — and adds a module-level logger:

```python
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from collections.abc import Callable

import torch
from aiter.ops.shuffle import shuffle_weight

from vllm import _custom_ops as ops
from vllm._aiter_ops import rocm_aiter_ops
from vllm.logger import init_logger
from vllm.platforms import current_platform

from .cutlass import CutlassScaledMMLinearKernel
from .ScaledMMLinearKernel import (
    FP8ScaledMMLinearKernel,
    FP8ScaledMMLinearLayerConfig,
    Int8ScaledMMLinearLayerConfig,
)

logger = init_logger(__name__)


class AiterScaledMMLinearKernel(CutlassScaledMMLinearKernel):
```

The second hunk (`@@ -115,3 +125,160 @@`) appends two new FP8 PTPC kernel classes after the existing `AiterScaledMMLinearKernel.apply_weights`:

```python
        # b to be [N, K]
        # CutlassScaledMMLinearKernel prepares weight `w_q` in [K, N] format
        return rocm_aiter_ops.gemm_a8w8(x_q, w_q.t(), x_s, w_s, bias, out_dtype)


class AiterBpreshufflePerTokenFp8ScaledMMLinearKernel(FP8ScaledMMLinearKernel):
    def get_ouput_padding(self) -> int | None:
        # PTPC kernels do not require padding.
        return None

    @classmethod
    def can_implement(cls, c: FP8ScaledMMLinearLayerConfig) -> tuple[bool, str | None]:
        if not current_platform.is_rocm():
            return (False, "AITER bpreshuffle is ROCm-only")

        if not rocm_aiter_ops.is_linear_enabled():
            return (False, "AITER bpreshuffle is disabled by env var")

        try:
            import aiter  # noqa: F401
        except Exception:
            return (False, "AITER not installed")

        # Check if the configuration is PTPC
        is_per_channel_weight = c.weight_quant_key.scale.group_shape.is_per_token()
        is_per_token_activation = (
            c.activation_quant_key.scale.group_shape.is_per_token()
        )
        is_ptpc = is_per_channel_weight and is_per_token_activation

        logger.info_once(f"AiterBpreshuffle: can_implement called. is_ptpc={is_ptpc}")

        if not is_ptpc:
            return (False, "This kernel only handles Per-Token/Per-Channel (PTPC)")

        return True, None

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        logger.info_once("AiterBpreshuffle: SHUFFLING WEIGHTS NOW.")

        w_q, _, _, _ = self._get_layer_params(layer)

        N = w_q.shape[1]
        K = w_q.shape[0]

        if N % 16 == 0 and K % 16 == 0:
```
Inline review comment (PR author) on the shape check above: Add https://github.com/ROCm/vllm/blob/c88d6d2ec7299605bb2ed8a4aee9260d90ef0631/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py#L153 to rocm_aiter_ops and use that to replace this.
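A minimal sketch of what that suggestion could look like: one helper that owns the transpose, the 16x16 divisibility check, and the AITER shuffle, so the kernel makes a single call. The function name and its eventual placement on rocm_aiter_ops are assumptions for illustration, not part of this diff:

```python
import torch
from aiter.ops.shuffle import shuffle_weight


def shuffle_fp8_weight(w_q: torch.Tensor, layout: tuple[int, int] = (16, 16)) -> torch.Tensor:
    """Hypothetical helper that could live on rocm_aiter_ops (illustrative only)."""
    # Loaded weight is [K, N]; AITER's shuffle_weight expects [N, K].
    w_q_nk = w_q.t().contiguous()
    n, k = w_q_nk.shape
    if n % layout[0] != 0 or k % layout[1] != 0:
        raise ValueError(
            f"Weight shape (N={n}, K={k}) not divisible by {layout} for AITER bpreshuffle."
        )
    return shuffle_weight(w_q_nk, layout=layout)
```

With a helper like that, `process_weights_after_loading` would shrink to registering the buffer it returns. The diff continues below.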
```python
            # AITER shuffle_weight expects [N, K]
            w_q_nk = w_q.t().contiguous()

            # Execute shuffle
            shuffled_w_nk = shuffle_weight(w_q_nk, layout=(16, 16))

            del layer.weight
            layer.register_buffer("weight", shuffled_w_nk)

            logger.info_once("AiterBpreshuffle: Weight shuffle COMPLETE.")
        else:
            raise ValueError(
                f"Weight shape (N={N}, K={K}) not divisible by 16 "
                "for AITER bpreshuffle."
            )

    def apply_weights(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: torch.Tensor | None = None,
    ) -> torch.Tensor:
        # 1. Obtain parameters
        w_q, w_s, x_s, x_s_ub = self._get_layer_params(layer)
        # 2. Dynamically quantize the input
        qinput, qinput_scale = self.quant_fp8(x, x_s, x_s_ub)

        logger.info_once(
            "AiterBpreshuffle: apply_weights... ABOUT TO CALL C++ KERNEL..."
        )

        output = rocm_aiter_ops.gemm_a8w8_bpreshuffle(
            qinput,
            w_q,  # Pre-shuffled weight in [N, K]
            out_dtype=self.config.out_dtype,
            scale_a=qinput_scale,
            scale_b=w_s,
        )

        logger.info_once("AiterBpreshuffle: C++ KERNEL CALL SUCCEEDED.")

        if bias is not None:
            output.add_(bias)
        return output

    def get_scaled_mm_func(self) -> Callable[..., torch.Tensor]:
        return rocm_aiter_ops.gemm_a8w8_bpreshuffle


class AiterCKPerTokenFp8ScaledMMLinearKernel(FP8ScaledMMLinearKernel):
    """
    AITER PTPC kernel (gemm_a8w8_CK) without pre-shuffling.
    """

    def get_ouput_padding(self) -> int | None:
        return None

    @classmethod
    def can_implement(cls, c: FP8ScaledMMLinearLayerConfig) -> tuple[bool, str | None]:
        if not current_platform.is_rocm():
            return (False, "AITER CK is ROCm-only")

        if not rocm_aiter_ops.is_linear_enabled():
            return (False, "AITER CK is disabled by env var")

        try:
            import aiter  # noqa: F401
        except Exception:
            return (False, "AITER not installed")

        is_per_channel_weight = c.weight_quant_key.scale.group_shape.is_per_token()
        is_per_token_activation = (
            c.activation_quant_key.scale.group_shape.is_per_token()
        )
        is_ptpc = is_per_channel_weight and is_per_token_activation

        logger.info_once(f"AiterCK: can_implement called. is_ptpc={is_ptpc}")

        if not is_ptpc:
            return (False, "This kernel only handles Per-Token/Per-Channel (PTPC)")

        return True, None

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        logger.info_once(
            "AITER CK: process_weights_after_loading... DOING NOTHING (pass)."
        )

    def apply_weights(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: torch.Tensor | None = None,
    ) -> torch.Tensor:
        w_q, w_s, x_s, x_s_ub = self._get_layer_params(layer)

        qinput, qinput_scale = self.quant_fp8(x, x_s, x_s_ub)

        logger.info_once(
            "AiterCK: apply_weights... "
            "ABOUT TO CALL C++ KERNEL (this is where it hangs)..."
        )

        output = rocm_aiter_ops.gemm_a8w8(
            qinput, w_q.t(), qinput_scale, w_s, bias, self.config.out_dtype
        )

        logger.info_once("AiterCK: C++ KERNEL CALL SUCCEEDED.")
        return output

    def get_scaled_mm_func(self) -> Callable[..., torch.Tensor]:
        return rocm_aiter_ops.gemm_a8w8
```
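Both new classes follow the same `can_implement` contract as the existing kernel: a classmethod returning `(ok, reason)`. As a rough illustration of how a caller consumes that contract — the selector below is a hypothetical sketch using the types imported in the diff, not vLLM's actual dispatch code:

```python
def choose_fp8_scaled_mm_kernel(
    candidates: list[type[FP8ScaledMMLinearKernel]],
    config: FP8ScaledMMLinearLayerConfig,
) -> type[FP8ScaledMMLinearKernel]:
    """Hypothetical selector: pick the first kernel whose can_implement() passes."""
    failures: list[str] = []
    for kernel_cls in candidates:
        ok, reason = kernel_cls.can_implement(config)
        if ok:
            return kernel_cls
        failures.append(f"{kernel_cls.__name__}: {reason}")
    raise ValueError("No FP8 scaled_mm kernel is usable: " + "; ".join(failures))
```

Under that sketch, passing `[AiterBpreshufflePerTokenFp8ScaledMMLinearKernel, AiterCKPerTokenFp8ScaledMMLinearKernel]` would prefer the pre-shuffled kernel and fall back to the CK one.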
Inline review comment: Check if fp8_linear is initialised.
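A small sketch of the kind of guard that comment asks for, assuming the kernel lazily creates an `fp8_linear` helper before use; the attribute name and wording are illustrative, since the relevant code is not shown in this diff:

```python
def ensure_fp8_linear_initialised(kernel) -> None:
    """Illustrative guard only; real attribute names depend on the surrounding code."""
    if getattr(kernel, "fp8_linear", None) is None:
        raise RuntimeError(
            "fp8_linear is not initialised; set it up (e.g. in "
            "process_weights_after_loading) before calling apply_weights."
        )
```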