[BugFix] Add int8 cache dtype when using Ascend attention quantization #125
Closed
Conversation
Signed-off-by: hw_whx <[email protected]>
…ject#20) This PR tries to register mindie_turbo while initializing NPUWorker. The register function is added into a new file named utils.py --------- Signed-off-by: hw_whx <[email protected]> Co-authored-by: hw_whx <[email protected]>
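A hedged sketch of what such a registration hook could look like (the helper name `register_mindie_turbo` and its call site are illustrative, not the PR's exact code):

```python
# utils.py (sketch): import mindie_turbo if it is installed so its import-time
# hooks can register optimized kernels; stay silent when it is absent.
def register_mindie_turbo() -> None:
    try:
        import mindie_turbo  # noqa: F401
    except ImportError:
        # mindie_turbo is optional; vllm-ascend still works without it.
        pass
```

The hook would be called once while `NPUWorker` initializes.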
…ease Signed-off-by: hw_whx <[email protected]>
Signed-off-by: hw_whx <[email protected]>
[Hardware][Ascend] Add silu_and_mul/rope; Add mix ops into attention layer
This PR adds the Ascend quantization interface to vllm-ascend, including the AscendQuantConfig class, which inherits from vllm's QuantizationConfig class; the AscendLinearMethod class, which inherits from vllm's LinearMethodBase class; and the AscendQuantizer class, which dispatches to the corresponding quantization methods. --------- Signed-off-by: angazenn <[email protected]> Co-authored-by: angazenn <[email protected]>
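The relationships among these classes could be sketched as follows (import paths follow recent vLLM layouts and may differ by version; bodies are elided):

```python
from vllm.model_executor.layers.linear import LinearMethodBase
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig


class AscendQuantConfig(QuantizationConfig):
    """Parses and holds the Ascend quantization description (skeleton)."""


class AscendLinearMethod(LinearMethodBase):
    """Runs quantized linear layers with Ascend kernels (skeleton)."""


class AscendQuantizer:
    """Dispatches a quantization scheme to its concrete Ascend method."""
```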
…ect#28) cherry-pick from c59375c Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Some PRs for plugin support are not merged into vllm yet. This PR adds monkey patches to vllm-ascend so that it works with vllm directly. The patch code should be removed once the related functionality is supported by vllm natively. Cherry-pick to 0.7.1 Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
Fix packages so the submodule can be found; see vllm-project#42 Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? This PR updates vllm-ascend's torch-npu dependency version, so that vllm-ascend can be installed in a later-version environment (like torch-npu 2.6.0rc1). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI test Signed-off-by: ji-huazhong <[email protected]> Signed-off-by: angazenn <[email protected]>
Fix the communicator patch for distributed inference. We should patch `GroupCoordinator` on its defining module, just before initializing the distributed env, so that the patch isn't shadowed by the import of `init_distributed_environment` in `worker.py`. Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
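The ordering matters because `from vllm.distributed import init_distributed_environment` binds names at import time; a sketch of the module-level patch (the subclass name is illustrative):

```python
import vllm.distributed.parallel_state as parallel_state


class AscendGroupCoordinator(parallel_state.GroupCoordinator):
    """Communicator with Ascend-specific behavior (illustrative)."""


# Patch the defining module, not a local alias, and do it *before*
# init_distributed_environment() runs, so every later lookup sees the patch.
parallel_state.GroupCoordinator = AscendGroupCoordinator
```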
vllm-project#54) ### What this PR does / why we need it? In open-r1, the rank 0 process will create an LLM instance and load the model to `npu:7`. We need to force the output tensor to be created on the same device as the query tensor. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test by main branch Signed-off-by: angazenn <[email protected]>
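In other words, derive the output tensor's device from the query tensor rather than relying on the process's default device; a minimal sketch:

```python
import torch


def alloc_attn_output(query: torch.Tensor) -> torch.Tensor:
    # empty_like inherits device (e.g. "npu:7") and dtype from query, so the
    # output always lands on the same device as the attention inputs.
    return torch.empty_like(query)
```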
…ize to 128 in platform.py Signed-off-by: hw_whx <[email protected]> Signed-off-by: angazenn <[email protected]>
Signed-off-by: hw_whx <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? Backport main docs to make CI happy: ``` cp -r ../vllm-ascend-main/docs ./ cp ../vllm-ascend-main/README* ./ cp ../vllm-ascend-main/.readthedocs.yaml ./ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed ``` cp -r ../vllm-ascend-main/docs ./ cp ../vllm-ascend-main/README* ./ cp ../vllm-ascend-main/.readthedocs.yaml ./ ``` no diff Signed-off-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? Backport vllm-project#64 to v0.7.1-dev branch Add container image build ci: - Enable branch, tag docker image publish - branch image: `vllm-ascend:main`, `vllm-ascend:v0.7.1-dev` - tag image: `vllm-ascend:v0.7.1rc1` - Enable PR docker image build check - other changes: - Prepare the `REPO_OWNER` because ghcr requires lowercase - Add `Free up disk space` step to avoid `No space left on device` like vllm-project#27 - Setup qemu with image to resolve docker/setup-qemu-action#198 ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? build: CI passed --------- Signed-off-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? Add vllm-ascend tutorials for v0.7.1. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? Refactor the installation doc; backport vllm-project#80 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI, preview Signed-off-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
This PR adds attention quantization interfaces, including the AscendQKVQuantAttentionMethod class, which inherits from the BaseKVCacheMethod class. --------- Signed-off-by: angazenn <[email protected]> Co-authored-by: angazenn <[email protected]> Signed-off-by: angazenn <[email protected]>
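A skeleton of the new interface, under the same import-path caveat as the sketch above:

```python
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod


class AscendQKVQuantAttentionMethod(BaseKVCacheMethod):
    """Quantizes attention inputs and the KV cache for Ascend (skeleton)."""
```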
### What this PR does / why we need it? Update tutorials. Backport vllm-project#79 ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? ci. Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: Yikun Jiang <[email protected]> Co-authored-by: Shanshan Shen <[email protected]> Signed-off-by: angazenn <[email protected]>
cherry-pick from vllm-project#59 Signed-off-by: wangxiyuan <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? - Set default model to Qwen2.5-0.5B-Instruct in example - Remove Ultravox 0.3 because it is not tested currently Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
Add NPU implementation for FusedMoE Signed-off-by: YHT <[email protected]> Co-authored-by: YHT <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? To adapt to the MLA structure of vLLM DeepSeek on Ascend hardware, this PR adds the AscendMLAAttentionBackendImpl class. ### Does this PR introduce _any_ user-facing change? Users can set VLLM_MLA_DISABLE to 1 or 0 to disable or enable MLA. ### How was this patch tested? Signed-off-by: YHT <[email protected]> Co-authored-by: YHT <[email protected]> Signed-off-by: angazenn <[email protected]>
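For example, MLA can be toggled through the environment variable before vLLM initializes:

```python
import os

# "1" disables MLA (falling back to the standard attention path); "0" enables it.
os.environ["VLLM_MLA_DISABLE"] = "1"
```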
cherry-pick from vllm-project#90 Add dynamic version in docs Signed-off-by: Yikun Jiang <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
1. Update CANN image name 2. Add pta install step 3. Update vllm-ascend docker image name to ghcr 4. Update quick_start to use the vllm-ascend image directly 5. Fix `note` style. Cherry-pick from vllm-project#85 Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
…blems. (vllm-project#95) Fix an accuracy problem caused by a missing contiguous() call on the value tensor. Signed-off-by: hw_whx <[email protected]> Co-authored-by: hw_whx <[email protected]> Signed-off-by: angazenn <[email protected]>
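The fix is essentially a one-liner, sketched here with assumed variable names:

```python
import torch


def ensure_contiguous_value(value: torch.Tensor) -> torch.Tensor:
    # The Ascend attention kernel assumes contiguous memory; a strided
    # (non-contiguous) value tensor silently degraded accuracy.
    return value.contiguous()
```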
Final update for the 0.7.1.rc1 release. 1. Update the version in the docs 2. Update the Dockerfile Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
Update feature and model lists Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
change docker registry to quay Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
update feature support plan Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
Fix a bug caused by a parameter name that was not updated. Signed-off-by: YHT <[email protected]> Co-authored-by: YHT <[email protected]> Signed-off-by: angazenn <[email protected]>
Don't login docker registry in pull request. Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
update model list Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Ascend attention quantization requires an int8 KV cache dtype. The cache dtype is used when initializing `CacheConfig`:

```python
if cache_config.cache_dtype == "auto":
    self.dtype = model_config.dtype
else:
    self.dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]
```

`STR_DTYPE_TO_TORCH_DTYPE` is defined in `vllm.utils`:

```python
STR_DTYPE_TO_TORCH_DTYPE = {
    "half": torch.half,
    "bfloat16": torch.bfloat16,
    "float": torch.float,
    "fp8": torch.uint8,
    "fp8_e4m3": torch.uint8,
    "fp8_e5m2": torch.uint8,
}
```

Hence we need to update both `cache_dtype` and `STR_DTYPE_TO_TORCH_DTYPE`.
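A minimal sketch of the fix described above, assuming the two snippets shown (only the `"int8"` entry is new, and `"int8"` must also be accepted wherever vLLM validates cache dtype strings):

```python
import torch

# vllm.utils (sketch): map the new "int8" cache dtype string to a torch dtype.
STR_DTYPE_TO_TORCH_DTYPE = {
    "half": torch.half,
    "bfloat16": torch.bfloat16,
    "float": torch.float,
    "fp8": torch.uint8,
    "fp8_e4m3": torch.uint8,
    "fp8_e5m2": torch.uint8,
    "int8": torch.int8,  # added for Ascend attention quantization
}
```

With this entry in place, `cache_dtype="int8"` falls through the `else` branch of `CacheConfig` shown above and resolves to `torch.int8`.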