[BugFix] Add int8 cache dtype when using Ascend attention quantization #125
Closed
Conversation
Signed-off-by: hw_whx <[email protected]>
…ject#20) This PR tries to register mindie_turbo while initializing NPUWorker. The register function is added into a new file named utils.py --------- Signed-off-by: hw_whx <[email protected]> Co-authored-by: hw_whx <[email protected]>
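A hedged sketch of what such a registration hook could look like (the helper name `register_mindie_turbo` and its call site are illustrative, not the PR's exact code):

```python
# utils.py (sketch): import mindie_turbo if it is installed so its import-time
# hooks can register optimized kernels; stay silent when it is absent.
def register_mindie_turbo() -> None:
    try:
        import mindie_turbo  # noqa: F401
    except ImportError:
        # mindie_turbo is optional; vllm-ascend still works without it.
        pass
```

The hook would be called once while `NPUWorker` initializes.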
…ease Signed-off-by: hw_whx <[email protected]>
Signed-off-by: hw_whx <[email protected]>
[Hardware][Ascend] Add silu_and_mul/rope; Add mix ops into attention layer
This PR adds the Ascend quantization interface to vllm-ascend, including the AscendQuantConfig class, which inherits from vllm's QuantizationConfig class; the AscendLinearMethod class, which inherits from vllm's LinearMethodBase class; and the AscendQuantizer class, which dispatches to the corresponding quantization methods. --------- Signed-off-by: angazenn <[email protected]> Co-authored-by: angazenn <[email protected]>
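The relationships among these classes could be sketched as follows (import paths follow recent vLLM layouts and may differ by version; bodies are elided):

```python
from vllm.model_executor.layers.linear import LinearMethodBase
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig


class AscendQuantConfig(QuantizationConfig):
    """Parses and holds the Ascend quantization description (skeleton)."""


class AscendLinearMethod(LinearMethodBase):
    """Runs quantized linear layers with Ascend kernels (skeleton)."""


class AscendQuantizer:
    """Dispatches a quantization scheme to its concrete Ascend method."""
```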
…ect#28) cherry-pick from c59375c Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Some PRs for plugin support are not merged into vllm yet. This PR adds monkey patches to vllm-ascend so that it works with vllm directly. The patch code should be removed once the related functionality is supported by vllm natively. Cherry-pick to 0.7.1 Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
Fix packages so the submodule can be found; see vllm-project#42 Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? This PR updates vllm-ascend's torch-npu dependency version, so that vllm-ascend can be installed in a later-version environment (like torch-npu 2.6.0rc1). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI test Signed-off-by: ji-huazhong <[email protected]> Signed-off-by: angazenn <[email protected]>
Fix the communicator patch for distributed inference. We should patch `GroupCoordinator` on its defining module, just before initializing the distributed env, so that the patch isn't shadowed by the import of `init_distributed_environment` in `worker.py`. Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
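The ordering matters because `from vllm.distributed import init_distributed_environment` binds names at import time; a sketch of the module-level patch (the subclass name is illustrative):

```python
import vllm.distributed.parallel_state as parallel_state


class AscendGroupCoordinator(parallel_state.GroupCoordinator):
    """Communicator with Ascend-specific behavior (illustrative)."""


# Patch the defining module, not a local alias, and do it *before*
# init_distributed_environment() runs, so every later lookup sees the patch.
parallel_state.GroupCoordinator = AscendGroupCoordinator
```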
vllm-project#54) ### What this PR does / why we need it? In open-r1, the rank 0 process will create an LLM instance and load the model to `npu:7`. We need to force the output tensor to be created on the same device as the query tensor. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test by main branch Signed-off-by: angazenn <[email protected]>
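In other words, derive the output tensor's device from the query tensor rather than relying on the process's default device; a minimal sketch:

```python
import torch


def alloc_attn_output(query: torch.Tensor) -> torch.Tensor:
    # empty_like inherits device (e.g. "npu:7") and dtype from query, so the
    # output always lands on the same device as the attention inputs.
    return torch.empty_like(query)
```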
…ize to 128 in platform.py Signed-off-by: hw_whx <[email protected]> Signed-off-by: angazenn <[email protected]>
Signed-off-by: hw_whx <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? Backport main docs to make CI happy: ``` cp -r ../vllm-ascend-main/docs ./ cp ../vllm-ascend-main/README* ./ cp ../vllm-ascend-main/.readthedocs.yaml ./ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed ``` cp -r ../vllm-ascend-main/docs ./ cp ../vllm-ascend-main/README* ./ cp ../vllm-ascend-main/.readthedocs.yaml ./ ``` no diff Signed-off-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? Backport vllm-project#64 to v0.7.1-dev branch Add container image build ci: - Enable branch, tag docker image publish - branch image: `vllm-ascend:main`, `vllm-ascend:v0.7.1-dev` - tag image: `vllm-ascend:v0.7.1rc1` - Enable PR docker image build check - other changes: - Prepare the `REPO_OWNER` because ghcr requires lowercase - Add `Free up disk space` step to avoid `No space left on device` like vllm-project#27 - Setup qemu with image to resolve docker/setup-qemu-action#198 ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? build: CI passed --------- Signed-off-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? Add vllm-ascend tutorials for v0.7.1. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? Refactor the installation doc; backport vllm-project#80 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI, preview Signed-off-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
This PR adds attention quantization interfaces, including the AscendQKVQuantAttentionMethod class, which inherits from the BaseKVCacheMethod class. --------- Signed-off-by: angazenn <[email protected]> Co-authored-by: angazenn <[email protected]> Signed-off-by: angazenn <[email protected]>
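A skeleton of the new interface, under the same import-path caveat as the sketch above:

```python
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod


class AscendQKVQuantAttentionMethod(BaseKVCacheMethod):
    """Quantizes attention inputs and the KV cache for Ascend (skeleton)."""
```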
### What this PR does / why we need it? Update tutorials. Backport vllm-project#79 ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? ci. Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: Yikun Jiang <[email protected]> Co-authored-by: Shanshan Shen <[email protected]> Signed-off-by: angazenn <[email protected]>
cherry-pick from vllm-project#59 Signed-off-by: wangxiyuan <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? - Set default model to Qwen2.5-0.5B-Instruct in example - Remove Ultravox 0.3 because it is not tested currently Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
Add NPU implementation for FusedMoE Signed-off-by: YHT <[email protected]> Co-authored-by: YHT <[email protected]> Signed-off-by: angazenn <[email protected]>
### What this PR does / why we need it? To adapt to the MLA structure of vLLM DeepSeek on Ascend hardware, this PR adds the AscendMLAAttentionBackendImpl class. ### Does this PR introduce _any_ user-facing change? Users can set VLLM_MLA_DISABLE to 1 or 0 to disable or enable MLA. ### How was this patch tested? Signed-off-by: YHT <[email protected]> Co-authored-by: YHT <[email protected]> Signed-off-by: angazenn <[email protected]>
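For example, MLA can be toggled through the environment variable before vLLM initializes:

```python
import os

# "1" disables MLA (falling back to the standard attention path); "0" enables it.
os.environ["VLLM_MLA_DISABLE"] = "1"
```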
cherry-pick from vllm-project#90 Add dynamic version in docs Signed-off-by: Yikun Jiang <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Signed-off-by: angazenn <[email protected]>
1. Update CANN image name 2. Add pta install step 3. Update vllm-ascend docker image name to ghcr 4. Update quick_start to use the vllm-ascend image directly 5. Fix `note` style. Cherry-pick from vllm-project#85 Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
…blems. (vllm-project#95) Fix an accuracy problem caused by a missing contiguous() call on the value tensor. Signed-off-by: hw_whx <[email protected]> Co-authored-by: hw_whx <[email protected]> Signed-off-by: angazenn <[email protected]>
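The fix is essentially a one-liner, sketched here with assumed variable names:

```python
import torch


def ensure_contiguous_value(value: torch.Tensor) -> torch.Tensor:
    # The Ascend attention kernel assumes contiguous memory; a strided
    # (non-contiguous) value tensor silently degraded accuracy.
    return value.contiguous()
```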
Final update for the 0.7.1.rc1 release. 1. Update the version in the docs 2. Update the Dockerfile Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
Update feature and model lists Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
change docker registry to quay Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
update feature support plan Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
Fix a bug caused by a parameter name that was not updated. Signed-off-by: YHT <[email protected]> Co-authored-by: YHT <[email protected]> Signed-off-by: angazenn <[email protected]>
Don't login docker registry in pull request. Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: angazenn <[email protected]>
update model list Signed-off-by: MengqingCao <[email protected]> Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Signed-off-by: angazenn <[email protected]>
Ascend attention quantization requires an int8 KV cache dtype. The cache dtype is used when initializing `CacheConfig`:

```python
if cache_config.cache_dtype == "auto":
    self.dtype = model_config.dtype
else:
    self.dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]
```

`STR_DTYPE_TO_TORCH_DTYPE` is defined in `vllm.utils`:

```python
STR_DTYPE_TO_TORCH_DTYPE = {
    "half": torch.half,
    "bfloat16": torch.bfloat16,
    "float": torch.float,
    "fp8": torch.uint8,
    "fp8_e4m3": torch.uint8,
    "fp8_e5m2": torch.uint8,
}
```

Hence we need to update both `cache_dtype` and `STR_DTYPE_TO_TORCH_DTYPE`.
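A minimal sketch of the fix described above, assuming the two snippets shown (only the `"int8"` entry is new, and `"int8"` must also be accepted wherever vLLM validates cache dtype strings):

```python
import torch

# vllm.utils (sketch): map the new "int8" cache dtype string to a torch dtype.
STR_DTYPE_TO_TORCH_DTYPE = {
    "half": torch.half,
    "bfloat16": torch.bfloat16,
    "float": torch.float,
    "fp8": torch.uint8,
    "fp8_e4m3": torch.uint8,
    "fp8_e5m2": torch.uint8,
    "int8": torch.int8,  # added for Ascend attention quantization
}
```

With this entry in place, `cache_dtype="int8"` falls through the `else` branch of `CacheConfig` shown above and resolves to `torch.int8`.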