
Strange Memory Consumption Phenomenon in vLLM #89

Open · whu-dft opened this issue Feb 18, 2025 · 3 comments
whu-dft commented Feb 18, 2025

Today, when I tested inference of Qwen2.5-Math-7B-Instruct on a single card (TP=PP=1), it reported an OOM error.

I'm curious why this happened, because the weights of the 7B model only occupy 14GB of NPU memory, leaving about 50GB free. I then found that the OOM could be avoided by reducing gpu_memory_utilization from 0.96 to 0.8. I still don't understand this, even though I set max_tokens to 1024 in LLM.
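A minimal sketch of the setup described above (the prompt is illustrative and the Hugging Face model id is assumed from the model name; TP=PP=1, max_tokens=1024, and the 0.96 → 0.8 change are the values from the report):

```python
from vllm import LLM, SamplingParams

# Single-card setup (TP=PP=1); lowering gpu_memory_utilization from 0.96
# to 0.8 is what made the OOM disappear in the tests described above.
llm = LLM(
    model="Qwen/Qwen2.5-Math-7B-Instruct",  # assumed HF model id
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.8,  # OOM was observed at 0.96
)

sampling_params = SamplingParams(max_tokens=1024)
outputs = llm.generate(["Solve: 1 + 1 = ?"], sampling_params)
print(outputs[0].outputs[0].text)
```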

We know that in inference mode, memory is mainly occupied by the model weights, activations, and the KV cache. Searching the vLLM docs, I found this:

        gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to
            reserve for the model weights, activations, and KV cache. Higher
            values will increase the KV cache size and thus improve the model's
            throughput. However, if the value is too high, it may cause out-of-
            memory (OOM) errors.

I was confused that decreasing gpu_memory_utilization solved the problem, so I ran some experiments:

  • 7B model, max_tokens = 1K, cpu_offload_gb disabled:

    | gpu_memory_utilization | Weight memory | Configured memory threshold | Total memory usage |
    | --- | --- | --- | --- |
    | 0.2  | 14GB | 12.8GB | OOM    |
    | 0.4  | 14GB | 25.6GB | 36.2GB |
    | 0.6  | 14GB | 38.4GB | 48.3GB |
    | 0.8  | 14GB | 51.2GB | 60.5GB |
    | 0.9  | 14GB | 57.6GB | 63.3GB |
    | 0.95 | 14GB | 60.8GB | OOM    |

  • 7B model, max_tokens = 32K, cpu_offload_gb disabled:

    | gpu_memory_utilization | Weight memory | Configured memory threshold | Total memory usage |
    | --- | --- | --- | --- |
    | 0.2 | 14GB | 12.8GB | OOM    |
    | 0.4 | 14GB | 25.6GB | 36.2GB |
    | 0.6 | 14GB | 38.4GB | 48.3GB |
    | 0.8 | 14GB | 51.2GB | 60.6GB |
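For reference, the "configured memory threshold" column is just total device memory × gpu_memory_utilization; working backwards from the table (12.8GB / 0.2), the card appears to have 64GB, which is an assumption in the arithmetic sketch below:

```python
# Assumed 64GB card, inferred from the table above (12.8GB / 0.2 = 64GB).
TOTAL_MEM_GB = 64

for util, total_used in [(0.4, 36.2), (0.6, 48.3), (0.8, 60.5)]:
    threshold = TOTAL_MEM_GB * util      # what vLLM is allowed to manage
    overhead = total_used - threshold    # memory observed beyond the threshold
    print(f"util={util}: threshold={threshold:.1f}GB, "
          f"observed total={total_used}GB, extra={overhead:.1f}GB")
# Prints a roughly constant ~10GB of extra usage, matching question 2 below.
```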

So my questions are:

  1. What does gpu_memory_utilization actually mean? When I set the value, the real memory occupation is usually higher than the threshold.
  2. What additional memory does vLLM use beyond the gpu_memory_utilization threshold? It is about 10GB regardless of gpu_memory_utilization and max_tokens.
  3. Why does the memory occupation have nothing to do with max_tokens? Even when max_tokens is 32x bigger, the memory is unchanged.
wangxiyuan (Collaborator) commented Feb 19, 2025

Sorry for the late reply. We'll reproduce it first. gpu_memory_utilization is used here to generate num_npu_blocks. There may be a bug here. @MengqingCao Please double check as well. Thanks.
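For context, a simplified sketch of how a profiling step like this can turn gpu_memory_utilization into a block count; this is not the actual vllm-ascend code, and the function name, activation-peak value, and per-block size below are illustrative placeholders:

```python
# Simplified sketch of deriving a KV cache block count from
# gpu_memory_utilization (illustrative only, not the real worker code).
def estimate_num_kv_blocks(total_mem_gb: float,
                           gpu_memory_utilization: float,
                           peak_profile_mem_gb: float,
                           block_mem_gb: float) -> int:
    # Memory budget the engine is allowed to manage on this device.
    budget_gb = total_mem_gb * gpu_memory_utilization
    # Whatever is left after weights + activation peak becomes KV cache.
    kv_cache_gb = budget_gb - peak_profile_mem_gb
    if kv_cache_gb <= 0:
        raise RuntimeError("Not enough memory left for the KV cache (OOM).")
    return int(kv_cache_gb // block_mem_gb)

# Example with the numbers from the tables above: 64GB card, 14GB weights
# plus ~2GB activation peak (illustrative), 16MB per block (illustrative).
print(estimate_num_kv_blocks(64, 0.8, 16.0, 0.016))
```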

whu-dft (Author) commented Feb 19, 2025

Glad to see your reply! Hope to see the progress!

MengqingCao (Contributor) commented:

As in the code @wangxiyuan linked above, gpu_memory_utilization is the fraction of NPU/GPU memory to use for vLLM execution. I simply watched the NPU occupation but could not reproduce your experimental results, so I cannot explain the additional memory occupation in your situation. Maybe you can check whether another process is using the NPU?

About max_tokens: it just refers to the length of the output you expect the model to generate. The memory size of the KV cache is determined by the arg max_model_len; you can check it out at https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker.py#L478-L480
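To make the distinction concrete, a minimal sketch (the model id and numeric values are illustrative): max_model_len is an engine argument that feeds KV cache sizing at startup, while max_tokens is a per-request sampling parameter.

```python
from vllm import LLM, SamplingParams

# max_model_len bounds the sequence length the engine must support and thus
# feeds into KV cache planning when the engine starts.
llm = LLM(
    model="Qwen/Qwen2.5-Math-7B-Instruct",  # assumed HF model id
    max_model_len=4096,          # engine-level: affects KV cache sizing
    gpu_memory_utilization=0.8,
)

# max_tokens only caps how many tokens this particular request may generate;
# it does not change how much memory was reserved at startup.
params = SamplingParams(max_tokens=1024)
outputs = llm.generate(["What is 2 + 2?"], params)
```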

The mem occupation in my situation:

[Screenshot: NPU memory occupation in MengqingCao's environment]
