Today when I tested inference of Qwen2.5-Math-7B-Instruct on one card (TP=PP=1), it reported an OOM error.
I'm curious why this happened, because the weights of the 7B model only occupy 14GB of NPU memory, leaving about 50GB free. I then found that the OOM could be avoided by reducing gpu_memory_utilization from 0.96 to 0.8. I still don't understand this, even though I set max_tokens to 1024 in LLM.
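For reference, here is a minimal sketch of the kind of script that hits this (the prompt and the sampling values other than max_tokens are illustrative, not the exact script from this run):

```python
from vllm import LLM, SamplingParams

# Single-card setup (TP=PP=1); gpu_memory_utilization=0.96 OOMs, 0.8 does not.
llm = LLM(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.96,  # lowering this to 0.8 avoids the OOM
)

# max_tokens is only 1024, yet whether we OOM depends on gpu_memory_utilization.
params = SamplingParams(max_tokens=1024, temperature=0.0)
outputs = llm.generate(["What is 2 + 2?"], params)
print(outputs[0].outputs[0].text)
```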
We know that in inference mode, memory is mainly occupied by the model weights, activations, and the KV cache. Searching the vLLM docs, I found this:
> gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors.
I was confused that decreasing gpu_memory_utilization solved the problem, so I did some experiments:
7B model, max_tokens = 1K, with cpu_offload_gb disabled:

| gpu_memory_utilization | Weight memory | Configured memory threshold | Total memory usage |
|---|---|---|---|
| 0.2 | 14GB | 12.8GB | OOM |
| 0.4 | 14GB | 25.6GB | 36.2GB |
| 0.6 | 14GB | 38.4GB | 48.3GB |
| 0.8 | 14GB | 51.2GB | 60.5GB |
| 0.9 | 14GB | 57.6GB | 63.3GB |
| 0.95 | 14GB | 60.8GB | OOM |
7B model, max_tokens = 32K, with cpu_offload_gb disabled:

| gpu_memory_utilization | Weight memory | Configured memory threshold | Total memory usage |
|---|---|---|---|
| 0.2 | 14GB | 12.8GB | OOM |
| 0.4 | 14GB | 25.6GB | 36.2GB |
| 0.6 | 14GB | 38.4GB | 48.3GB |
| 0.8 | 14GB | 51.2GB | 60.6GB |
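A quick sanity check on the "configured memory threshold" column: every value matches gpu_memory_utilization times the card's total memory. A minimal arithmetic sketch, assuming a 64GB device (which is what the thresholds imply):

```python
# Assumption: the card has 64GB of device memory; the thresholds in the tables
# are simply gpu_memory_utilization * total device memory.
TOTAL_MEM_GB = 64

for util in (0.2, 0.4, 0.6, 0.8, 0.9, 0.95):
    threshold_gb = util * TOTAL_MEM_GB
    print(f"gpu_memory_utilization={util:<4} -> threshold {threshold_gb:.1f}GB")
# Prints 12.8GB, 25.6GB, 38.4GB, 51.2GB, 57.6GB, 60.8GB, matching the tables above.
```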
So my questions are:

1. What does gpu_memory_utilization actually mean? When I set the value, the real memory occupation is usually higher than the threshold.
2. What is the additional memory used by vLLM beyond the gpu_memory_utilization threshold? It is about 10GB, regardless of gpu_memory_utilization and max_tokens.
3. Why does the memory occupation have nothing to do with max_tokens? Even when max_tokens is 32x bigger, the memory is unchanged. (See the rough per-token estimate sketch below.)
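For context on question 3, a rough back-of-the-envelope estimate of the KV cache one sequence would need (the model config numbers below are my assumptions for Qwen2.5-7B: 28 layers, 4 KV heads, head dim 128, bf16; they are not measurements from the runs above):

```python
# Rough per-token KV-cache size for an assumed Qwen2.5-7B-like config.
num_layers, num_kv_heads, head_dim, dtype_bytes = 28, 4, 128, 2  # bf16

per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
for max_tokens in (1024, 32 * 1024):
    gb = per_token_bytes * max_tokens / 1024**3
    print(f"max_tokens={max_tokens}: ~{gb:.2f} GB of KV cache for one sequence")
# ~0.05 GB at 1K tokens and ~1.75 GB at 32K tokens under these assumptions,
# i.e. far smaller than the tens of GB difference I see in the tables.
```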
Sorry for the late reply. We'll reproduce it first. gpu_memory_utilization is used here to generate num_npu_blocks. It may be a bug here. @MengqingCao Please double check as well. Thanks.
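Roughly, the block count is derived from the memory budget like this (a simplified sketch of the general vLLM approach, not the actual vllm-ascend code; the function and parameter names are illustrative):

```python
# Simplified illustration, not the real implementation: the KV-cache block count
# comes out of the gpu_memory_utilization budget after subtracting the weights
# and the peak activation memory observed during the profiling forward pass.
def estimate_num_npu_blocks(
    total_mem_gb: float,           # total device memory, e.g. 64
    gpu_memory_utilization: float,
    weights_gb: float,             # e.g. ~14 for a 7B model in bf16
    peak_activation_gb: float,     # measured by a profiling run
    block_bytes: int,              # bytes per KV-cache block (block_size tokens, all layers)
) -> int:
    budget_gb = total_mem_gb * gpu_memory_utilization
    kv_cache_gb = budget_gb - weights_gb - peak_activation_gb
    return max(0, int(kv_cache_gb * 1024**3 // block_bytes))

# Example with made-up activation and block-size numbers:
print(estimate_num_npu_blocks(64, 0.8, 14, 3, 57344 * 16))
```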
As shown in the code @wangxiyuan linked above, gpu_memory_utilization is the fraction of NPU/GPU memory to use for the vLLM execution. I simply watched the NPU occupation but could not reproduce your experimental results. Thus I cannot explain the additional memory occupation in your situation; maybe you can check whether another process is using the NPU?