Today when I tested inference of Qwen2.5-Math-7B-Instruct on one card (TP=PP=1), it reported an OOM error.
I'm curious why this happened, because the weights of the 7B model only occupy 14GB of NPU memory, leaving about 50GB free. I then found that the OOM could be avoided by reducing gpu_memory_utilization from 0.96 to 0.8. I still don't understand this, even though I set max_tokens to 1024 in LLM.
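For reference, here is a minimal sketch of the kind of script that hits this (the prompt and the sampling values other than max_tokens are illustrative, not the exact script from this run):

```python
from vllm import LLM, SamplingParams

# Single-card setup (TP=PP=1); gpu_memory_utilization=0.96 OOMs, 0.8 does not.
llm = LLM(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.96,  # lowering this to 0.8 avoids the OOM
)

# max_tokens is only 1024, yet whether we OOM depends on gpu_memory_utilization.
params = SamplingParams(max_tokens=1024, temperature=0.0)
outputs = llm.generate(["What is 2 + 2?"], params)
print(outputs[0].outputs[0].text)
```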
We know that in inference mode, memory is mainly occupied by the model weights, activations, and the KV cache. Searching the vLLM docs, I found this:
> gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors.
I was confused that decreasing gpu_memory_utilization solved the problem, so I did some experiments:
7B model, max_tokens = 1K, with cpu_offload_gb disabled:

| gpu_memory_utilization | Weight memory | Configured memory threshold | Total memory usage |
|---|---|---|---|
| 0.2 | 14GB | 12.8GB | OOM |
| 0.4 | 14GB | 25.6GB | 36.2GB |
| 0.6 | 14GB | 38.4GB | 48.3GB |
| 0.8 | 14GB | 51.2GB | 60.5GB |
| 0.9 | 14GB | 57.6GB | 63.3GB |
| 0.95 | 14GB | 60.8GB | OOM |
7B model, max_tokens = 32K, with cpu_offload_gb disabled:

| gpu_memory_utilization | Weight memory | Configured memory threshold | Total memory usage |
|---|---|---|---|
| 0.2 | 14GB | 12.8GB | OOM |
| 0.4 | 14GB | 25.6GB | 36.2GB |
| 0.6 | 14GB | 38.4GB | 48.3GB |
| 0.8 | 14GB | 51.2GB | 60.6GB |
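A quick sanity check on the "configured memory threshold" column: every value matches gpu_memory_utilization times the card's total memory. A minimal arithmetic sketch, assuming a 64GB device (which is what the thresholds imply):

```python
# Assumption: the card has 64GB of device memory; the thresholds in the tables
# are simply gpu_memory_utilization * total device memory.
TOTAL_MEM_GB = 64

for util in (0.2, 0.4, 0.6, 0.8, 0.9, 0.95):
    threshold_gb = util * TOTAL_MEM_GB
    print(f"gpu_memory_utilization={util:<4} -> threshold {threshold_gb:.1f}GB")
# Prints 12.8GB, 25.6GB, 38.4GB, 51.2GB, 57.6GB, 60.8GB, matching the tables above.
```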
So my questions are:

1. What does gpu_memory_utilization actually mean? When I set the value, the real memory occupation is usually higher than the threshold.
2. What is the additional memory used by vLLM beyond the gpu_memory_utilization threshold? It is about 10GB, regardless of gpu_memory_utilization and max_tokens.
3. Why does the memory occupation have nothing to do with max_tokens? Even when max_tokens is 32x bigger, the memory is unchanged. (See the rough per-token estimate sketch below.)
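For context on question 3, a rough back-of-the-envelope estimate of the KV cache one sequence would need (the model config numbers below are my assumptions for Qwen2.5-7B: 28 layers, 4 KV heads, head dim 128, bf16; they are not measurements from the runs above):

```python
# Rough per-token KV-cache size for an assumed Qwen2.5-7B-like config.
num_layers, num_kv_heads, head_dim, dtype_bytes = 28, 4, 128, 2  # bf16

per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
for max_tokens in (1024, 32 * 1024):
    gb = per_token_bytes * max_tokens / 1024**3
    print(f"max_tokens={max_tokens}: ~{gb:.2f} GB of KV cache for one sequence")
# ~0.05 GB at 1K tokens and ~1.75 GB at 32K tokens under these assumptions,
# i.e. far smaller than the tens of GB difference I see in the tables.
```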
Sorry for the late reply. We'll reproduce it first. gpu_memory_utilization is used here to generate num_npu_blocks. It may be a bug here. @MengqingCao Please double check as well. Thanks.
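Roughly, the block count is derived from the memory budget like this (a simplified sketch of the general vLLM approach, not the actual vllm-ascend code; the function and parameter names are illustrative):

```python
# Simplified illustration, not the real implementation: the KV-cache block count
# comes out of the gpu_memory_utilization budget after subtracting the weights
# and the peak activation memory observed during the profiling forward pass.
def estimate_num_npu_blocks(
    total_mem_gb: float,           # total device memory, e.g. 64
    gpu_memory_utilization: float,
    weights_gb: float,             # e.g. ~14 for a 7B model in bf16
    peak_activation_gb: float,     # measured by a profiling run
    block_bytes: int,              # bytes per KV-cache block (block_size tokens, all layers)
) -> int:
    budget_gb = total_mem_gb * gpu_memory_utilization
    kv_cache_gb = budget_gb - weights_gb - peak_activation_gb
    return max(0, int(kv_cache_gb * 1024**3 // block_bytes))

# Example with made-up activation and block-size numbers:
print(estimate_num_npu_blocks(64, 0.8, 14, 3, 57344 * 16))
```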
As shown in the code @wangxiyuan linked above, gpu_memory_utilization is the fraction of NPU/GPU memory to use for the vLLM execution. I simply watched the NPU occupation but could not reproduce your experimental results. Thus I cannot explain the additional memory occupation in your situation; maybe you can check whether another process is using the NPU?