I am having difficulty finding information about distributed inference across two machines with different GPUs. One has four RTX 6000 GPUs (48 GB VRAM each) and the other has four A5000 GPUs (24 GB VRAM each). Llama 3.1 405B (INT4 AWQ) cannot fit on either machine alone, so I am trying to use both. I was able to start the program on both machines, and NCCL communication works fine. However, the second machine (4x24 GB) immediately reports CUDA OOM. Is there a way to configure the split, or something similar? Or is this configuration not supported yet?
Replies: 1 comment
Currently, multi-node tensor-parallel inference supports mixing different GPUs, but each GPU will only use `min(gpu.memory_capacity for gpu in all_gpus)` of memory. This is because tensor parallelism is a SPMD-style parallelism, so all GPUs must do the same work. To ensure that, the GPUs with higher memory capacity can only use as much GPU memory as the GPUs with the lowest capacity. In your case, it is effectively the same as having 8 GPUs with 24 GB each, which corresponds to 8x24 = 192 GB of memory, and that is not enough to run 405B at INT4.
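Here is a minimal sketch of the arithmetic (not vLLM internals): the usable memory per rank is capped by the smallest GPU, and the INT4 weight footprint is a rough ~0.5 bytes/parameter estimate that ignores KV cache and activation overhead.

```python
# Illustrative sketch, not vLLM code: why 4x48 GB + 4x24 GB behaves like 8x24 GB
# under tensor parallelism, and why that is not enough for 405B at INT4.

gpu_memory_gb = [48, 48, 48, 48, 24, 24, 24, 24]  # 4x RTX 6000 + 4x A5000

# Tensor parallelism is SPMD: every rank holds the same shard size, so the
# usable memory per GPU is capped by the smallest GPU in the group.
usable_per_gpu = min(gpu_memory_gb)                   # 24 GB
total_usable = usable_per_gpu * len(gpu_memory_gb)    # 8 x 24 = 192 GB

# Approximate weight footprint of a 405B-parameter model at INT4
# (~0.5 bytes per parameter), before KV cache and runtime overhead.
weights_gb = 405e9 * 0.5 / 1e9                        # ~202.5 GB

print(f"usable memory: {total_usable} GB, weights alone: ~{weights_gb:.0f} GB")
# Weights alone already exceed the 192 GB usable budget, hence the OOM.
```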