I am having difficulty finding information about distributed inference across two machines with different GPUs. One has four RTX 6000 GPUs (48 GB VRAM each) and the other has four A5000 GPUs (24 GB VRAM each). Llama 3.1 405B (INT4 AWQ) cannot fit on either machine alone, so I am trying to use both. I was able to start the program on both machines, and NCCL communication works fine. However, the second machine (4x24 GB) immediately reports CUDA OOM. Is there a way to configure the split, or something similar? Or is this configuration not supported yet?
Replies: 1 comment
Currently, multi-node tensor-parallel inference supports mixing different GPUs, but each GPU will only use `min(gpu.memory_capacity for gpu in all_gpus)` of memory. This is because tensor parallelism is a SPMD-style parallelism, so all GPUs must do the same work. To ensure that, the GPUs with higher memory capacity can only use as much GPU memory as the GPUs with the lowest capacity. In your case, it is effectively the same as having 8 GPUs with 24 GB each, which corresponds to 8x24 = 192 GB of memory, and that is not enough to run 405B at INT4.
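Here is a minimal sketch of the arithmetic (not vLLM internals): the usable memory per rank is capped by the smallest GPU, and the INT4 weight footprint is a rough ~0.5 bytes/parameter estimate that ignores KV cache and activation overhead.

```python
# Illustrative sketch, not vLLM code: why 4x48 GB + 4x24 GB behaves like 8x24 GB
# under tensor parallelism, and why that is not enough for 405B at INT4.

gpu_memory_gb = [48, 48, 48, 48, 24, 24, 24, 24]  # 4x RTX 6000 + 4x A5000

# Tensor parallelism is SPMD: every rank holds the same shard size, so the
# usable memory per GPU is capped by the smallest GPU in the group.
usable_per_gpu = min(gpu_memory_gb)                   # 24 GB
total_usable = usable_per_gpu * len(gpu_memory_gb)    # 8 x 24 = 192 GB

# Approximate weight footprint of a 405B-parameter model at INT4
# (~0.5 bytes per parameter), before KV cache and runtime overhead.
weights_gb = 405e9 * 0.5 / 1e9                        # ~202.5 GB

print(f"usable memory: {total_usable} GB, weights alone: ~{weights_gb:.0f} GB")
# Weights alone already exceed the 192 GB usable budget, hence the OOM.
```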