
Distributed inference of 405B model with GPUs of different VRAM #897

hrukalive asked this question in Q&A
Answered by Ying1123

Multi-node tensor-parallel inference does support mixing GPUs with different VRAM, but it will only use `min(gpu.memory_capacity for gpu in all_gpus)` of memory per GPU. Tensor parallelism is an SPMD-style parallelism, so every GPU must perform the same work on an identically sized shard. To keep all ranks identical, the GPUs with higher memory capacity can only use as much memory as the smallest GPU in the group.
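
A minimal sketch of this constraint (illustrative only, not SGLang's actual memory accounting; the mix of 24 GB and 48 GB cards below is hypothetical):

```python
# Illustrative sketch, not SGLang internals: under SPMD tensor parallelism
# every rank holds an identically sized shard, so the usable memory per GPU
# is capped by the smallest GPU in the group.
def effective_tp_memory_gb(gpu_memory_gb: list[float]) -> float:
    per_gpu = min(gpu_memory_gb)          # smallest card sets the per-rank limit
    return per_gpu * len(gpu_memory_gb)   # total memory the group can actually use

# Hypothetical mix of 24 GB and 48 GB cards: behaves like eight 24 GB cards.
print(effective_tp_memory_gb([24, 24, 24, 24, 48, 48, 48, 48]))  # 192
```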

In your case, it is effectively the same as having 8 GPUs with 24 GB each, i.e. 8 × 24 = 192 GB of total memory, which is not enough to run the 405B model in int4.
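
A rough back-of-the-envelope check of the weight footprint alone (weights only, ignoring KV cache, activations, and runtime overhead; 405e9 is an approximate parameter count):

```python
# Rough estimate of the weight memory for a 405B-parameter model at int4.
params = 405e9
bytes_per_param = 0.5                                 # int4 = 4 bits = 0.5 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights vs 192 GB usable")  # ~202 GB, already over budget
```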
