DDP fails on multinode if each node has a different number of GPUs #15695
Unanswered
SerezD asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Hi there,
I'm launching a PyTorch-Lightning script in a multinode environment.
In order to do so, I have followed the suggestions at this link:
https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_intermediate_1.html
I have configured a bash script that gets the number of nodes assigned and the number of GPUs on each node, as well as the master node, master address, and so on.
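For reference, a minimal sketch of that part of the script, assuming a SLURM-like scheduler (the commands and variable names here are illustrative, not the exact ones I use):

```bash
# Illustrative sketch of gathering the job information (assuming SLURM)
NODES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")  # one hostname per line
export MASTER_ADDR=$(echo "$NODES" | head -n 1)         # first node acts as master
export MASTER_PORT=12910                                 # any free port, same on every node
export NUM_NODES=$(echo "$NODES" | wc -l)
GPUS_ON_NODE=$(nvidia-smi --list-gpus | wc -l)           # GPUs visible on the current node
```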
In the bash script I launch the mpirun command on every node, like:
`mpirun -np 1 -H $NODE $EXEC $SCRIPT &`
where `$NODE` is the current node, `$EXEC` is the `python3` command, and `$SCRIPT` is my Python script.
In the Python script (which is launched once per node, so N times in total), I correctly assign the variables:
Then, in the trainer:
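Roughly like this (a minimal sketch; in the real script the numbers are filled in from the scheduler, here they match the 4-node example below):

```python
import pytorch_lightning as pl

# Minimal sketch: devices / num_nodes are filled in from the scheduler in the real script
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,       # GPUs assigned on this node
    num_nodes=4,     # number of nodes in the job
    strategy="ddp",
)
```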
Everything works fine if the scheduler assigns the same number of GPUs to each node
(2 GPUs on each of 4 nodes in the following example, 8 GPUs in total).
The problem arises when the scheduler assigns a different number of GPUs to each node, which of course happens very often since I am not the only person using the cluster.
In the following example, 4 GPUs are assigned across 3 nodes (1, 2, 1).
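My understanding of the resulting rank arithmetic for this (1, 2, 1) assignment, assuming each node derives the world size from its own device count times the number of nodes (a sketch of the arithmetic, not the actual log output):

```python
# Sketch of the per-node arithmetic under the homogeneous-node assumption
num_nodes = 3
gpus_per_node = {0: 1, 1: 2, 2: 1}  # node_rank -> GPUs assigned on that node

for node_rank, devices in gpus_per_node.items():
    world_size = devices * num_nodes  # what this node believes the total is
    global_ranks = [node_rank * devices + local_rank for local_rank in range(devices)]
    print(f"node {node_rank}: world_size={world_size}, global_ranks={global_ranks}")

# Prints:
# node 0: world_size=3, global_ranks=[0]
# node 1: world_size=6, global_ranks=[2, 3]
# node 2: world_size=3, global_ranks=[2]
```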
As you can see, the GLOBAL_RANK and MEMBER values assigned are wrong. In particular, for MEMBER, the node with 2 GPUs assumes a total number of processes of 6 (2 times WORLD_SIZE), while the nodes with one GPU assume a total number of processes of 3 (1 times WORLD_SIZE).
Does anyone know how to solve this?
Is it a problem with DDP?
Thank you