Error in tqdm when using SLURM cluster and multiple GPUs. #15854
Unanswered
rmchurch asked this question in DDP / multi-GPU / multi-node
I am trying to run PyTorch Lightning on a SLURM cluster with 4 GPUs per node. I set up my SLURM script to run on a single node with 4 processes (1 GPU per process; see script below). I get an error from the tqdm progress bar that only appears during multi-GPU training (single-GPU training works fine).
I have tried using the Rich progress bar instead, but get no output. This seems to be a known tqdm issue (tqdm/tqdm#624), but since several people apparently do run DDP with Lightning on SLURM clusters, I assume it is either an issue specific to my cluster or that there are other workarounds. I would be fine with the progress bar output simply being logged to the SLURM output file.
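One possible workaround along those lines, sketched below, is to disable the built-in tqdm bar entirely and log progress as plain lines from rank zero, which SLURM then captures in the job's output file. This is only an illustration against pytorch-lightning 1.6.x: the `LogProgress` callback and its `log_every_n_batches` parameter are invented for this sketch, and hook signatures can differ slightly between Lightning versions.

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_info


class LogProgress(Callback):
    """Hypothetical stand-in for the tqdm bar: emits plain log lines that
    end up in the SLURM job's output file instead of a live progress bar."""

    def __init__(self, log_every_n_batches: int = 50):
        self.log_every_n_batches = log_every_n_batches

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if (batch_idx + 1) % self.log_every_n_batches == 0:
            # rank_zero_info prints from global rank 0 only, so the four DDP
            # processes do not write interleaved lines.
            rank_zero_info(
                f"epoch {trainer.current_epoch} "
                f"batch {batch_idx + 1}/{trainer.num_training_batches}"
            )


# Disable the default tqdm bar and attach the logging callback instead.
trainer = pl.Trainer(enable_progress_bar=False, callbacks=[LogProgress()])
```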
pytorch-lightning 1.6.5
tqdm 4.49.0
mymain.py
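The SLURM script and mymain.py themselves are not reproduced above. As a rough sketch only (partition names, resource values, and the placeholder comments are assumptions, not the author's actual files), a single-node, 4-GPU DDP launch like the one described typically pairs SBATCH directives requesting one task per GPU with a Trainer configured for 4 devices:

```python
# Assumed SLURM directives (one srun task per GPU); the author's real script may differ:
#
#   #SBATCH --nodes=1
#   #SBATCH --ntasks-per-node=4
#   #SBATCH --gres=gpu:4
#   srun python mymain.py
#
import pytorch_lightning as pl

# With one srun task per GPU, Lightning's SLURM environment detection assigns
# each task its own device; devices/num_nodes should match the SBATCH request.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    num_nodes=1,
    strategy="ddp",
)
# trainer.fit(model, datamodule=datamodule)  # model/data as defined in mymain.py (not shown)
```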