Error in tqdm when using SLURM cluster and multiple GPUs. #15854
Unanswered
rmchurch asked this question in DDP / multi-GPU / multi-node
I am trying to run PyTorch Lightning on a SLURM cluster with 4 GPUs per node. I set up my SLURM script to run on a single node with 4 processes (1 GPU per process; see script below). I get an error from the tqdm progress bar that only appears during multi-GPU training (single-GPU training works fine).
I have tried using the Rich progress bar instead, but get no output. This seems to be a known tqdm issue (tqdm/tqdm#624), but since several people apparently do run DDP with Lightning on SLURM clusters, I assume it is either an issue specific to my cluster or that there are other workarounds. I would be fine with the progress bar output simply being logged to the SLURM output file.
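One possible workaround along those lines, sketched below, is to disable the built-in tqdm bar entirely and log progress as plain lines from rank zero, which SLURM then captures in the job's output file. This is only an illustration against pytorch-lightning 1.6.x: the `LogProgress` callback and its `log_every_n_batches` parameter are invented for this sketch, and hook signatures can differ slightly between Lightning versions.

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_info


class LogProgress(Callback):
    """Hypothetical stand-in for the tqdm bar: emits plain log lines that
    end up in the SLURM job's output file instead of a live progress bar."""

    def __init__(self, log_every_n_batches: int = 50):
        self.log_every_n_batches = log_every_n_batches

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if (batch_idx + 1) % self.log_every_n_batches == 0:
            # rank_zero_info prints from global rank 0 only, so the four DDP
            # processes do not write interleaved lines.
            rank_zero_info(
                f"epoch {trainer.current_epoch} "
                f"batch {batch_idx + 1}/{trainer.num_training_batches}"
            )


# Disable the default tqdm bar and attach the logging callback instead.
trainer = pl.Trainer(enable_progress_bar=False, callbacks=[LogProgress()])
```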
pytorch-lightning 1.6.5
tqdm 4.49.0
mymain.py
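The SLURM script and mymain.py themselves are not reproduced above. As a rough sketch only (partition names, resource values, and the placeholder comments are assumptions, not the author's actual files), a single-node, 4-GPU DDP launch like the one described typically pairs SBATCH directives requesting one task per GPU with a Trainer configured for 4 devices:

```python
# Assumed SLURM directives (one srun task per GPU); the author's real script may differ:
#
#   #SBATCH --nodes=1
#   #SBATCH --ntasks-per-node=4
#   #SBATCH --gres=gpu:4
#   srun python mymain.py
#
import pytorch_lightning as pl

# With one srun task per GPU, Lightning's SLURM environment detection assigns
# each task its own device; devices/num_nodes should match the SBATCH request.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    num_nodes=1,
    strategy="ddp",
)
# trainer.fit(model, datamodule=datamodule)  # model/data as defined in mymain.py (not shown)
```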