Single-Node Multi-GPU Training Stuck #6509
-
I saw the following warning:
So I tested with
The GPUs are at 100%:
But nothing happens. What could it be?
-
It finally works when I use Horovod with the Gloo interface.
-
Thank you for your effort! I also tried your 'horovod' code and it worked, but I wanted to use 'ddp'. I finally found #4471 (comment): passing 'rank_zero_only=True' to the self.log() function solved this issue.
-
I'm having a similar issue, even when using the example MNIST code from the home page README but changing to
-
@andrewssobral I'm having pretty much exactly the same issue that you had. Were you running on a SLURM HPC cluster? If so, how did you install Horovod?
-
I recommend moving the code

    if not os.path.exists("MNIST"):
        wget.download("https://activeeon-public.s3.eu-west-2.amazonaws.com/datasets/MNIST.new.tar.gz", "MNIST.tar.gz")
        tar = tarfile.open("MNIST.tar.gz", "r:gz")
        tar.extractall()
        tar.close()

to the prepare_data method. Otherwise you risk a race condition or corrupted files, because multiple workers attempt to download the same files at once.
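A minimal sketch of that suggestion, under some assumptions: `MNISTData` is a hypothetical plain-Python stand-in (in a real project it would subclass `pytorch_lightning.LightningDataModule`), `urllib.request.urlretrieve` is swapped in for `wget.download` so the snippet needs only the standard library, and the `fetch` parameter is an illustrative hook, not part of Lightning. Like the original snippet, it assumes the archive unpacks into a `MNIST` directory.

```python
import os
import tarfile
import urllib.request

MNIST_URL = "https://activeeon-public.s3.eu-west-2.amazonaws.com/datasets/MNIST.new.tar.gz"


class MNISTData:
    """Hypothetical stand-in; a real version would subclass LightningDataModule."""

    def __init__(self, root=".", url=MNIST_URL, fetch=urllib.request.urlretrieve):
        self.root = root
        self.url = url
        self.fetch = fetch  # illustrative hook so the downloader can be replaced

    def prepare_data(self):
        # Lightning calls this hook from a single process per node by default,
        # so the download and extraction are not repeated by every GPU worker.
        if os.path.exists(os.path.join(self.root, "MNIST")):
            return  # already downloaded and extracted
        archive = os.path.join(self.root, "MNIST.tar.gz")
        self.fetch(self.url, archive)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(self.root)

    def setup(self, stage=None):
        # Runs on every process: only reference files prepare_data produced.
        self.data_dir = os.path.join(self.root, "MNIST")
```

Keeping the download out of module-level code and out of `setup` is the point here: anything at import time or in `setup` runs once per process under DDP, which is where concurrent writes to the same archive can happen.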
-
Hello everyone!
I am trying to launch a single-node multi-GPU training script, but I don't get any warning or error message, and the script stays stuck for a long time with nothing happening. Screenshot below:
The script was launched on a multi-GPU node (4 Tesla K80 GPUs), as you can see below:
nvidia-smi info header:
When the script is "running", I see the following behavior in nvtop:
I waited for around 30 minutes and nothing happened. I still have the following output:
Please see below my source code:
I'm using the following setup ($ pip freeze):
Does anyone know what's happening? Is something wrong in the source code? Thanks in advance!