Lightning in Jupyter notebook server on SLURM #16846
pytorch_lightning v1.9.2

Is there a way to bypass/disable automatic Slurm environment detection and instantiation? I'm starting a Jupyter notebook server through a Slurm batch job, and the Trainer seems to be getting confused. Here are the various combinations of parameters and configurations that I have attempted, with no success:

- `strategy="dp"` in the notebook, or in a Python training script run from a terminal on the server - works, but pytorch only sees …
- `strategy='ddp'` in a Python training script run from a terminal on the server - hangs on …
- `strategy='ddp_notebook'` in the notebook - complains that whichever port is randomly (or manually) chosen is already in use.

When I specify …
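For reference, a minimal sketch of the kind of override being asked about, assuming pytorch_lightning 1.9, where one commonly suggested way to stop the Trainer from auto-selecting the SLURM environment is to pass a cluster-environment plugin explicitly (the device count is a placeholder):

```python
# Hedged sketch: explicitly pass a cluster environment so the Trainer does not
# auto-detect SLURM. Assumes pytorch_lightning 1.9; devices=2 is illustrative.
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import LightningEnvironment

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                         # placeholder GPU count
    strategy="ddp_notebook",
    plugins=[LightningEnvironment()],  # overrides the auto-detected SLURMEnvironment
)
```

Whether this alone fixes the symptoms above isn't confirmed here; the reply below solves it at the Slurm allocation level instead.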
Replies: 1 comment
I figured it out. In case anyone else is attempting to follow me into this soul-sucking morass of frustration, here's how you solve it.

If you're attempting to run a Jupyter notebook server on a Slurm-provisioned instance and use Lightning with strategy `ddp_notebook`, then in your `sbatch`, `salloc`, or `srun` request set the following (see the sketch after this list):

- `nodes=1` - I'm not sure it's possible to run a Jupyter notebook server distributed over multiple nodes, so this is 1.
- `cpus-per-task=<total number of cpus desired>` - the pytorch dataloader will only be able to see this many processors, not this value times the number of tasks. So, if you have 4 tasks and 5 cpus-per-task, pytorch will only utilize 5 cpus.
- `ntasks-per-node=1` - setting this to 1 doesn't seem to a…
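For completeness, a minimal sketch of the notebook-side Trainer that would pair with an allocation like the one above; everything here (the device count, the model, the dataset) is illustrative rather than taken from the thread:

```python
# Notebook cell on the single allocated node. Hedged sketch assuming
# pytorch_lightning 1.9; MyModel / my_dataset are hypothetical placeholders.
import os

import pytorch_lightning as pl
from torch.utils.data import DataLoader

# DataLoader workers are bounded by --cpus-per-task (not tasks * cpus-per-task),
# so read that value from the environment Slurm sets inside the allocation.
num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                # GPUs visible on the one allocated node (placeholder)
    num_nodes=1,              # matches nodes=1 in the allocation
    strategy="ddp_notebook",  # fork-based DDP that can be launched from Jupyter
)
# trainer.fit(MyModel(), DataLoader(my_dataset, num_workers=num_workers))
```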