Lightning in Jupyter notebook server on SLURM #16846
pytorch_lightning v1.9.2

Is there a way to bypass/disable automatic Slurm environment detection and instantiation? I'm starting a Jupyter notebook server through a Slurm batch job, and the Trainer seems to be getting confused. Here are the various combinations of parameters and configurations that I have attempted, with no success:

- `strategy="dp"` in the notebook, or in a Python training script run from a terminal on the server - works, but pytorch only sees …
- `strategy='ddp'` in a Python training script run from a terminal on the server - hangs on …
- `strategy='ddp_notebook'` in the notebook - complains that whichever port is randomly (or manually) chosen is already in use.

When I specify …
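For reference, a minimal sketch of the kind of override being asked about, assuming pytorch_lightning 1.9, where one commonly suggested way to stop the Trainer from auto-selecting the SLURM environment is to pass a cluster-environment plugin explicitly (the device count is a placeholder):

```python
# Hedged sketch: explicitly pass a cluster environment so the Trainer does not
# auto-detect SLURM. Assumes pytorch_lightning 1.9; devices=2 is illustrative.
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import LightningEnvironment

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                         # placeholder GPU count
    strategy="ddp_notebook",
    plugins=[LightningEnvironment()],  # overrides the auto-detected SLURMEnvironment
)
```

Whether this alone fixes the symptoms above isn't confirmed here; the reply below solves it at the Slurm allocation level instead.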
Replies: 1 comment
I figured it out. In case anyone else is attempting to follow me into this soul-sucking morass of frustration, here's how you solve it.

If you're attempting to run a Jupyter notebook server on a Slurm-provisioned instance and use Lightning with strategy `ddp_notebook`, then in your `sbatch`, `salloc`, or `srun` request set the following (see the sketch after this list):

- `nodes=1` - I'm not sure it's possible to run a Jupyter notebook server distributed over multiple nodes, so this is 1.
- `cpus-per-task=<total number of cpus desired>` - the pytorch dataloader will only be able to see this many processors, not this value times the number of tasks. So, if you have 4 tasks and 5 cpus-per-task, pytorch will only utilize 5 cpus.
- `ntasks-per-node=1` - setting this to 1 doesn't seem to a…
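For completeness, a minimal sketch of the notebook-side Trainer that would pair with an allocation like the one above; everything here (the device count, the model, the dataset) is illustrative rather than taken from the thread:

```python
# Notebook cell on the single allocated node. Hedged sketch assuming
# pytorch_lightning 1.9; MyModel / my_dataset are hypothetical placeholders.
import os

import pytorch_lightning as pl
from torch.utils.data import DataLoader

# DataLoader workers are bounded by --cpus-per-task (not tasks * cpus-per-task),
# so read that value from the environment Slurm sets inside the allocation.
num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                # GPUs visible on the one allocated node (placeholder)
    num_nodes=1,              # matches nodes=1 in the allocation
    strategy="ddp_notebook",  # fork-based DDP that can be launched from Jupyter
)
# trainer.fit(MyModel(), DataLoader(my_dataset, num_workers=num_workers))
```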