RuntimeError: unable to open shared memory object </torch_91130_1372465664> in read-write mode #8524
Answered by carmocca
EvanZ asked this question in DDP / multi-GPU / multi-node
I'm getting the following error after setting up an EC2 p3.8xlarge instance (so 4 GPUs) and setting gpus=4:

```
/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:524: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  'You requested multiple GPUs but did not specify a backend, e.g.'
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
File "train.py", line 79, in <module>
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/tuner/tuning.py", line 197, in lr_find
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 688, in tune
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/tuner/tuning.py", line 54, in _tune
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/tuner/lr_finder.py", line 250, in lr_find
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/tuner/tuning.py", line 64, in _run
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 122, in start_training
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 328, in reduce_storage
RuntimeError: unable to open shared memory object </torch_91130_1372465664> in read-write mode
```

My code runs fine on a single-GPU instance. Any idea what I need to look at here?
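For reference, the warning at the top points at the `accelerator` argument of the `Trainer`; below is a minimal sketch of passing a backend explicitly instead of letting Lightning fall back to `ddp_spawn` (assuming the Lightning 1.3-era API visible in the traceback paths):

```python
import pytorch_lightning as pl

# Minimal sketch, assuming the Lightning 1.3.x Trainer API from the traceback paths.
# An explicit backend avoids the 'Setting `accelerator="ddp_spawn"` for you' fallback;
# "ddp" launches one subprocess per GPU instead of using torch.multiprocessing.spawn.
trainer = pl.Trainer(
    gpus=4,             # the four GPUs on the p3.8xlarge
    accelerator="ddp",  # explicit backend, as the warning suggests
)
```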
Answered by carmocca on Jul 23, 2021
Some quick googling 🔍: facebookresearch/maskrcnn-benchmark#103

This issue is not Lightning related, so if the fixes mentioned there do not help, then you should try asking on PyTorch discussions.
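As a rough pointer, workarounds commonly suggested for this error revolve around the process hitting its open-file limit when tensors are shared between spawned workers; the sketch below lists the usual candidates (these are general suggestions for this class of error, not steps verified on the asker's instance):

```python
import torch.multiprocessing as mp

# Sketch of commonly suggested workarounds for
# "unable to open shared memory object ... in read-write mode".
# Assumes the root cause is the per-process open-file limit (check it with `ulimit -n`).

# Option 1: switch to the file_system sharing strategy, which keeps far fewer
# file descriptors open than the default file_descriptor strategy.
mp.set_sharing_strategy("file_system")

# Option 2 (shell, not Python): raise the limit before launching training,
# e.g. `ulimit -n 65535`, and/or reduce DataLoader num_workers so fewer
# shared-memory handles are needed at once.
```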