[E socket.cpp:922] [c10d] The client socket has timed out after 1800s #19257
Unanswered
kfoynt asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment · 1 reply
- Any news with this? (1 reply)
Hi,
I am trying to train a model using 2 GPUs on 1 node with SLURM, but I am getting the following error:
[E socket.cpp:922] [c10d] The client socket has timed out after 1800s while trying to connect to (watgpu108, 19747).
Traceback (most recent call last):
  File "/u1/kfountou/two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py", line 681, in <module>
    main(args)
  File "/u1/kfountou/two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py", line 668, in main
    trainer.fit(lightning_model, my_dataset)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 947, in _run
    self.strategy.setup_environment()
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 147, in setup_environment
    self.setup_distributed()
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 198, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 290, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
    return TCPStore(
TimeoutError: The client socket has timed out after 1800s while trying to connect to (watgpu108, 19747).
srun: error: watgpu108: task 1: Exited with exit code 1
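For context, the step the traceback ends in is the c10d TCP rendezvous: every non-master task opens a TCPStore client to MASTER_ADDR/MASTER_PORT and waits until rank 0 is reachable. Below is a minimal sketch of just that step, using the host and port from my error message; this is not code from my script, only an illustration of what is timing out:

```python
# Minimal sketch of the rendezvous step that times out: a non-master rank
# connecting to the TCPStore that rank 0 is supposed to host.
# Host/port below are the values printed in the error, used for illustration only.
from datetime import timedelta
from torch.distributed import TCPStore

store = TCPStore(
    host_name="watgpu108",            # MASTER_ADDR chosen by the launcher
    port=19747,                       # MASTER_PORT chosen by the launcher
    world_size=2,
    is_master=False,                  # task 1 connects as a client
    timeout=timedelta(seconds=1800),
)
# If rank 0 never creates the master store on that host/port (or the port is
# unreachable), this constructor blocks for the full timeout and then raises
# TimeoutError, which matches the failure above.
```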
Here is my sbatch file:
#SBATCH --nodes=1
#SBATCH --mem=96GB
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
source activate jupyter-server
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
srun python two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py
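(Not part of my actual script, but as a diagnostic I could add something like the following at the top of the training script; the variable list is just my guess at what is relevant for the rendezvous:)

```python
# Diagnostic sketch: print the rendezvous-related environment on every srun task,
# before building the Trainer, to check that both tasks agree on
# MASTER_ADDR/MASTER_PORT and receive distinct SLURM ranks.
import os

for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "NODE_RANK",
            "SLURM_PROCID", "SLURM_NTASKS", "SLURM_NODELIST", "SLURM_JOB_ID"):
    print(f"{key}={os.environ.get(key)}", flush=True)
```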
and here is my trainer:
trainer = pl.Trainer(precision="bf16-mixed", accelerator="gpu", devices=2, num_nodes=1, strategy='ddp', max_epochs=100000)
I am very stuck on this. I have been googling and trying potential solutions for hours, but I still get the same problem.
I tried changing the backend to gloo, but I get the same issue.
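For reference, this is roughly how I would express the backend and timeout explicitly through DDPStrategy (a sketch assuming Lightning 2.x; the timeout value is just an example, not something I have confirmed helps):

```python
# Sketch: making the process-group backend and rendezvous timeout explicit
# via DDPStrategy instead of the plain strategy='ddp' string.
from datetime import timedelta

import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

strategy = DDPStrategy(
    process_group_backend="gloo",      # or "nccl"
    timeout=timedelta(seconds=1800),   # how long init_process_group waits
)

trainer = pl.Trainer(
    precision="bf16-mixed",
    accelerator="gpu",
    devices=2,
    num_nodes=1,
    strategy=strategy,
    max_epochs=100000,
)
```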
Any help would be greatly appreciated.