
Timeout when applying GRPO on VLM after a few steps #690


Description

@Angericky

I followed the ClevrCount task exactly as described in 多模态GRPO完整实验过程 (the "Complete Multimodal GRPO Experiment" walkthrough).

However, after 90 steps, the training run crashed with a timeout.

  1. Environment:
  • ms-swift installed via pip from the latest source code.
  2. The vLLM server output (this block is from the vllm_serve process, not the trainer):
[rank0]:[W710 16:45:37.933498015 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=111, addr=[localhost]:42404, remote=[::ffff:0.0.0.0]:51216) returned 0, likely a timeout
[rank0]:[W710 16:53:50.544805244 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=111, addr=[localhost]:42404, remote=[::ffff:0.0.0.0]:51216) timed out after 300000ms
ERROR 07-10 16:53:50 [core.py:459] Invocation of collective_rpc method failed
ERROR 07-10 16:53:50 [core.py:459] Traceback (most recent call last):
ERROR 07-10 16:53:50 [core.py:459]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 456, in _handle_client_request
ERROR 07-10 16:53:50 [core.py:459]     output.result = method(
ERROR 07-10 16:53:50 [core.py:459]                     ^^^^^^^
ERROR 07-10 16:53:50 [core.py:459]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 306, in collective_rpc
ERROR 07-10 16:53:50 [core.py:459]     return self.model_executor.collective_rpc(method, timeout, args,
ERROR 07-10 16:53:50 [core.py:459]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 16:53:50 [core.py:459]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 07-10 16:53:50 [core.py:459]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-10 16:53:50 [core.py:459]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 16:53:50 [core.py:459]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/utils.py", line 2456, in run_method
ERROR 07-10 16:53:50 [core.py:459]     return func(*args, **kwargs)
ERROR 07-10 16:53:50 [core.py:459]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 16:53:50 [core.py:459]   File "/home/tiger/.local/lib/python3.11/site-packages/trl/scripts/vllm_serve.py", line 134, in update_named_param
ERROR 07-10 16:53:50 [core.py:459]     self.pynccl_comm.group.barrier()
ERROR 07-10 16:53:50 [core.py:459]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 216, in barrier
ERROR 07-10 16:53:50 [core.py:459]     self.broadcast_obj(None, src=i)
ERROR 07-10 16:53:50 [core.py:459]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 197, in broadcast_obj
ERROR 07-10 16:53:50 [core.py:459]     recv_obj = pickle.loads(self.store.get(key))
ERROR 07-10 16:53:50 [core.py:459]                             ^^^^^^^^^^^^^^^^^^^
ERROR 07-10 16:53:50 [core.py:459] torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /broadcast_from/1/75463
INFO:     127.0.0.1:54708 - "POST /update_named_param/ HTTP/1.1" 200 OK
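
My read of the failure: `update_named_param` broadcasts the updated tensor over NCCL and then calls `barrier()` in `vllm/distributed/utils.py`, which synchronizes the ranks by broadcasting a `None` object through a TCP store (the `broadcast_obj(None, src=i)` frame). Each rank writes its own key and then blocks in `store.get()` on the peer's key; if the peer has died or is wedged, the `get()` waits out the store's default 300 s timeout (the `300000ms` in the log) and raises `DistStoreError`. Below is a minimal single-process sketch of that rendezvous pattern, just to illustrate the failure mode; the port and key names are invented, the timeout is shortened, and the real implementation is vLLM's `StatelessProcessGroup`.

```python
# Minimal sketch of the store-based barrier that times out above.
# Illustrative only: the port and key names are made up; the real code
# lives in vllm/distributed/utils.py (StatelessProcessGroup).
import datetime
import torch.distributed as dist

# vLLM's store keeps torch's default 300 s timeout -- the "300000ms" in
# the log. Shortened here so the demo fails quickly.
store = dist.TCPStore(
    host_name="127.0.0.1",
    port=29555,
    world_size=1,
    is_master=True,
    timeout=datetime.timedelta(seconds=5),
)

# Each rank publishes its own key, then blocks reading the peer's key.
store.set("/broadcast_from/0/1", "ready")
try:
    # A crashed or stuck peer never writes its key, so get() waits out
    # the timeout and raises DistStoreError -- the error in both logs.
    store.get("/broadcast_from/1/1")
except dist.DistStoreError as exc:
    print(f"barrier timed out, as in the traceback: {exc}")
```

Note the ordering in the server log: the engine's `collective_rpc` already failed at 16:53, so the trainer-side timeout at step 91 (16:58, below) looks like a downstream symptom of that first stalled barrier rather than an independent failure.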
  3. The training process output:
{'loss': 6.333e-05, 'grad_norm': 1.42012046, 'learning_rate': 5.1e-07, 'memory(GiB)': 27.98, 'train_speed(iter/s)': 0.012482, 'completions/mean_length': 94.11458588, 'completions/min_length': 54.0, 'completions/max_length': 187.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModalAccuracyORM/mean': 0.71875, 'rewards/MultiModalAccuracyORM/std': 0.45196936, 'rewards/Format/mean': 1.0, 'rewards/Format/std': 0.0, 'reward': 1.71875, 'reward_std': 0.3573038, 'kl': 0.06140137, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.01, 'global_step/max_steps': '89/17500', 'percentage': '0.51%', 'elapsed_time': '1h 58m 36s', 'remaining_time': '16d 2h 41m 45s'}
{'loss': 6.617e-05, 'grad_norm': 1.40871879, 'learning_rate': 5.1e-07, 'memory(GiB)': 27.98, 'train_speed(iter/s)': 0.012488, 'completions/mean_length': 107.11458588, 'completions/min_length': 62.0, 'completions/max_length': 221.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModalAccuracyORM/mean': 0.54166669, 'rewards/MultiModalAccuracyORM/std': 0.50087643, 'rewards/Format/mean': 1.0, 'rewards/Format/std': 0.0, 'reward': 1.54166675, 'reward_std': 0.36645633, 'kl': 0.06262207, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.01, 'global_step/max_steps': '90/17500', 'percentage': '0.51%', 'elapsed_time': '1h 59m 52s', 'remaining_time': '16d 2h 30m 37s'}
{'loss': 7.666e-05, 'grad_norm': 1.56475962, 'learning_rate': 5.2e-07, 'memory(GiB)': 27.98, 'train_speed(iter/s)': 0.012491, 'completions/mean_length': 88.54167175, 'completions/min_length': 60.0, 'completions/max_length': 151.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModalAccuracyORM/mean': 0.76041669, 'rewards/MultiModalAccuracyORM/std': 0.42906979, 'rewards/Format/mean': 1.0, 'rewards/Format/std': 0.0, 'reward': 1.76041675, 'reward_std': 0.40977556, 'kl': 0.07421875, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.01, 'global_step/max_steps': '91/17500', 'percentage': '0.52%', 'elapsed_time': '2h 1m 11s', 'remaining_time': '16d 2h 23m 29s'}
Train:   1%|▋                                                                                                                           | 91/17500 [2:01:11<379:29:29, 78.47s/it][rank0]:[W710 16:58:50.903217598 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=58, addr=[localhost]:42412, remote=[localhost]:51216) returned 0, likely a timeout
[rank0]:[W710 16:58:50.906560343 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=58, addr=[localhost]:42412, remote=[localhost]:51216) timed out after 300000ms
[INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
Traceback (most recent call last):
  File "/opt/tiger/vlm/ms-swift/swift/cli/rlhf.py", line 5, in <module>
    rlhf_main()
  File "/opt/tiger/vlm/ms-swift/swift/llm/train/rlhf.py", line 172, in rlhf_main
    return SwiftRLHF(args).main()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/llm/base.py", line 49, in main
    result = self.run()
             ^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/llm/train/sft.py", line 122, in run
    return self.train(trainer)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/llm/train/sft.py", line 183, in train
    trainer.train(trainer.args.resume_from_checkpoint)
  File "/opt/tiger/vlm/ms-swift/swift/trainers/mixin.py", line 419, in train
    res = super().train(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1393, in training_step
    return super().training_step(model, inputs, num_items_in_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3730, in training_step
    inputs = self._prepare_inputs(inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/trl/extras/profiling.py", line 98, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 348, in _prepare_inputs
    generation_batch = self._generate_and_score_completions(generation_batch)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 873, in _generate_and_score_completions
    inputs = self._generate_completions(inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 854, in _generate_completions
    inputs, outputs = self._fast_infer(inputs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 806, in _fast_infer
    self._move_model_to_vllm()
  File "/home/tiger/.local/lib/python3.11/site-packages/trl/extras/profiling.py", line 98, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 535, in _move_model_to_vllm
    self.vllm_client.update_named_param(name, param.data)
  File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/vllm_client.py", line 200, in update_named_param
    self.pynccl_comm.group.barrier()
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 216, in barrier
    self.broadcast_obj(None, src=i)
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 197, in broadcast_obj
    recv_obj = pickle.loads(self.store.get(key))
                            ^^^^^^^^^^^^^^^^^^^
torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /broadcast_from/0/75465
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/cli/rlhf.py", line 5, in <module>
[rank0]:     rlhf_main()
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/llm/train/rlhf.py", line 172, in rlhf_main
[rank0]:     return SwiftRLHF(args).main()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/llm/base.py", line 49, in main
[rank0]:     result = self.run()
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/llm/train/sft.py", line 122, in run
[rank0]:     return self.train(trainer)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/llm/train/sft.py", line 183, in train
[rank0]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/trainers/mixin.py", line 419, in train
[rank0]:     res = super().train(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1393, in training_step
[rank0]:     return super().training_step(model, inputs, num_items_in_batch)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3730, in training_step
[rank0]:     inputs = self._prepare_inputs(inputs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/trl/extras/profiling.py", line 98, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 348, in _prepare_inputs
[rank0]:     generation_batch = self._generate_and_score_completions(generation_batch)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 873, in _generate_and_score_completions
[rank0]:     inputs = self._generate_completions(inputs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 854, in _generate_completions
[rank0]:     inputs, outputs = self._fast_infer(inputs)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 806, in _fast_infer
[rank0]:     self._move_model_to_vllm()
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/trl/extras/profiling.py", line 98, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 535, in _move_model_to_vllm
[rank0]:     self.vllm_client.update_named_param(name, param.data)
[rank0]:   File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/vllm_client.py", line 200, in update_named_param
[rank0]:     self.pynccl_comm.group.barrier()
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 216, in barrier
[rank0]:     self.broadcast_obj(None, src=i)
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 197, in broadcast_obj
[rank0]:     recv_obj = pickle.loads(self.store.get(key))
[rank0]:                             ^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /broadcast_from/0/75465
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
E0710 16:59:00.603000 1110465 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1110540) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 922, in <module>
    main()
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/tiger/vlm/ms-swift/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-10_16:58:59
  host      : n176-080-218.byted.org
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1110540)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
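
For context on where this lands in the training loop: the traceback frames show that every fast-inference step pushes the fresh weights to the vLLM server one named tensor at a time via `_move_model_to_vllm`. A hedged reconstruction from the traceback alone (the real ms-swift function does more, e.g. adapter handling, so treat this as a sketch):

```python
# Reconstruction of the sync loop from the traceback frames; not the
# actual ms-swift source, which handles more cases than shown here.
import torch

def move_model_to_vllm(model: torch.nn.Module, vllm_client) -> None:
    for name, param in model.named_parameters():
        # Each call POSTs /update_named_param/ to the server, broadcasts
        # the tensor over NCCL, and joins the store-backed barrier that
        # raised DistStoreError above.
        vllm_client.update_named_param(name, param.data)
```

A VLM has thousands of named parameters, so one sync runs thousands of these barriers; once the server engine has died on the failed `collective_rpc` at 16:53, the next barrier on the trainer side stalls for the full 300 s and then raises, which matches the step-91 crash.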
