I followed the ClevrCount task exactly as described in 多模态GRPO完整实验过程 (the end-to-end multimodal GRPO experiment guide). However, after about 90 steps, the training run crashed.
- Environment:
- ms-swift installed via pip from the latest source code.
- The vLLM server (vllm_serve) output:
[rank0]:[W710 16:45:37.933498015 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=111, addr=[localhost]:42404, remote=[::ffff:0.0.0.0]:51216) returned 0, likely a timeout
[rank0]:[W710 16:53:50.544805244 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=111, addr=[localhost]:42404, remote=[::ffff:0.0.0.0]:51216) timed out after 300000ms
ERROR 07-10 16:53:50 [core.py:459] Invocation of collective_rpc method failed
ERROR 07-10 16:53:50 [core.py:459] Traceback (most recent call last):
ERROR 07-10 16:53:50 [core.py:459] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 456, in _handle_client_request
ERROR 07-10 16:53:50 [core.py:459] output.result = method(
ERROR 07-10 16:53:50 [core.py:459] ^^^^^^^
ERROR 07-10 16:53:50 [core.py:459] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 306, in collective_rpc
ERROR 07-10 16:53:50 [core.py:459] return self.model_executor.collective_rpc(method, timeout, args,
ERROR 07-10 16:53:50 [core.py:459] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 16:53:50 [core.py:459] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 07-10 16:53:50 [core.py:459] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-10 16:53:50 [core.py:459] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 16:53:50 [core.py:459] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/utils.py", line 2456, in run_method
ERROR 07-10 16:53:50 [core.py:459] return func(*args, **kwargs)
ERROR 07-10 16:53:50 [core.py:459] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 16:53:50 [core.py:459] File "/home/tiger/.local/lib/python3.11/site-packages/trl/scripts/vllm_serve.py", line 134, in update_named_param
ERROR 07-10 16:53:50 [core.py:459] self.pynccl_comm.group.barrier()
ERROR 07-10 16:53:50 [core.py:459] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 216, in barrier
ERROR 07-10 16:53:50 [core.py:459] self.broadcast_obj(None, src=i)
ERROR 07-10 16:53:50 [core.py:459] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 197, in broadcast_obj
ERROR 07-10 16:53:50 [core.py:459] recv_obj = pickle.loads(self.store.get(key))
ERROR 07-10 16:53:50 [core.py:459] ^^^^^^^^^^^^^^^^^^^
ERROR 07-10 16:53:50 [core.py:459] torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /broadcast_from/1/75463
INFO: 127.0.0.1:54708 - "POST /update_named_param/ HTTP/1.1" 200 OK
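Both timeouts come from the barrier in vLLM's stateless process group, which runs over a TCPStore: per the `broadcast_obj` frame above, each rank `set()`s a key and `get()`s the keys written by the other ranks, so if the peer process never writes its key (e.g. it hung or died mid-update), `get()` blocks until the 300 s store timeout and raises DistStoreError. A minimal standalone sketch of that failure mode (illustration only, not ms-swift/vLLM code; the key name is copied from the log):

```python
# Minimal sketch: a TCPStore get() on a key that no peer ever set()s blocks
# for the store timeout and then raises -- the same error as in the log above,
# where the counterpart process never reached the matching broadcast.
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore(
    "127.0.0.1", 29599,
    world_size=1, is_master=True,   # single process, so the store starts immediately
    timeout=timedelta(seconds=5),   # the real default is 300 s (300000 ms)
)

try:
    # Key name copied from the log for illustration; nothing ever set() it.
    store.get("broadcast_from/1/75463")
except Exception as e:              # torch.distributed.DistStoreError on recent torch
    print(type(e).__name__, ":", e)
```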
- Trainer output:
{'loss': 6.333e-05, 'grad_norm': 1.42012046, 'learning_rate': 5.1e-07, 'memory(GiB)': 27.98, 'train_speed(iter/s)': 0.012482, 'completions/mean_length': 94.11458588, 'completions/min_length': 54.0, 'completions/max_length': 187.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModalAccuracyORM/mean': 0.71875, 'rewards/MultiModalAccuracyORM/std': 0.45196936, 'rewards/Format/mean': 1.0, 'rewards/Format/std': 0.0, 'reward': 1.71875, 'reward_std': 0.3573038, 'kl': 0.06140137, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.01, 'global_step/max_steps': '89/17500', 'percentage': '0.51%', 'elapsed_time': '1h 58m 36s', 'remaining_time': '16d 2h 41m 45s'}
{'loss': 6.617e-05, 'grad_norm': 1.40871879, 'learning_rate': 5.1e-07, 'memory(GiB)': 27.98, 'train_speed(iter/s)': 0.012488, 'completions/mean_length': 107.11458588, 'completions/min_length': 62.0, 'completions/max_length': 221.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModalAccuracyORM/mean': 0.54166669, 'rewards/MultiModalAccuracyORM/std': 0.50087643, 'rewards/Format/mean': 1.0, 'rewards/Format/std': 0.0, 'reward': 1.54166675, 'reward_std': 0.36645633, 'kl': 0.06262207, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.01, 'global_step/max_steps': '90/17500', 'percentage': '0.51%', 'elapsed_time': '1h 59m 52s', 'remaining_time': '16d 2h 30m 37s'}
{'loss': 7.666e-05, 'grad_norm': 1.56475962, 'learning_rate': 5.2e-07, 'memory(GiB)': 27.98, 'train_speed(iter/s)': 0.012491, 'completions/mean_length': 88.54167175, 'completions/min_length': 60.0, 'completions/max_length': 151.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModalAccuracyORM/mean': 0.76041669, 'rewards/MultiModalAccuracyORM/std': 0.42906979, 'rewards/Format/mean': 1.0, 'rewards/Format/std': 0.0, 'reward': 1.76041675, 'reward_std': 0.40977556, 'kl': 0.07421875, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.01, 'global_step/max_steps': '91/17500', 'percentage': '0.52%', 'elapsed_time': '2h 1m 11s', 'remaining_time': '16d 2h 23m 29s'}
Train: 1%|▋ | 91/17500 [2:01:11<379:29:29, 78.47s/it][rank0]:[W710 16:58:50.903217598 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=58, addr=[localhost]:42412, remote=[localhost]:51216) returned 0, likely a timeout
[rank0]:[W710 16:58:50.906560343 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=58, addr=[localhost]:42412, remote=[localhost]:51216) timed out after 300000ms
[INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
Traceback (most recent call last):
File "/opt/tiger/vlm/ms-swift/swift/cli/rlhf.py", line 5, in <module>
rlhf_main()
File "/opt/tiger/vlm/ms-swift/swift/llm/train/rlhf.py", line 172, in rlhf_main
return SwiftRLHF(args).main()
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/llm/base.py", line 49, in main
result = self.run()
^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/llm/train/sft.py", line 122, in run
return self.train(trainer)
^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/llm/train/sft.py", line 183, in train
trainer.train(trainer.args.resume_from_checkpoint)
File "/opt/tiger/vlm/ms-swift/swift/trainers/mixin.py", line 419, in train
res = super().train(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1393, in training_step
return super().training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3730, in training_step
inputs = self._prepare_inputs(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/trl/extras/profiling.py", line 98, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 348, in _prepare_inputs
generation_batch = self._generate_and_score_completions(generation_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 873, in _generate_and_score_completions
inputs = self._generate_completions(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 854, in _generate_completions
inputs, outputs = self._fast_infer(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 806, in _fast_infer
self._move_model_to_vllm()
File "/home/tiger/.local/lib/python3.11/site-packages/trl/extras/profiling.py", line 98, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 535, in _move_model_to_vllm
self.vllm_client.update_named_param(name, param.data)
File "/opt/tiger/vlm/ms-swift/swift/trainers/rlhf_trainer/vllm_client.py", line 200, in update_named_param
self.pynccl_comm.group.barrier()
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 216, in barrier
self.broadcast_obj(None, src=i)
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/distributed/utils.py", line 197, in broadcast_obj
recv_obj = pickle.loads(self.store.get(key))
^^^^^^^^^^^^^^^^^^^
torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /broadcast_from/0/75465
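For context, the weight-sync path visible in the traceback is roughly the following (a reconstructed sketch, not the actual ms-swift/trl source; the frame references are to the files named above):

```python
# Hypothetical simplification of _move_model_to_vllm() (grpo_trainer.py:535)
# and vllm_client.update_named_param() (vllm_client.py:200).
def move_model_to_vllm(model, vllm_client):
    for name, param in model.named_parameters():
        # For each parameter:
        #   1. an HTTP POST /update_named_param/ tells the server which tensor
        #      to expect (the "200 OK" in the server log above),
        #   2. the tensor is NCCL-broadcast from the trainer to the server,
        #   3. both sides call pynccl_comm.group.barrier(), which exchanges
        #      pickled objects through a TCPStore.
        # Step 3 is where both processes got stuck: the server timed out on
        # /broadcast_from/1/75463 and the trainer on /broadcast_from/0/75465,
        # i.e. each side was waiting for a key the other never wrote,
        # consistent with the broadcast in step 2 stalling on one side.
        vllm_client.update_named_param(name, param.data)
```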
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
E0710 16:59:00.603000 1110465 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1110540) of binary: /usr/bin/python3
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/tiger/vlm/ms-swift/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-07-10_16:58:59
host : n176-080-218.byted.org
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1110540)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
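One more observation that may help triage: the HTTP side of the server was still answering while the store barrier was stuck (the POST /update_named_param/ above returned 200 OK), so a plain liveness probe does not catch this state. For reference, this is the kind of check I used (a hedged sketch; it assumes the trl vllm_serve process exposes a /health/ route and that host/port match the --vllm_server_host / --vllm_server_port used for training):

```python
# Hedged diagnostic sketch: probe the weight-sync server over HTTP.
import requests

def vllm_server_alive(host: str = "127.0.0.1", port: int = 8000,
                      timeout: float = 5.0) -> bool:
    """Return True if the vllm_serve process still answers HTTP."""
    try:
        resp = requests.get(f"http://{host}:{port}/health/", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("vLLM server reachable:", vllm_server_alive())
```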