
Cannot train Qwen2.5 with multiple GPUs on a single machine #970

Open
1571859588 opened this issue Dec 9, 2024 · 1 comment

Comments


1571859588 commented Dec 9, 2024

Command:

CUDA_VISIBLE_DEVICES=1,2,3,4 NPROC_PER_NODE=4 xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py

internlm2_chat_1_8b_dpo_full_copy.py

Based on the example config, I changed the dataset and the model to Qwen2.5-32B.
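
For reference, a minimal sketch of the kind of edit this describes, assuming the config follows the layout of xtuner's example configs; the checkpoint path below is a placeholder, not the exact contents of my file:

# Hypothetical excerpt of internlm2_chat_1_8b_dpo_full_copy.py
import torch
from transformers import AutoModelForCausalLM

# Base model swapped from internlm2-chat-1.8b to a local Qwen2.5-32B checkpoint.
pretrained_model_name_or_path = '/path/to/Qwen2.5-32B-Instruct'  # placeholder path

# mmengine builds the LLM lazily from this dict at train time.
llm = dict(
    type=AutoModelForCausalLM.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True)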

Partial run output:

W1209 06:53:34.745000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34366 closing signal SIGTERM
W1209 06:53:34.747000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34367 closing signal SIGTERM
W1209 06:53:34.747000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34368 closing signal SIGTERM
E1209 06:53:46.195000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 3 (pid: 34369) of binary: /mnt/public/conda/envs/xtuner/bin/python
Traceback (most recent call last):
  File "/mnt/public/conda/envs/xtuner/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-09_06:53:34
  host      : is-dahisl6olik7jjio-devmachine-0
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 34369)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 34369
============================================================

Error excerpt from the output log file:

2024/12/09 02:59:17 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.

What I have tried

  1. Changed max_length to 1024 and then 512; it still fails. I'm not sure whether this is OOM (see the rough memory estimate after this list), but I tried running on 6 A100 80G GPUs and hit exactly the same problem.
  2. Ran on a single GPU: it fails immediately with an out-of-memory error, without the torch.distributed.elastic.multiprocessing.errors.ChildFailedError message shown above (which makes sense, since with a single GPU there is no distributed launch involved).
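
A rough back-of-envelope check for item 1, assuming exitcode -9 (SIGKILL with no Python traceback from the worker) means the kernel OOM killer reclaimed host RAM during model loading, and that every rank materializes a full bf16 copy of both the 32B policy and the frozen DPO reference model before any sharding kicks in; whether that assumption holds depends on the DeepSpeed/ZeRO settings in the config:

# Weights-only host-RAM estimate for the startup phase (no optimizer states,
# no activations); all numbers are approximate.
PARAMS = 32e9          # Qwen2.5-32B parameter count (roughly)
BYTES_PER_PARAM = 2    # bf16 weights
COPIES_PER_RANK = 2    # DPO: policy model + frozen reference model
RANKS = 4              # NPROC_PER_NODE=4

need_gib = PARAMS * BYTES_PER_PARAM * COPIES_PER_RANK * RANKS / 2**30

# Compare against total host RAM (Linux: first line of /proc/meminfo is "MemTotal: ... kB").
with open('/proc/meminfo') as f:
    have_gib = int(f.readline().split()[1]) / 2**20

print(f'weights alone while loading: ~{need_gib:.0f} GiB; host RAM: ~{have_gib:.0f} GiB')
# If need_gib far exceeds have_gib, the SIGKILL is consistent with the OOM killer;
# `dmesg -T | grep -i "killed process"` on the node should confirm it.

If that is what is happening, the failure would be host-memory exhaustion rather than a Qwen2.5 compatibility issue.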

I'm now wondering whether XTuner simply does not support Qwen2.5. Looking through the repo's other issues, none seem similar to mine, and an earlier question about Qwen support never got a reply.
Any help from the maintainers would be greatly appreciated!

@siyuyuan

llama-3.1-8B-instruct has the same problem.
