
Cannot train Qwen2.5 with multiple GPUs on a single machine #970

Open
1571859588 opened this issue Dec 9, 2024 · 1 comment

Comments


1571859588 commented Dec 9, 2024

Command:

CUDA_VISIBLE_DEVICES=1,2,3,4 NPROC_PER_NODE=4 xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py

internlm2_chat_1_8b_dpo_full_copy.py

Based on the example config, I changed the dataset and the model to Qwen2.5-32B.
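
For reference, a minimal sketch of the kind of edit this describes, assuming the config follows the layout of xtuner's example configs; the checkpoint path below is a placeholder, not the exact contents of my file:

# Hypothetical excerpt of internlm2_chat_1_8b_dpo_full_copy.py
import torch
from transformers import AutoModelForCausalLM

# Base model swapped from internlm2-chat-1.8b to a local Qwen2.5-32B checkpoint.
pretrained_model_name_or_path = '/path/to/Qwen2.5-32B-Instruct'  # placeholder path

# mmengine builds the LLM lazily from this dict at train time.
llm = dict(
    type=AutoModelForCausalLM.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True)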

Partial run output:

W1209 06:53:34.745000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34366 closing signal SIGTERM
W1209 06:53:34.747000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34367 closing signal SIGTERM
W1209 06:53:34.747000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34368 closing signal SIGTERM
E1209 06:53:46.195000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 3 (pid: 34369) of binary: /mnt/public/conda/envs/xtuner/bin/python
Traceback (most recent call last):
  File "/mnt/public/conda/envs/xtuner/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-09_06:53:34
  host      : is-dahisl6olik7jjio-devmachine-0
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 34369)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 34369
============================================================

Error excerpt from the output log file:

2024/12/09 02:59:17 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.

What I have tried

  1. Changed max_length to 1024 and then 512; it still fails. I'm not sure whether this is OOM (see the rough memory estimate after this list), but I tried running on 6 A100 80G GPUs and hit exactly the same problem.
  2. Ran on a single GPU: it fails immediately with an out-of-memory error, without the torch.distributed.elastic.multiprocessing.errors.ChildFailedError message shown above (which makes sense, since with a single GPU there is no distributed launch involved).
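
A rough back-of-envelope check for item 1, assuming exitcode -9 (SIGKILL with no Python traceback from the worker) means the kernel OOM killer reclaimed host RAM during model loading, and that every rank materializes a full bf16 copy of both the 32B policy and the frozen DPO reference model before any sharding kicks in; whether that assumption holds depends on the DeepSpeed/ZeRO settings in the config:

# Weights-only host-RAM estimate for the startup phase (no optimizer states,
# no activations); all numbers are approximate.
PARAMS = 32e9          # Qwen2.5-32B parameter count (roughly)
BYTES_PER_PARAM = 2    # bf16 weights
COPIES_PER_RANK = 2    # DPO: policy model + frozen reference model
RANKS = 4              # NPROC_PER_NODE=4

need_gib = PARAMS * BYTES_PER_PARAM * COPIES_PER_RANK * RANKS / 2**30

# Compare against total host RAM (Linux: first line of /proc/meminfo is "MemTotal: ... kB").
with open('/proc/meminfo') as f:
    have_gib = int(f.readline().split()[1]) / 2**20

print(f'weights alone while loading: ~{need_gib:.0f} GiB; host RAM: ~{have_gib:.0f} GiB')
# If need_gib far exceeds have_gib, the SIGKILL is consistent with the OOM killer;
# `dmesg -T | grep -i "killed process"` on the node should confirm it.

If that is what is happening, the failure would be host-memory exhaustion rather than a Qwen2.5 compatibility issue.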

I'm now wondering whether XTuner simply does not support Qwen2.5. Looking through the repo's other issues, none seem similar to mine, and an earlier question about Qwen support never got a reply.
Any help from the maintainers would be greatly appreciated!

@siyuyuan

llama-3.1-8B-instruct has the same problem.
