exits with return code = -9 #57

Open
sunshineyg2018 opened this issue Jun 24, 2023 · 1 comment

Comments

@sunshineyg2018

GPU memory
80 GB

Available host memory (RAM)
120 GB

I am also getting the message: exits with return code = -9

Configuration
deepspeed \
    --include="localhost:0" \
    ./train_sft.py \
    --deepspeed ./ds_config/ds_config_zero3.json \
    --model_name_or_path /root/base_model \
    --train_file_path /root/data \
    --do_train \
    --output_dir /root/output \
    --overwrite_output_dir \
    --preprocess_num_workers 8 \
    --num_train_epochs 800 \
    --learning_rate 1e-5 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --bf16 True \
    --save_strategy steps \
    --save_steps 400 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --tf32 True \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8
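
For context on the reported code: when a launcher reports a child's return code the way Python's `subprocess` module does, a negative value `-N` means the child was terminated by signal `N`, and on Linux signal 9 is SIGKILL. A training process that is killed this way has often been stopped by the kernel's out-of-memory killer, though that is an inference and not something the message itself proves. A minimal sketch of decoding such a code (the helper name `describe_return_code` is hypothetical, not part of DeepSpeed):

```python
import signal


def describe_return_code(code: int) -> str:
    """Describe a subprocess-style return code.

    Assumption: the code follows Python's subprocess convention,
    where a negative value -N means "terminated by signal N".
    """
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-code}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_return_code(-9))
```

If the result is SIGKILL, the kernel log (for example via `dmesg`) is a reasonable place to check whether the out-of-memory killer was involved.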

@i4never
Contributor

i4never commented Jul 3, 2023

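While the launch script is running, you can run watch -n 1 'free -h' in another terminal to observe memory usage. If host memory (RAM) is exhausted, check whether the dataset is too large. If GPU memory is exhausted, consider lowering --per_device_train_batch_size, and adjust the DeepSpeed configuration as recommended in https://huggingface.co/docs/transformers/main_classes/deepspeed#how-to-choose-which-zero-stage-and-offloads-to-use-for-best-performance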
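
If a timestamped log is more convenient than watching `free -h` by eye, the same check can be sketched in Python by reading `/proc/meminfo`. This assumes a Linux host whose kernel exposes the `MemAvailable` field; the function names, the 8 GiB warning threshold, and the sample count are all hypothetical choices for illustration.

```python
import time
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(text: str) -> float:
    """Return the MemAvailable value from meminfo text, converted to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)


def monitor(interval_s: float = 1.0, warn_below_gib: float = 8.0,
            samples: int = 60) -> None:
    """Print available host memory once per interval, flagging low values.

    The threshold and sample count are illustrative defaults only.
    """
    for _ in range(samples):
        gib = available_gib(Path("/proc/meminfo").read_text())
        flag = "  <-- low" if gib < warn_below_gib else ""
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} MemAvailable: {gib:.1f} GiB{flag}", flush=True)
        time.sleep(interval_s)
```

Calling `monitor()` in a second terminal while training starts shows whether available RAM falls steadily before the process is killed. This only covers host memory; GPU memory would need a separate tool such as `nvidia-smi`.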
