exits with return code = -9 #57

sunshineyg2018 · 2023-06-24T07:43:15Z

GPU memory
80 GB

Available RAM
120 GB

I also get the message: exits with return code = -9

Configuration
deepspeed \
  --include="localhost:0" \
  ./train_sft.py \
  --deepspeed ./ds_config/ds_config_zero3.json \
  --model_name_or_path /root/base_model \
  --train_file_path /root/data \
  --do_train \
  --output_dir /root/output \
  --overwrite_output_dir \
  --preprocess_num_workers 8 \
  --num_train_epochs 800 \
  --learning_rate 1e-5 \
  --evaluation_strategy steps \
  --eval_steps 100 \
  --bf16 True \
  --save_strategy steps \
  --save_steps 400 \
  --save_total_limit 2 \
  --logging_steps 10 \
  --tf32 True \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8

i4never · 2023-07-03T03:56:51Z

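While the launch script is running, you can run watch -n 1 'free -h' in another terminal to monitor memory usage. If system memory (RAM) is exhausted, check whether the dataset is too large. If GPU memory is exhausted, consider lowering --per_device_train_batch_size, and adjust the DeepSpeed configuration as recommended in https://huggingface.co/docs/transformers/main_classes/deepspeed#how-to-choose-which-zero-stage-and-offloads-to-use-for-best-performance.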
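
For context, a return code of -9 usually means the process was killed by signal 9 (SIGKILL), which on Linux is often the kernel's out-of-memory killer, so host RAM is the first thing worth checking. As a rough Python alternative to the watch command above, the sketch below polls /proc/meminfo and prints the available RAM. The function names (parse_meminfo, available_gib, watch_memory) and the 8 GiB warning threshold are hypothetical choices for illustration, not part of this repository or of DeepSpeed, and the script assumes a Linux host.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(info):
    """Return available RAM in GiB, falling back to MemFree if MemAvailable is absent."""
    kib = info.get("MemAvailable", info.get("MemFree", 0))
    return kib / (1024 ** 2)


def watch_memory(interval_s=1.0, warn_below_gib=8.0, max_samples=None):
    """Print available host RAM every interval_s seconds (Linux only).

    warn_below_gib is an arbitrary illustrative threshold.
    max_samples=None means run until interrupted with Ctrl+C.
    """
    n = 0
    while max_samples is None or n < max_samples:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        free = available_gib(info)
        flag = "  <-- low" if free < warn_below_gib else ""
        print(f"{time.strftime('%H:%M:%S')} available: {free:.1f} GiB{flag}")
        n += 1
        if max_samples is None or n < max_samples:
            time.sleep(interval_s)
```

Calling watch_memory() in a second terminal while training starts shows whether available RAM drops toward zero just before the -9 exit. If it does, the dataset size or preprocessing is the likely cause rather than GPU memory.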
