GPU memory: 80 GB
Available host memory (RAM): 120 GB
I am also getting the message: exits with return code = -9
Configuration: deepspeed --include="localhost:0" ./train_sft.py --deepspeed ./ds_config/ds_config_zero3.json --model_name_or_path /root/base_model --train_file_path /root/data --do_train --output_dir /root/output --overwrite_output_dir --preprocess_num_workers 8 --num_train_epochs 800 --learning_rate 1e-5 --evaluation_strategy steps --eval_steps 100 --bf16 True --save_strategy steps --save_steps 400 --save_total_limit 2 --logging_steps 10 --tf32 True --per_device_train_batch_size 8 --per_device_eval_batch_size 8
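While the launch script is running, you can run watch -n 1 'free -h' in another terminal to monitor memory usage. If host memory (RAM) is exhausted, check whether the dataset is too large. If GPU memory is exhausted, consider reducing --per_device_train_batch_size, and adjust the DeepSpeed configuration as recommended at https://huggingface.co/docs/transformers/main_classes/deepspeed#how-to-choose-which-zero-stage-and-offloads-to-use-for-best-performance.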
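For anyone who wants to log this instead of watching a terminal, here is a rough Python sketch of the same check. It assumes a Linux host where /proc/meminfo is readable. The function names (describe_return_code, parse_meminfo, read_available_gib) are made up for illustration and are not part of DeepSpeed or this repository. It also decodes the return code: by the usual subprocess convention, a negative code -N means the process was terminated by signal N, so -9 is SIGKILL, which is often (though not always) the kernel's out-of-memory killer.

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable description."""
    if code < 0:
        # Convention: a negative code -N means "terminated by signal N".
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(text: str) -> float:
    """Return the MemAvailable field converted from kB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)


def read_available_gib(path: str = "/proc/meminfo") -> float:
    """Read the current available host memory (Linux only, by assumption)."""
    with open(path) as f:
        return available_gib(f.read())
```

Calling read_available_gib() once a second from a separate process and writing the value to a log shows whether host memory falls toward zero just before the job dies. If it does, the -9 most likely comes from host RAM rather than GPU memory.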