LLaMA-7B SFT died with <Signals.SIGABRT: 6> #539

@PussyCat0700

Setup: a single A100 GPU.
Fine-tuning fails with a SIGABRT: 6 error.

  • Error message
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 240, in <module>
    main()
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 228, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/yfliu/anaconda3/envs/oneflow/bin/python3', '-u', 'projects/Llama/train_net.py', '--config-file', 'projects/Llama/configs/llama_sft.py']' died with <Signals.SIGABRT: 6>.
  • My script
set -e
if [ -z "$1" ]; then
    echo "Usage: $0 <number>"
    exit 1
fi
libai_path=../libai
cd $libai_path
# scripts split in case blocks.
case $1 in
1)
# See https://github.com/Oneflow-Inc/libai/tree/main/projects/Llama for reference
# Notice:
# 1. Please make sure you have set up destination_path and checkpoint_dir
# For example, our checkpoint_dir is /data1/yfliu/models/LLaMA2/LLaMA2_hf_7B downloaded from https://llama.meta.com/llama-downloads/
# our destination dir is /data1/yfliu/alpaca
# 2. You should also modify terms in projects/Llama/configs/llama_config.py
python projects/Llama/utils/prepare_alpaca.py
;;
2)
# full finetune
# Please set the finetuning parameters in projects/Llama/configs/llama_sft.py, such as dataset_path and pretrained_model_path
# Type python3 -m oneflow.distributed.launch -h for more usage
FILE=projects/Llama/train_net.py
CONFIG=projects/Llama/configs/llama_sft.py
GPUS=1
NODE=1
NODE_RANK=0
ADDR=127.0.0.1
PORT=12345
LOGDIR=/home/yfliu/horizontal/oneflowtest/runs/llama2/oneflow

export ONEFLOW_FUSE_OPTIMIZER_UPDATE_CAST=true

python3 -m oneflow.distributed.launch \
--nproc_per_node $GPUS --nnodes $NODE --node_rank $NODE_RANK --master_addr $ADDR --master_port $PORT --logdir $LOGDIR --redirect_stdout_and_stderr \
$FILE --config-file $CONFIG
;;
esac
  • How I run the script

bash llama_sft.sh 2

The error is raised while running SFT training, and I cannot pin down where the problem actually is.
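
A note on reading the traceback: every frame in it belongs to the launcher (oneflow/distributed/launch.py), which only reports that the child process running train_net.py was killed by signal 6 (SIGABRT). The underlying failure message would have been printed by the child itself, and because the script passes --logdir together with --redirect_stdout_and_stderr, that output is presumably somewhere under $LOGDIR rather than on the terminal. The sketch below is a minimal illustration in plain Python, assuming a POSIX system; it does not use OneFlow. The helper names run_and_describe and tail_logs are hypothetical, and the layout of files inside the log directory is not assumed, so the helper simply walks every file it finds.

```python
import signal
import subprocess
import sys
from pathlib import Path


def run_and_describe(cmd):
    """Run cmd and return a short description of how the child exited."""
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as err:
        if err.returncode < 0:
            # On POSIX, a negative return code means "killed by signal N".
            return f"killed by {signal.Signals(-err.returncode).name}"
        return f"exited with status {err.returncode}"
    return "exited cleanly"


def tail_logs(logdir, n=20):
    """Return {path: last n lines} for every regular file under logdir."""
    tails = {}
    for path in sorted(Path(logdir).rglob("*")):
        if path.is_file():
            lines = path.read_text(errors="replace").splitlines()
            tails[str(path)] = lines[-n:]
    return tails


if __name__ == "__main__":
    # Stand-in child that aborts, mimicking a worker whose native code
    # calls abort(); the parent only sees the signal, not the reason.
    child = [sys.executable, "-c", "import os; os.abort()"]
    print(run_and_describe(child))

    # Optional: pass the launcher's log directory as the first argument
    # to print the tail of each file found there.
    if len(sys.argv) > 1:
        for path, lines in tail_logs(sys.argv[1]).items():
            print(f"==> {path} <==")
            print("\n".join(lines))
```

Saved under any file name and run with the LOGDIR value from the script above as its argument, this prints the last lines of each redirected log, which is where a failed check or out-of-memory message from the training process would be expected to show up. Running train_net.py directly, without the launcher, is another way to get the child's output on the terminal.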