moss-tts-local-1.5 sft NaN bug

My training script:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TRAIN_DATA="xxx"
export OUT_DIR="xxx"
accelerate launch \
    sft.py \
    --model-path xx \
    --train-jsonl "$TRAIN_DATA" \
    --output-dir "$OUT_DIR" \
    --per-device-batch-size 8 \
    --gradient-accumulation-steps 4 \
    --learning-rate 2.0e-5 \
    --warmup-ratio 0.0 \
    --lr-scheduler-type constant \
    --mixed-precision bf16 \
    --channelwise-loss-weight 1,32 \
    --gradient-checkpointing \
    --skip-nonfinite-batches
Training produces a large number of NaNs and cannot proceed normally:
warning: Non-finite gradient norm at epoch=0 global_step=1: nan. first_nonfinite_grad=module.transformer.embed_tokens.weight bad=388956160/388956160 dtype=torch.bfloat16 shape=(151936, 2560); record_ids=['205785', '55181', '191931', '193623', '160327', '31329', '34026', '194913', '219003', '98766', '48350', '36495', '149423', '58690', '242286', '112645', '82931', '217513', '71839', '42900', '128648', '18118', '119128', '76251', '135686', '177750', '26527', '157316', '162412', '41533', '154624', '179435', '234848', '53256', '243554', '201511', '119731', '131094', '101268', '70420', '127925', '106988', '34870', '211582', '101706', '119840', '119165', '225646']; skipped
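The NaNs are always reported at module.transformer.embed_tokens.weight, and bad=388956160/388956160 means every one of the 151936 x 2560 gradient entries is non-finite.
My training environment works without problems when training moss-delay. Could you please help look into this? Thanks!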
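
To help triage, the warning line can be parsed so that the affected parameter, the share of non-finite entries, and the record IDs of each skipped batch are collected across a whole log. The following is a minimal sketch, assuming every warning follows the exact format of the single line quoted above. `parse_warning` and its field names are hypothetical helpers and are not part of sft.py, and the sample line is abbreviated to two record IDs.

```python
import ast
import math
import re

# Pattern inferred from the one warning line quoted in this issue (an assumption:
# other versions of the training script may format the message differently).
WARNING_RE = re.compile(
    r"Non-finite gradient norm at epoch=(?P<epoch>\d+) "
    r"global_step=(?P<step>\d+): (?P<norm>\S+)\. "
    r"first_nonfinite_grad=(?P<param>\S+) "
    r"bad=(?P<bad>\d+)/(?P<total>\d+) "
    r"dtype=(?P<dtype>\S+) "
    r"shape=(?P<shape>\([^)]*\)); "
    r"record_ids=(?P<ids>\[[^\]]*\])"
)


def parse_warning(line):
    """Return a dict describing one warning line, or None if it does not match."""
    m = WARNING_RE.search(line)
    if m is None:
        return None
    shape = ast.literal_eval(m.group("shape"))  # e.g. (151936, 2560)
    bad = int(m.group("bad"))
    total = int(m.group("total"))
    return {
        "epoch": int(m.group("epoch")),
        "step": int(m.group("step")),
        "param": m.group("param"),
        "dtype": m.group("dtype"),
        "shape": shape,
        "bad": bad,
        "total": total,
        # True when every gradient entry of the parameter is non-finite.
        "all_bad": bad == total,
        # Sanity check: the reported total should equal the product of the shape.
        "shape_matches_total": math.prod(shape) == total,
        "record_ids": ast.literal_eval(m.group("ids")),
    }


# Abbreviated copy of the warning above (record_ids shortened for readability).
sample = (
    "warning: Non-finite gradient norm at epoch=0 global_step=1: nan. "
    "first_nonfinite_grad=module.transformer.embed_tokens.weight "
    "bad=388956160/388956160 dtype=torch.bfloat16 shape=(151936, 2560); "
    "record_ids=['205785', '55181']; skipped"
)
info = parse_warning(sample)
print(info["param"], info["all_bad"], info["shape_matches_total"], info["record_ids"])
```

Running this over the full training log would show whether the non-finite gradient is always the complete embedding matrix and whether particular record IDs recur across skipped batches, which may help separate a data problem from a numerical one.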

moss-tts-local-1.5 sft NaN bug #196
