moss-tts-local-1.5 sft NaN bug #196

Description

@ruby11dog

My training script:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TRAIN_DATA="xxx"
export OUT_DIR="xxx"
accelerate launch sft.py \
  --model-path xx \
  --train-jsonl "$TRAIN_DATA" \
  --output-dir "$OUT_DIR" \
  --per-device-batch-size 8 \
  --gradient-accumulation-steps 4 \
  --learning-rate 2.0e-5 \
  --warmup-ratio 0.0 \
  --lr-scheduler-type constant \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing \
  --skip-nonfinite-batches
This produces a large number of NaNs, and training cannot proceed normally:
warning: Non-finite gradient norm at epoch=0 global_step=1: nan. first_nonfinite_grad=module.transformer.embed_tokens.weight bad=388956160/388956160 dtype=torch.bfloat16 shape=(151936, 2560); record_ids=['205785', '55181', '191931', '193623', '160327', '31329', '34026', '194913', '219003', '98766', '48350', '36495', '149423', '58690', '242286', '112645', '82931', '217513', '71839', '42900', '128648', '18118', '119128', '76251', '135686', '177750', '26527', '157316', '162412', '41533', '154624', '179435', '234848', '53256', '243554', '201511', '119731', '131094', '101268', '70420', '127925', '106988', '34870', '211582', '101706', '119840', '119165', '225646']; skipped
Every NaN is reported at module.transformer.embed_tokens.weight.
The same training environment works without problems when training moss-delay. Could you please take a look? Thanks!
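
As a quick sanity check on the warning above, here is a minimal sketch that parses the `bad=`/`shape=` fields and confirms the bad-element count equals the full size of the tensor. The helper name `parse_nonfinite_warning` is hypothetical, and the field layout is assumed from the single log line quoted here, not from the sft.py source. If the counts match, every element of the embedding gradient is non-finite, not only a few rows; that may point to a NaN arising earlier (for example in the loss or the forward pass), though the log alone cannot confirm it.

```python
import math
import re

# Excerpt of the warning quoted above (record_ids omitted for brevity).
WARNING = (
    "warning: Non-finite gradient norm at epoch=0 global_step=1: nan. "
    "first_nonfinite_grad=module.transformer.embed_tokens.weight "
    "bad=388956160/388956160 dtype=torch.bfloat16 shape=(151936, 2560)"
)


def parse_nonfinite_warning(line):
    """Hypothetical helper: pull the parameter name, bad/total counts, and shape
    out of one warning line. The field layout is an assumption based on the log."""
    name = re.search(r"first_nonfinite_grad=(\S+)", line).group(1)
    bad, total = map(int, re.search(r"bad=(\d+)/(\d+)", line).groups())
    dims = re.search(r"shape=\(([^)]*)\)", line).group(1)
    shape = tuple(int(d) for d in dims.split(",") if d.strip())
    return {"name": name, "bad": bad, "total": total, "shape": shape}


info = parse_nonfinite_warning(WARNING)

# True only if the bad count, the reported total, and the product of the
# shape dimensions all agree, i.e. the whole gradient tensor is non-finite.
fully_nonfinite = info["bad"] == info["total"] == math.prod(info["shape"])
print(info["name"], info["shape"], "all elements non-finite:", fully_nonfinite)
```

Running the same check over every warning line in the log would show whether the gradient is always entirely non-finite or only partly so on some batches.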
