
AISHELL reproduction results do not match the README results #4

@brightLLer

Hi everyone, we reproduced the whisper large-v3 + Qwen2-7B experiment on AISHELL-1, but the model output shows obvious repetition (the last few characters of an utterance are repeated many times) as well as stray punctuation and special symbols. Raising the LLM's repetition_penalty at inference time reduces the repetition somewhat, but even after removing all punctuation the character error rate is still above 11%, far from the 5.55% reported in README.md. The whisper feature extraction in the code uses 80 mel bins, so we added an n_mels=128 argument to support large-v3. Our training command is below (a sketch of the feature-extraction change and of how we score CER follows the command):

torchrun --standalone --nnodes=1 --nproc_per_node=8 train.py \
        --llm_model_name_or_path Qwen2-7B-Instruct \
        --whisper_model_name_or_path whisper/large-v3.pt \
        --data_path aishell/train/train.jsonl \
        --eval_data_path aishell/dev/eval.jsonl \
        --bf16 True \
        --output_dir Qwen-7B-Instruct-whisper-large-v3-aishell \
        --num_train_epochs 10 \
        --per_device_train_batch_size 16 \
        --per_device_eval_batch_size 8 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 100 \
        --save_total_limit 10 \
        --learning_rate 3e-4 \
        --weight_decay 0.01 \
        --adam_beta2 0.95 \
        --warmup_ratio 0.01 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --report_to "none" \
        --model_max_length 512 \
        --n_mels 128 \
        --gradient_checkpointing \
        --dataloader_num_workers 4 \
        --dataloader_prefetch_factor 10 \
        --deepspeed ds_config_zero3.json
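
For completeness, here is roughly what the two pieces mentioned above look like. The audio file name and the scoring code are our own illustrations, not code from this repo.

The 128-mel change, assuming the openai-whisper package (recent releases expose an n_mels argument on log_mel_spectrogram; large-v3 expects 128 bins instead of the default 80):

```python
import whisper

# Placeholder AISHELL utterance name; any 16 kHz wav works here.
audio = whisper.load_audio("BAC009S0002W0122.wav")
# large-v3 expects 128 mel bins; the library default is 80.
mel = whisper.log_mel_spectrogram(audio, n_mels=128)  # shape: (128, n_frames)
```

And the scoring we describe, applied after raising repetition_penalty at inference (presumably the HuggingFace generate() argument of the same name): strip punctuation and symbols from reference and hypothesis, then compute corpus CER as total edit distance over total reference characters. This is our own scorer, not the repo's eval script:

```python
import unicodedata

def strip_punct(text: str) -> str:
    """Drop punctuation, symbols, and whitespace before character-level scoring."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S")) and not ch.isspace()
    )

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences (rolling row)."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, hc in enumerate(hyp, 1):
            cur[j] = min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (rc != hc),  # substitution
            )
        prev = cur
    return prev[len(hyp)]

def corpus_cer(refs, hyps) -> float:
    """CER = total edit distance / total reference characters, after stripping punctuation."""
    dist = sum(edit_distance(strip_punct(r), strip_punct(h)) for r, h in zip(refs, hyps))
    chars = sum(len(strip_punct(r)) for r in refs)
    return dist / max(chars, 1)

if __name__ == "__main__":
    # Toy example showing the tail-repetition pattern we see in the outputs.
    refs = ["今天天气不错。"]
    hyps = ["今天天气不错错错错"]
    print(f"CER: {corpus_cer(refs, hyps):.2%}")  # 3 insertions / 6 ref chars = 50.00%
```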
