Hi, `mean` is used to simulate multi-GPU training, and `sum` is used to simulate a large batch size. I used 8 GPUs with a max duration of 600, so simply keeping `accum_grad * max_duration * world_size` the same does not match the original setup.
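For what it's worth, here is a minimal PyTorch sketch (generic model and data, not the zipformer training loop) of how the two reductions interact with gradient accumulation: with `reduction="sum"` the accumulated gradients add up exactly as if one large batch had been processed, while with `reduction="mean"` each micro-batch loss would have to be divided by `accum_grad` to stay equivalent.

```python
# Sketch only: generic model/data, not the zipformer recipe.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # stand-in for the real network
criterion = nn.MSELoss(reduction="sum")       # sum over the micro-batch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

accum_grad = 2                                # micro-batches per optimizer step
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(4)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = criterion(model(x), y)
    # With reduction="mean" you would rescale here instead, e.g.
    # loss = loss / accum_grad, so the accumulated gradient matches
    # the mean over the combined large batch.
    loss.backward()                           # gradients accumulate in .grad
    if (step + 1) % accum_grad == 0:
        optimizer.step()
        optimizer.zero_grad()
```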
I tried running your zipformer/ code, but my model diverged at epoch 33 and pretraining ended with a `Grad scale is small` error. Throughout pretraining before the divergence, I noticed my grad scale tended to fluctuate between 0.125 and 2. Did you face the same issues?
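For context, this is roughly how I understand the grad scale to behave under fp16 training; a minimal sketch with plain `torch.cuda.amp.GradScaler` (toy model, requires a GPU, not the actual icefall training loop; the abort check at the end is a made-up stand-in for whatever produces the `Grad scale is small` error):

```python
# GradScaler halves the scale whenever non-finite gradients are found and
# grows it again after a run of clean steps, which is why the scale can
# bounce between values like 0.125 and 2 when some batches overflow.
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()

for step in range(100):
    x = torch.randn(8, 10, device="cuda")
    y = torch.randn(8, 1, device="cuda")
    optimizer.zero_grad()
    with autocast():
        loss = torch.nn.functional.mse_loss(model(x), y, reduction="sum")
    scaler.scale(loss).backward()
    scaler.step(optimizer)       # the step is skipped if grads are inf/nan
    scaler.update()              # scale shrinks on overflow, grows otherwise
    if scaler.get_scale() < 0.01:                  # made-up threshold
        raise RuntimeError("Grad scale is small")  # stand-in for the real abort
    if step % 20 == 0:
        print(step, scaler.get_scale())
```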
EDIT: I was also wondering if you tried toggling the loss reduction to `mean` instead of `sum`. Maybe that will stabilise training?

My commands are below. I adapted the batch size to my setup, maintaining the same `accum_grad * max_duration * world_size`.

```bash
# pretraining
python zipformer/pretrain.py \
  --world-size 4 \
  --use-fp16 1 \
  --num-epochs 50 \
  --manifest-dir data/raw \
  --max-duration 350 \
  --accum-grad 2 \
  --exp-dir zipformer/exp2/pretrain
```
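As a sanity check on that constraint, here is a quick calculation; the original run's `accum_grad` of 1 is an assumption, only the 8 GPUs and max duration of 600 are stated in the reply above:

```python
# Seconds of audio consumed per optimizer step, taking
# accum_grad * max_duration * world_size as the measure.
original = {"accum_grad": 1, "max_duration": 600, "world_size": 8}  # assumed accum_grad
adapted = {"accum_grad": 2, "max_duration": 350, "world_size": 4}   # commands above

for name, cfg in (("original", original), ("adapted", adapted)):
    effective = cfg["accum_grad"] * cfg["max_duration"] * cfg["world_size"]
    print(f"{name}: {effective} s per optimizer step")
# original: 4800 s, adapted: 2800 s
```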
As per your explanation, I used the same 500 k-means labels from `simple_kmeans`.
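For reference, a minimal sketch of what producing 500 frame-level k-means labels looks like, using scikit-learn over pre-dumped features; this is only an illustration, not the actual `simple_kmeans` scripts, and the file names are placeholders:

```python
# Illustration only: k-means over dumped frame-level features,
# standing in for the simple_kmeans pipeline. File names are placeholders.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feats = np.load("features.npy")      # (num_frames, feat_dim) dumped features
km = MiniBatchKMeans(n_clusters=500, batch_size=10000, max_iter=100)
km.fit(feats)                        # learn 500 cluster centroids
labels = km.predict(feats)           # one pseudo-label per frame
np.save("labels.npy", labels)
```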
Originally posted by @teowenshen in #1500 (comment)