Hi, `mean` is used to simulate multi-GPU training, and `sum` is used to simulate a large batch size. I used 8 GPUs with a max duration of 600, so simply keeping `accum_grad * max_duration * world_size` the same does not match the original setup.
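For what it's worth, here is a minimal PyTorch sketch (generic model and data, not the zipformer training loop) of how the two reductions interact with gradient accumulation: with `reduction="sum"` the accumulated gradients add up exactly as if one large batch had been processed, while with `reduction="mean"` each micro-batch loss would have to be divided by `accum_grad` to stay equivalent.

```python
# Sketch only: generic model/data, not the zipformer recipe.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # stand-in for the real network
criterion = nn.MSELoss(reduction="sum")       # sum over the micro-batch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

accum_grad = 2                                # micro-batches per optimizer step
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(4)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = criterion(model(x), y)
    # With reduction="mean" you would rescale here instead, e.g.
    # loss = loss / accum_grad, so the accumulated gradient matches
    # the mean over the combined large batch.
    loss.backward()                           # gradients accumulate in .grad
    if (step + 1) % accum_grad == 0:
        optimizer.step()
        optimizer.zero_grad()
```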
I tried running your zipformer/ code, but my model diverged at epoch 33 and pretraining ended with a `Grad scale is small` error. Throughout pretraining before the divergence, I noticed my grad scale tended to fluctuate between 0.125 and 2. Did you face the same issues?
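For context, this is roughly how I understand the grad scale to behave under fp16 training; a minimal sketch with plain `torch.cuda.amp.GradScaler` (toy model, requires a GPU, not the actual icefall training loop; the abort check at the end is a made-up stand-in for whatever produces the `Grad scale is small` error):

```python
# GradScaler halves the scale whenever non-finite gradients are found and
# grows it again after a run of clean steps, which is why the scale can
# bounce between values like 0.125 and 2 when some batches overflow.
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()

for step in range(100):
    x = torch.randn(8, 10, device="cuda")
    y = torch.randn(8, 1, device="cuda")
    optimizer.zero_grad()
    with autocast():
        loss = torch.nn.functional.mse_loss(model(x), y, reduction="sum")
    scaler.scale(loss).backward()
    scaler.step(optimizer)       # the step is skipped if grads are inf/nan
    scaler.update()              # scale shrinks on overflow, grows otherwise
    if scaler.get_scale() < 0.01:                  # made-up threshold
        raise RuntimeError("Grad scale is small")  # stand-in for the real abort
    if step % 20 == 0:
        print(step, scaler.get_scale())
```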
EDIT: I was also wondering if you tried toggling the loss reduction to `mean` instead of `sum`. Maybe that will stabilise training?

My commands are below. I adapted the batch size to my setup, maintaining the same `accum_grad * max_duration * world_size`.

```bash
# pretraining
python zipformer/pretrain.py \
  --world-size 4 \
  --use-fp16 1 \
  --num-epochs 50 \
  --manifest-dir data/raw \
  --max-duration 350 \
  --accum-grad 2 \
  --exp-dir zipformer/exp2/pretrain
```
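As a sanity check on that constraint, here is a quick calculation; the original run's `accum_grad` of 1 is an assumption, only the 8 GPUs and max duration of 600 are stated in the reply above:

```python
# Seconds of audio consumed per optimizer step, taking
# accum_grad * max_duration * world_size as the measure.
original = {"accum_grad": 1, "max_duration": 600, "world_size": 8}  # assumed accum_grad
adapted = {"accum_grad": 2, "max_duration": 350, "world_size": 4}   # commands above

for name, cfg in (("original", original), ("adapted", adapted)):
    effective = cfg["accum_grad"] * cfg["max_duration"] * cfg["world_size"]
    print(f"{name}: {effective} s per optimizer step")
# original: 4800 s, adapted: 2800 s
```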
As per your explanation, I used the same 500 k-means labels from `simple_kmeans`.
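For reference, a minimal sketch of what producing 500 frame-level k-means labels looks like, using scikit-learn over pre-dumped features; this is only an illustration, not the actual `simple_kmeans` scripts, and the file names are placeholders:

```python
# Illustration only: k-means over dumped frame-level features,
# standing in for the simple_kmeans pipeline. File names are placeholders.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feats = np.load("features.npy")      # (num_frames, feat_dim) dumped features
km = MiniBatchKMeans(n_clusters=500, batch_size=10000, max_iter=100)
km.fit(feats)                        # learn 500 cluster centroids
labels = km.predict(feats)           # one pseudo-label per frame
np.save("labels.npy", labels)
```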
Originally posted by @teowenshen in #1500 (comment)