Google colab speedruns #82
MichelNivard
started this conversation in General
-
You need to implement gradient accumulation. IIRC changing the device count changes the effective batch size. This should work to make the loss reproducible: #29 (comment)
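In case it helps, here is a minimal sketch of what gradient accumulation looks like on a single GPU. The toy model, learning rate, and batch numbers are placeholders to make it runnable, not code from the repo; the point is only that each optimizer step accumulates several micro-batches so the effective batch size matches the multi-GPU run:

```python
import torch
import torch.nn as nn

# Toy model and random data just to make the loop runnable; in the real run these
# would be the nanoGPT model, its optimizers, and the FineWeb data loader.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

grad_accum_steps = 8   # e.g. original world_size / current world_size
micro_batch = 8        # shrink this if the A100 runs out of memory

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    for micro_step in range(grad_accum_steps):
        x = torch.randn(micro_batch, 128, device=device)
        y = torch.randn(micro_batch, 128, device=device)
        loss = nn.functional.mse_loss(model(x), y)
        (loss / grad_accum_steps).backward()   # scale so gradients average over micro-batches
    optimizer.step()
```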
-
Hi all,
I have just started modding the repo (specific commit: d6a7f06) for speedruns on Google Colab.
see: https://github.com/MichelNivard/modded-nanogpt
So far, to make things run at all, I have had to set
fp8=False
on line 330 (the A100 doesn't do FP8). More obviously, I have had to set world_size = 1, and in the run script I have set nproc_per_node = 1, roughly as sketched below.
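For anyone following along, the single-GPU edits amount to something like this. This is a sketch of the changes described above, not exact code from the commit; the script names are assumptions, so adjust them to whatever the repo actually uses:

```python
# In the training script, around the line mentioned above:
fp8 = False          # the A100 has no FP8 tensor cores, so this must be off

# And the distributed setup collapses to a single process:
world_size = 1       # one A100 on Colab instead of the usual multi-GPU node
# Run script: torchrun --standalone --nproc_per_node=1 train_gpt.py
#             (script name is an assumption; use the repo's actual entry point)
```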
These changes lead to a pretty serious setback in validation loss after a set number of steps compared with recent results. For example, at step 1375:
Colab w A100:
step:1375/1770 val_loss:3.9751 train_time:1117031ms step_avg:812.39ms
A recent record (Sub 3 Min):
step:1375/1393 val_loss:3.2820 train_time:177070ms step_avg:129.72ms
Any ideas on how to claw some of that back? Clearly, setting FP8 to False interacts with other mods and affects model quality, not just speed.
I'll add that during training the GPU is under strong memory pressure right now, so hints on how to tweak the batch size would be appreciated as well!
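On the batch-size question, this is the arithmetic I would start from. A sketch only, with illustrative placeholder numbers rather than the commit's real values: keep the effective tokens per optimizer step fixed and trade per-device batch size against accumulation steps to relieve memory pressure.

```python
# Effective tokens per optimizer step should match the multi-GPU record run for
# the loss curves to be comparable. All numbers below are illustrative placeholders.
seq_len           = 1024   # tokens per sequence (check the commit's actual value)
device_batch_size = 64     # sequences per micro-batch on the single A100
grad_accum_steps  = 8      # compensates for world_size dropping from 8 to 1
world_size        = 1

tokens_per_step = seq_len * device_batch_size * grad_accum_steps * world_size
print(f"effective tokens per step: {tokens_per_step:,}")

# If the A100 runs out of memory, halve device_batch_size and double
# grad_accum_steps: the product (and hence the effective batch size) stays the
# same, only wall-clock time per step goes up.
```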