Any changes possible to reduce GPU memory usage? #10
Hi, for a "large" model with around 3 billion parameters, I guess the optimizer is probably not the memory bottleneck compared with the gradient calculation in back-propagation. Can I ask how large your batch size is, and have you tried gradient accumulation?
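For reference, gradient accumulation in PyTorch usually looks something like the sketch below. The tiny model, random data, and accumulation factor are toy placeholders (not anything from this repo); the point is that the effective batch is `micro_batch * accum_steps`, but only one micro-batch of activations is resident at a time.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a tiny linear model and random data in place of the
# real 3B-parameter setup.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4  # arbitrary accumulation factor

optimizer.zero_grad()
for step in range(16):
    inputs = torch.randn(8, 128)            # micro-batch of 8
    targets = torch.randint(0, 10, (8,))
    loss = F.cross_entropy(model(inputs), targets)
    # Scale the loss so the accumulated gradient matches a full batch.
    (loss / accum_steps).backward()         # gradients sum into .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                    # one update per effective batch
        optimizer.zero_grad()
```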
Hi!
Hello again
I suspect that the last line (
Thanks for the updates! If I understand correctly, storing the parameters together with the optimizer states is indeed the memory bottleneck. Since Apollo keeps one more optimizer state than Adam (3 vs. 2), you cannot train the large model with Apollo. What you did was transfer some optimizer states to CPU to save memory, and you found it works even faster! I guess one possible reason is that the GPU may get slow when its memory is close to running out.
Please let me know if you find that Apollo obtains better results on the large model. Thanks!
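As an aside, PyTorch's built-in CUDA memory counters are a quick way to check how much headroom a run has before it slows down or OOMs. A minimal sketch (the matrix multiply is just a placeholder workload):

```python
import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(1024, 1024, device="cuda")  # stand-in for real work
    y = x @ x
    print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    print(f"peak:      {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```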
Hi
I think that Apollo is a great contribution, and I have used it with great success for "small" (about half a billion) parameter models. However, in trying it with a 3 billion parameter model, I have hit GPU memory limits that crash my training run. I've been looking at the code to see if there was something I could do, like deleting variables after they were used in case Python is not releasing them efficiently. However, I don't see a path to major reductions in memory usage. Do you have any suggestions for modifications? Maybe a version that stores optimizer information on CPU rather than GPU, brings it into GPU only when needed for calculations, and then releases the GPU memory when done with it? (A sketch of this idea follows below.)
Thanks in advance!
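One common way to realize this kind of CPU offload is sketched below, under the assumption of a plain PyTorch training loop. This is not Apollo's code: the model, loop, and loss are illustrative, and AdamW stands in for the actual optimizer. The pattern keeps a CPU master copy of the parameters and runs the optimizer step entirely on CPU, so the optimizer states never touch GPU memory; the GPU holds only the model and its gradients.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)

# CPU master parameters; the optimizer updates these, so all of its
# states are allocated on CPU as well.
cpu_params = [p.detach().to("cpu", copy=True).requires_grad_(True)
              for p in model.parameters()]
optimizer = torch.optim.AdamW(cpu_params, lr=1e-3)

for _ in range(4):
    inputs = torch.randn(8, 128, device=device)
    loss = model(inputs).pow(2).mean()
    loss.backward()
    # Copy gradients to CPU, step there, then push updated weights back.
    for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
        cpu_p.grad = gpu_p.grad.detach().cpu()
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            gpu_p.copy_(cpu_p.to(device))
            gpu_p.grad = None
```

The per-step host-device copies cost time, but as noted above, that overhead can be offset when the GPU was otherwise running close to its memory limit.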