Recommended Approach for Gradient Accumulation #2030
Unanswered
sanchit-gandhi asked this question in Q&A
Replies: 1 comment 1 reply
Hi @sanchit-gandhi, right now we don't have any simple examples of gradient accumulation. @melissatan is planning to look into creating a HOWTO explaining how to do this. In the meantime, have you seen Discussion #2008? Someone seems to be running into the same problem as you when using Optax MultiSteps, and they documented their solution in google-deepmind/optax#320. Please let us know if this helps resolve the issue!
I'm working on a training script for a speech model in Flax, and was wondering if I could get the community's opinion on the best way to implement gradient accumulation in JAX/Flax. To my understanding, there are two viable options: Optax MultiSteps, or accumulating gradients by hand.
I initially trialled the simpler of the two, Optax MultiSteps. Keeping the per-device batch size fixed, I was not able to increase the number of gradient accumulation steps beyond 1 (equivalent to no gradient accumulation!). I then implemented a version of gradient accumulation by hand (see here). Once again, I was not able to increase the number of accumulation steps beyond 1 with the per-device batch size fixed. Are there any memory-related caveats with Optax MultiSteps that people have experienced before? And for the custom approach, is there a 'standard' way of going about it? Many thanks for all your help!