Per example gradient without `vmap` #3093

long21wt · 2023-05-10T09:23:04Z

Hi

I hit an out-of-memory (OOM) error when I compute per-example gradients with `vmap`. Is there another way to do this without using `vmap`? For context, I'm using `vmap` together with my own gradient accumulation, which I implement by replacing the state.

grad_fn = jax.vmap(grad_fn, in_axes=(None, 0))
loss, grad = grad_fn(state.params, batch)
state = state.replace(accum_grads=jax.tree_util.tree_map(lambda x, y: x + y, state.accum_grads, grad))

Thanks

cgarciae · 2023-05-10T18:06:50Z

Maintainer

Hey @long21wt, storing per-example gradients sounds costly; try reducing the batch size until it fits in memory.

> I wonder is there another way to do it without using vmap?

The naive way would be to run `grad_fn` for each sample independently in a loop, store the outputs in a list, and use `tree_map` + `jnp.stack` to get the pytrees into the same shapes as the `vmap` version. I doubt this will help though.
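A minimal sketch of that loop, for illustration only. The toy `loss_fn`, the `params` and `batch` dictionaries, and the helper name `per_example_grads_loop` are all hypothetical stand-ins, and `grad_fn` is assumed to be a `jax.value_and_grad` of the loss that returns `(loss, grads)` for a single example:

```python
import jax
import jax.numpy as jnp

# Hypothetical toy loss standing in for the real model; not from the thread.
def loss_fn(params, example):
    pred = example["x"] @ params["w"] + params["b"]
    return jnp.mean((pred - example["y"]) ** 2)

# Assumed shape of grad_fn: returns (loss, grads) for ONE example.
grad_fn = jax.jit(jax.value_and_grad(loss_fn))

def per_example_grads_loop(params, batch):
    """Run grad_fn per example in a Python loop, then stack the results."""
    batch_size = jax.tree_util.tree_leaves(batch)[0].shape[0]
    losses, grads = [], []
    for i in range(batch_size):
        # Slice example i out of every leaf of the batch pytree.
        example = jax.tree_util.tree_map(lambda x: x[i], batch)
        loss, grad = grad_fn(params, example)
        losses.append(loss)
        grads.append(grad)
    # Stack leaf-wise so each leaf gains a leading batch axis, as with vmap.
    stacked = jax.tree_util.tree_map(lambda *gs: jnp.stack(gs), *grads)
    return jnp.stack(losses), stacked

params = {"w": jnp.ones((3,)), "b": jnp.zeros(())}
batch = {"x": jnp.arange(12.0).reshape(4, 3), "y": jnp.ones((4,))}
losses, grads = per_example_grads_loop(params, batch)
```

Note that the stacked result still holds every per-example gradient at once, so peak memory should be in the same range as the `vmap` version. If only the sum is needed for accumulation, adding each `grad` into a running total inside the loop avoids keeping the whole list around.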

1 reply

long21wt May 10, 2023
Author

Thanks, I'll give it a try. The problem shows up particularly for LLMs with more than 300M parameters: with mt5-small I can't go over 45 examples, and with mt5-base I can't go over 18, both on a single NVIDIA A100 80 GB.
