Accumulation of Monte-Carlo gradients within a flax module to avoid OOM error #2301

etienne-thuillier · 2022-07-18T18:35:44Z

etienne-thuillier
Jul 18, 2022

I am currently training a model with an encoder-decoder architecture, and which task is to reconstruct a corrupted version (e.g. cropped, masked) version of the input. The model is stochastic, and the maximum likelihood is estimated using Monte-Carlo sampling of the latent representation distribution.

A much-simplified and lightweight version of the model and train script is provided below to make things concrete.

The problem I face is that, in reality and unlike the given example, the input and activations of my model are of very high dimensionality. Running a training step while drawing a single Monte-Carlo sample (n_samples=1) barely fits into a a100 card with 80 GB of memory.

I have investigated how to bring this memory consumption down, and will be wrapping the encoder with jax's gradient checkpointing (rematerialization) decorator to somewhat mitigate memory consumption under reverse mode automatic differentiation.

More critical however, is that I need to draw a significant quantity of samples from the latent representation, say 16, to accurately estimate the likelihood objective. This implies ~16X the memory increase compared to the simple sample case, and will inevitably to an OOM error.

I would like to circumvent this issue by carrying-out the (16) forward calls to the decoder in sequence (as opposed to parallel) and similarly accumulate the associated gradients during the backward pass. In practice, how can I replace the vmap-based line below from the class Model to carry this out?
mu_x_hat, sigma_x_hat = jax.vmap(decoder, in_axes=0)(z)

It seems that this could be carried out using jax's custom_vjp / custom_jvp functionality, e.g. jax-ml/jax#10131.

However, I am not sure how to carry this out under the flax framework, in particular within a flax module, and how this ties to model parameter management. In particular, I would need to both accumulate the gradients of the decoder's parameter and that of the decoder's input (i.e. samples z), and complete the back-propagation towards the encoder...

from flax.training import train_state
import optax
import jax
import jax.numpy as jnp
from functools import partial

n_in_channels = 1


# much simplified and lightweight version of encoder


class Encoder(nn.Module):
	sigma_floor: float

	@nn.compact
	def __call__(self, x):

		x = nn.Conv(features=2 * n_in_channels, kernel_size=(2, 2))(x)
		x = nn.relu(x)
		x = nn.Conv(features=2 * n_in_channels, kernel_size=(2, 2))(x)

		def to_sigma(z):
			return self.sigma_floor + (1.0 - self.sigma_floor) * nn.softplus(z)

		mu = x[..., :n_in_channels]
		sigma = to_sigma(x[..., n_in_channels:])

		return mu, sigma


# in reality the decoder is different than the encoder, but this isn't relevant to the pending question, hence for
# simplicity:

Decoder = Encoder


class Model(nn.Module):

	@nn.compact
	def __call__(self, masked_x, rng, n_samples):
		"""
		:param masked_x:        array having shape (batch, spatial dim. 1, spatial dim. 2, channels)
		:param rng:             random number generator
		:return:                array having shape (MC sample, batch, spatial dim. 1, spatial dim. 2, channels)
		"""

		mu, sigma = Encoder(sigma_floor=1.0e-3)(masked_x)

		# monte-carlo sampling of latent representation
		z = jax.random.normal(key=rng, shape=(n_samples, *mu.shape))

		# re-parametrisation trick
		z *= jnp.expand_dims(sigma, 0)      # stack monte-carlo (MC) samples along new leading sample dimension
		z += jnp.expand_dims(mu, 0)

		decoder = Decoder(sigma_floor=1.0e-3)

                # how can we edit the following line to save memory?
		mu_x_hat, sigma_x_hat = jax.vmap(decoder, in_axes=0)(z)

		return mu_x_hat, sigma_x_hat


@jax.jit
def log_likelihood(x, mu, sigma):

	return - 0.5 * ((x - mu) ** 2) / (sigma ** 2) - jnp.log(sigma)


@partial(jax.jit, static_argnums=(3,))
def train_step(state, x, rng, n_samples):

	rng, key = jax.random.split(rng, 2)
	mask = jax.random.bernoulli(key, p=0.5, shape=(*x.shape[:-1], 1))

	def loss(params):

		masked_x = mask * x
		mu_x_hat, sigma_x_hat = state.apply_fn({'params': params}, masked_x, rng, n_samples)


		# NB:
		# 	+ mu_x_hat and sigma_x_hat have shape (sample, batch, spatial dim. 1, spatial dim. 2, channel)
		# 	+ x has shape (batch, spatial dim. 1, spatial dim. 2, channel)

		ll = jax.vmap(log_likelihood, in_axes=(0, 0, 0))                    # vmap across batch elements
		ll = jax.vmap(ll, in_axes=(None, 0, 0))(x, mu_x_hat, sigma_x_hat)   # vmap across sampled realisations

		ll = ll.sum((2, 3, 4))                                              # sum over spatial dims and channel
		ll = ll.mean()                                                      # mean over samples and batch

		return -ll

	nll, grad = jax.value_and_grad(loss, argnums=0)(state.params)

	return nll, state.apply_gradients(grads=grad)


def train(seed, n_batches, batch_size, n_samples):

	rng = jax.random.PRNGKey(seed)
	rng, data_key, init_key = jax.random.split(rng, 3)

	# data

	x = jax.random.normal(key=data_key, shape=(n_batches*batch_size, 16, 16, n_in_channels))   # shape is (batch, spatial dim. 1, spatial dim. 2, channel)

	# model

	state = train_state.TrainState.create(
		apply_fn=Model().apply,
		params=Model().init(init_key, jnp.zeros_like(x), rng, n_samples)['params'],
		tx=optax.adam(1.0e-3)
	)

	# train loop

	for step in range(x.shape[0] // batch_size):

		batch = x[step * batch_size:(step + 1) * batch_size, ...]

		rng, key = jax.random.split(rng, 2)

		nll, state = train_step(state, batch, key, n_samples)

		print('step %i: %f' % (step, nll))


if __name__ == '__main__':

	train(seed=42, n_batches=4, batch_size=4, n_samples=16)

Answered by etienne-thuillier

Jul 18, 2022

I had mistakingly posted this question on google/jax's Q&A originally. I wasn't aware of the "transfer discussion" functionality. Hence I copied the post here and deleted the content of the initial post.

In the meantime, @YouJiacheng had posted and answer in the original post, see jax-ml/jax#11528

View full answer

etienne-thuillier · 2022-07-18T19:02:58Z

etienne-thuillier
Jul 18, 2022
Author

I had mistakingly posted this question on google/jax's Q&A originally. I wasn't aware of the "transfer discussion" functionality. Hence I copied the post here and deleted the content of the initial post.

In the meantime, @YouJiacheng had posted and answer in the original post, see jax-ml/jax#11528

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accumulation of Monte-Carlo gradients within a flax module to avoid OOM error #2301

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Accumulation of Monte-Carlo gradients within a flax module to avoid OOM error #2301

etienne-thuillier Jul 18, 2022

Replies: 1 comment

etienne-thuillier Jul 18, 2022 Author

etienne-thuillier
Jul 18, 2022

etienne-thuillier
Jul 18, 2022
Author