Hi Flax community,

I'd like to use `nn.remat` to save memory when using an LSTM to process very long sequences. However, the existing tutorial for `nn.remat` covers deep feedforward networks, not recurrent ones. I tried `nn.remat` on my LSTM, but I saw no change in memory usage, only an increase in runtime. Please see the demo code below. I tried both `seq2seq = True` and `seq2seq = False`, but neither of them works. Could you help me with that? Or should I use `nn.remat_scan` instead?
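(The demo code referenced above did not survive in this copy. The sketch below is a hypothetical reconstruction of the kind of setup being described, assuming a recent `flax.linen` API where `nn.OptimizedLSTMCell` takes `features` in its constructor; the module name `LSTM` is illustrative, not the original code.)

```python
import jax
import flax.linen as nn

class LSTM(nn.Module):
    features: int

    @nn.compact
    def __call__(self, x):  # x: [batch, time, in_features]
        # Scan an LSTM cell over the time axis.
        ScanLSTM = nn.scan(
            nn.OptimizedLSTMCell,
            variable_broadcast="params",
            split_rngs={"params": False},
            in_axes=1, out_axes=1,
        )
        lstm = ScanLSTM(self.features)
        carry = lstm.initialize_carry(jax.random.PRNGKey(0), x[:, 0].shape)
        carry, ys = lstm(carry, x)
        return ys

# remat wrapped around the whole module, i.e. around the entire scan loop:
RematLSTM = nn.remat(LSTM)
```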
Replies: 1 comment 1 reply

-

You are now rematerializing the entire LSTM loop, so the backward pass recomputes the whole loop while its peak memory stays roughly the same as the plain forward pass. For remat to reduce memory, you should rematerialize smaller chunks, for example each cell execution, e.g.:
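(The snippet that originally followed did not survive in this copy. Below is a minimal sketch of the suggested pattern, again assuming the recent `flax.linen` API; wrapping the cell in `nn.remat` before scanning is the part that matters, the rest mirrors the questioner's setup.)

```python
import jax
import flax.linen as nn

class RematCellLSTM(nn.Module):
    features: int

    @nn.compact
    def __call__(self, x):  # x: [batch, time, in_features]
        # remat each cell execution: the checkpoint sits *inside* the scan,
        # so per-step gate activations are recomputed in the backward pass
        # instead of being stored for every step of the sequence.
        ScanLSTM = nn.scan(
            nn.remat(nn.OptimizedLSTMCell),
            variable_broadcast="params",
            split_rngs={"params": False},
            in_axes=1, out_axes=1,
        )
        lstm = ScanLSTM(self.features)
        carry = lstm.initialize_carry(jax.random.PRNGKey(0), x[:, 0].shape)
        carry, ys = lstm(carry, x)
        return ys
```

With this arrangement `nn.scan` still keeps the per-step carries, but the intermediates inside each cell are recomputed rather than stored, which is where the memory savings on long sequences come from (at the cost of some extra compute in the backward pass).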