Replies: 1 comment 1 reply
-
Hey @hr0nix, keep an eye on jax-ml/jax#15783. One idea would be to split + partition the …
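(The suggestion above is cut off. One plausible reading of "split + partition" is to pre-split the PRNG key into one key per example and shard the resulting key array along the batch axis, so each shard can draw its dropout masks from local keys. A minimal sketch of that idea, with illustrative names and sizes that are not from the thread:)

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical sketch: pre-split the PRNG key so each example gets its own
# key, then shard the key array along the batch axis just like the data.
mesh = Mesh(np.array(jax.devices()), ("batch",))
batch_size = 128  # illustrative; matches the shapes mentioned later in the thread

key = jax.random.PRNGKey(0)
per_example_keys = jax.random.split(key, batch_size)   # shape (batch, 2), uint32
per_example_keys = jax.device_put(per_example_keys, NamedSharding(mesh, P("batch")))

# Inside the model, each example's dropout mask is then drawn from its own
# key (e.g. via vmap over the leading axis), so generating random bits needs
# no cross-device communication.
```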
-
Hello!
I'm trying to figure out the best practices for training a model with a stochastic forward pass (say, a transformer decoder with dropout) using pjit.
My train step that I compile using pjit looks something like this:
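(A minimal sketch for illustration; the toy model, plain SGD update, and names like `loss_fn`/`train_step` are assumptions rather than the exact code from the post.)

```python
import jax
import jax.numpy as jnp

def dropout(x, rate, key):
    # Stochastic forward pass: a fresh mask is sampled from `key` at every step.
    keep = jax.random.bernoulli(key, 1.0 - rate, x.shape)
    return jnp.where(keep, x / (1.0 - rate), 0.0)

def loss_fn(params, batch, dropout_key):
    # Toy stand-in for a transformer decoder: one dense layer plus dropout.
    hidden = batch["inputs"] @ params["w_in"]
    hidden = dropout(hidden, rate=0.1, key=dropout_key)
    preds = hidden @ params["w_out"]
    return jnp.mean((preds - batch["targets"]) ** 2)

@jax.jit  # jit and pjit are unified in recent JAX; shardings are taken from the inputs
def train_step(params, batch, rng):
    rng, dropout_key = jax.random.split(rng)
    loss, grads = jax.value_and_grad(loss_fn)(params, batch, dropout_key)
    # Plain SGD update, just to keep the sketch self-contained.
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, rng, loss
```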
For now I shard everything along the batch dimension only, to implement plain data parallelism.
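(Again an illustrative sketch, assuming a 1-D device mesh and the array shapes mentioned later in the post; `NamedSharding` and `PartitionSpec` are the standard JAX sharding APIs.)

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1-D mesh over all devices; only the batch dimension is partitioned.
mesh = Mesh(np.array(jax.devices()), ("batch",))
batch_sharding = NamedSharding(mesh, P("batch"))   # split dim 0, replicate the rest
replicated = NamedSharding(mesh, P())              # parameters live on every device

batch = {
    "inputs": jax.device_put(jnp.zeros((128, 1024, 768)), batch_sharding),
    "targets": jax.device_put(jnp.zeros((128, 1024, 768)), batch_sharding),
}
params = jax.device_put(
    {"w_in": jnp.zeros((768, 768)), "w_out": jnp.zeros((768, 768))}, replicated
)
```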
If I just compile the code as is, everything works but is very slow, because the XLA compiler inserts collective primitives for every dropout, as mentioned at https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html#generating-random-numbers
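(One way to check this, as a hedged aside: dump the compiled HLO and count collective ops. The lowering APIs used below exist in JAX; `train_step`, `params`, `batch`, and `rng` refer to the sketches above.)

```python
rng = jax.random.PRNGKey(0)

# Dump the compiled HLO and count collective ops such as all-gather.
hlo_text = train_step.lower(params, batch, rng).compile().as_text()
for op in ("all-gather", "all-reduce", "all-to-all", "collective-permute"):
    print(op, hlo_text.count(op))
```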
If I set the flag described in that documentation section (see the sketch below) before compiling the code, the collective primitives are no longer present. However, for some reason XLA then decides to allocate large i32 arrays, which I guess store RNG keys for every dropped-out activation. This increases memory consumption significantly, and I can no longer fit a batch of the same size in the memory of a single GPU. This is especially surprising because the transformer blocks are rematerialized, yet as far as I understand XLA intends to preserve these buffers anyway.
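(The exact line is not shown above; the setting that the linked documentation section describes for avoiding these collectives is the partitionable threefry PRNG, so presumably it was something along these lines:)

```python
import jax

# Generate random bits shard-locally instead of materializing the full
# random array in one layout and resharding it with collectives.
jax.config.update("jax_threefry_partitionable", True)
```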
The OOM message that led me to this conclusion (128 x 1024 x 768 is batch x seq_len x embedding_dim):

I have several questions: