Sharding PRNG Keys Across Devices #2021
-
flax/flax/training/common_utils.py, Lines 29 to 34 in cd5c4d7

I have a somewhat conceptual question concerning this use of different PRNG keys across local devices. I'll rely on several simple examples throughout this discussion to help convey my points.

Single Device Training
First, let's assume that we are training a stochastic model on a single device. Given a batch of inputs …

Multiple Device Training
Single PRNG Key …
Sharded PRNG Keys …

This leads me on to my overall question: should we not set the same PRNG key for the model over all devices, to maintain our overall batch size and equivalence to single-device training? For reference, the Hugging Face training script …
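For context, the referenced lines in common_utils.py define shard_prng_key. A minimal sketch of what that helper does (the exact wording at the pinned revision may differ slightly):

```python
import jax

# Roughly what flax.training.common_utils.shard_prng_key does:
# split one PRNG key into one key per local device, so that each
# device draws different random numbers (e.g. for dropout) when the
# keys are sharded alongside the batch under pmap.
def shard_prng_key(prng_key):
    return jax.random.split(prng_key, num=jax.local_device_count())
```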
Replies: 1 comment 3 replies
-
The missing detail here is that even with the same PRNGKey, the randomness is not the same for different items in the batch: if you do dropout(rng, batch), it will drop out different features for each item in the batch. By extension, if you do pmap(dropout(rng, batch)), the mini-batches on each device should also get different dropout masks. If you want to make sure that the random noise is generated in the same way irrespective of the global and local batch size, you should use vmap. For a single input you can then calculate the dropout_rng as …
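The end of the reply is cut off, so the exact per-input formula is lost. As one possible illustration of the vmap approach it describes, the sketch below derives a per-example key from a base key and a hypothetical global example index via jax.random.fold_in, so each example's dropout mask depends only on (base key, index) and not on how examples are grouped into local or global batches; the hand-rolled dropout_single helper is also just for illustration.

```python
import jax
import jax.numpy as jnp

def dropout_single(rng, x, rate=0.1):
    # Dropout for a single (unbatched) input, driven by its own key.
    keep = jax.random.bernoulli(rng, 1.0 - rate, x.shape)
    return jnp.where(keep, x / (1.0 - rate), 0.0)

def dropout_batch(base_rng, xs, example_ids, rate=0.1):
    # One key per example, independent of batch layout: fold the
    # global example index into the base key, then vmap the
    # single-input dropout over the batch.
    per_example_rngs = jax.vmap(lambda i: jax.random.fold_in(base_rng, i))(example_ids)
    return jax.vmap(dropout_single, in_axes=(0, 0, None))(per_example_rngs, xs, rate)

# Usage: an example with the same index gets the same mask regardless
# of which device or mini-batch it ends up in.
base_rng = jax.random.PRNGKey(0)
xs = jnp.ones((8, 16))
ids = jnp.arange(8)
out = dropout_batch(base_rng, xs, ids)
```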