Replies: 1 comment 2 replies
-
Hey @alekhka, I don't know why this would happen in principle from Flax's perspective.
-
Hi!
I adapted the imagenet example code and I am trying to train a Conv-RNN with it. ResNet training works fine. My model looks something like this -
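Roughly along these lines (a simplified stand-in, not my exact model; SimpleConvRNNCell / ConvRNNClassifier are hypothetical names, and the cell is unrolled over timesteps with nn.scan so it only gets traced once):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class SimpleConvRNNCell(nn.Module):
    """Toy Conv-RNN cell: new_h = tanh(Conv(x) + Conv(h))."""
    features: int

    @nn.compact
    def __call__(self, carry, x):
        update = (nn.Conv(self.features, (3, 3))(x)
                  + nn.Conv(self.features, (3, 3), use_bias=False)(carry))
        new_carry = jnp.tanh(update)
        return new_carry, new_carry


class ConvRNNClassifier(nn.Module):
    """Feeds the same frame to the cell for `timesteps` steps, then classifies."""
    features: int = 64
    num_classes: int = 1000
    timesteps: int = 6

    @nn.compact
    def __call__(self, x):
        # x: (batch, H, W, C), NHWC as in the imagenet example.
        carry = jnp.zeros(x.shape[:3] + (self.features,), x.dtype)
        # nn.scan shares the cell's parameters across steps and traces the
        # cell body once, so unrolling does not cost one compilation per step.
        ScanCell = nn.scan(
            SimpleConvRNNCell,
            variable_broadcast="params",
            split_rngs={"params": False},
            in_axes=0,
            out_axes=0,
        )
        xs = jnp.broadcast_to(x[None], (self.timesteps,) + x.shape)
        carry, _ = ScanCell(self.features)(carry, xs)
        pooled = jnp.mean(carry, axis=(1, 2))  # global average pool
        return nn.Dense(self.num_classes)(pooled)


# Quick shape check:
model = ConvRNNClassifier(num_classes=10, timesteps=6)
x = jnp.ones((2, 32, 32, 3))
params = model.init(jax.random.PRNGKey(0), x)
print(model.apply(params, x).shape)  # (2, 10)
```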
I am applying the RNN cell only once during compilation to avoid triggering a compilation for every application of the cell. When I train this with, say, 6 timesteps, the loss goes to NaN quickly, but when I train with fewer timesteps it seems to do fine. Peculiarly, if I train with JAX_DEBUG_NANS=True it trains fine (but a lot slower), with no NaNs even at the larger timestep counts. How can I go about debugging what is causing these NaNs?
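Is something like the following the right direction for localizing them? This is only a toy loss, not my actual training step: jax.debug.print to watch per-timestep magnitudes under jit, and jax.experimental.checkify to turn the first non-finite intermediate into a readable error instead of setting JAX_DEBUG_NANS globally.

```python
import jax
import jax.numpy as jnp
from jax.experimental import checkify


def unrolled_loss(w, x, timesteps=6):
    """Toy unrolled recurrence whose gradient goes NaN once `h` gets large."""
    h = x
    for _ in range(timesteps):
        h = w * h  # activations grow every timestep
        # jax.debug.print works under jit/grad (plain print does not), so it
        # can be used to watch per-timestep magnitudes.
        jax.debug.print("max |h| = {m}", m=jnp.max(jnp.abs(h)))
    # Unstabilized log-sum-exp: exp(h) overflows to inf for large h and the
    # backward pass produces 0 * inf = NaN.
    return jnp.log(jnp.sum(jnp.exp(h)))


# Option 1: flip the same switch as JAX_DEBUG_NANS from inside the script,
# e.g. only around a few suspect steps.
# jax.config.update("jax_debug_nans", True)

# Option 2: checkify reports the first NaN it sees, even inside jit.
grad_fn = jax.grad(unrolled_loss)
checked_grad = checkify.checkify(grad_fn, errors=checkify.float_checks)
err, g = jax.jit(checked_grad)(jnp.float32(20.0), jnp.ones((4,), jnp.float32))
err.throw()  # raises here, pointing at the primitive that produced the NaN
```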