How to fix some layers for transfer learning? #1706
-
Hi, suppose I need to do transfer learning based on a pretrained ResNet50, and I want to keep the first layer fixed (frozen) while allowing the other layers to update. How can I do this in Flax? I noticed #1176, which suggests using https://flax.readthedocs.io/en/latest/flax.optim.html#flax.optim.MultiOptimizer, but I have two concerns. First, flax.optim seems to be in the process of being replaced by optax, and optax does not have a similar API. Second, my project already uses optax, so I would rather not change the code just to freeze a layer; that seems cumbersome, and I think there should be an easier way to do it. The preferred way I can imagine is something like below:
-
Hi @wztdream, note that optax is now the recommended optimizer API (see here). You can freeze a subset of your params by using `optax.multi_transform`. One way of achieving this is to create a mask of your params that assigns one label to trainable parameters and another label to frozen parameters. Consider this simple parameter tree:
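(A minimal stand-in; the module names below are invented for this sketch.)

```python
import jax.numpy as jnp

# Hypothetical parameter tree; the module names are invented for this sketch.
params = {
    'frozen_backbone': {'kernel': jnp.ones((3, 3)), 'bias': jnp.zeros((3,))},
    'trainable_head':  {'kernel': jnp.ones((3, 2)), 'bias': jnp.zeros((2,))},
}
```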
We want to freeze the params of the layers whose names start with `frozen`, so we assign them the label `zero` (the name is arbitrary). The other parameters are assigned the label `adam` (also arbitrary).
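A minimal sketch of a `create_mask` helper that does this labeling, assuming `params` is a plain nested dict keyed by module name as in the tree above:

```python
from flax import traverse_util

def create_mask(params, label_fn):
    # Flatten the nested dict to {('module', ..., 'kernel'): leaf, ...},
    # label each leaf by its top-level module name, then restore the nesting.
    flat = traverse_util.flatten_dict(params)
    return traverse_util.unflatten_dict(
        {path: ('zero' if label_fn(path[0]) else 'adam') for path in flat}
    )
```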
And then:

```python
import jax
import jax.numpy as jnp
import optax

def zero_grads():
    # from https://github.com/deepmind/optax/issues/159#issuecomment-896459491
    def init_fn(_):
        return ()
    def update_fn(updates, state, params=None):
        return jax.tree_map(jnp.zeros_like, updates), ()
    return optax.GradientTransformation(init_fn, update_fn)

tx = optax.multi_transform({'adam': optax.adam(0.1), 'zero': zero_grads()},
                           create_mask(params, lambda s: s.startswith('frozen')))
```

This simply means that all parameters with the label `adam` are optimized with Adam, while all parameters with the label `zero` receive zero gradients and therefore stay unchanged. The only thing you have to do for this to work is give the modules you want to freeze a name that starts with `frozen`.

I recently did something similar, so I created a Colab with a full example: https://colab.research.google.com/drive/1g_pt2Rc3bv6H6qchvGHD-BpgF-Pt4vrC#scrollTo=WWHlukuvIpXb

Hope that helps!
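For completeness, a minimal sketch of how the resulting `tx` could be wired into a training loop; the `train_step` wiring and `loss_fn` below are assumed here, not part of the reply above:

```python
import jax
import optax

opt_state = tx.init(params)

@jax.jit
def train_step(params, opt_state, batch):
    # loss_fn(params, batch) -> scalar is assumed to be defined elsewhere.
    grads = jax.grad(loss_fn)(params, batch)
    updates, opt_state = tx.update(grads, opt_state, params)
    # Leaves labelled 'zero' receive zero updates, so the frozen layers stay unchanged.
    params = optax.apply_updates(params, updates)
    return params, opt_state
```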
-
Hey @matthias-wright, thanks a lot for the Google Colab, it's great! Do you know by any chance whether this can save a significant amount of memory? In your example, if the first layer is frozen, its gradients never need to be computed; ideally the activations of that layer would not even be saved during the forward pass, so that some memory is freed up. Imagine you want to fine-tune a large BERT model and freeze the first 12 layers, training only the final layer. In that case it would be very important that no activations for the first 12 layers are kept, in order to save memory. Is this the case here?