Model Parallelism with TrainState #1988

agemagician · 2022-03-13T00:11:54Z

Hello,

In the HuggingFace library, they use "TrainState" for training non-parallel models, but they could not use it in the model-parallel example:
https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_clm_flax.py
https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/model_parallel/run_clm_mp.py

Is there a reason why TrainState cannot be used with model parallelism?
Would it be possible to provide a minimal Colab example that uses both model parallelism ("PartitionSpec") and "TrainState"?

Answered by marcvanzee

marcvanzee · 2022-03-16T13:15:55Z

Maintainer

There is no reason in principle why TrainState could not be used with model parallelism.

The reason why this doesn't work well (yet) in HuggingFace is, I believe, particular to their API and only applies to very large models. They are actually thinking about this in huggingface/transformers#15766.

If you are looking for a simple example that uses TrainState with pjit, this example may be useful: https://colab.sandbox.google.com/github/marcvanzee/flax/blob/pjit-example/examples/siren/siren.ipynb

@patrickvonplaten

2 replies

patrickvonplaten Mar 16, 2022

Yes working on it :-) cc @patil-suraj

agemagician Mar 16, 2022
Author

Thanks a lot, @marcvanzee for the detailed answer.
@patrickvonplaten hopefully, we will see it soon on HF :)

agemagician · 2022-03-23T21:35:12Z

Author

Hi @marcvanzee,

I am trying to run your Colab example, but it doesn't work.

The following code block:

model = Siren(depth, hidden_layers, out_features)
x = jnp.ones((batch_size, in_features))

# We use the `axis_rules` context manager, which will add metadata to 
# `variables` specifying how to partition each parameter.
with partitioning.axis_rules(rules):
  variables = jax.jit(model.init)(random.PRNGKey(0), x)

params = variables['params']

# We define a train state using the parameters of our variables.
tx = optax.adam(learning_rate)
state = train_state.TrainState.create(
    apply_fn=model.apply, params=params, tx=tx)

Gives the following error:


---------------------------------------------------------------------------
JaxStackTraceBeforeTransformation         Traceback (most recent call last)
[/usr/lib/python3.7/runpy.py](https://localhost:8080/#) in _run_module_as_main(***failed resolving arguments***)
    192     return _run_code(code, main_globals, None,
--> 193                      "__main__", mod_spec)
    194 

49 frames
[/usr/lib/python3.7/runpy.py](https://localhost:8080/#) in _run_code(***failed resolving arguments***)
     84                        __spec__ = mod_spec)
---> 85     exec(code, run_globals)
     86     return run_globals

[/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py](https://localhost:8080/#) in <module>()
     15     from ipykernel import kernelapp as app
---> 16     app.launch_new_instance()

[/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py](https://localhost:8080/#) in launch_instance(***failed resolving arguments***)
    845         app.initialize(argv)
--> 846         app.start()
    847 

[/usr/local/lib/python3.7/dist-packages/ipykernel/kernelapp.py](https://localhost:8080/#) in start(***failed resolving arguments***)
    498         try:
--> 499             self.io_loop.start()
    500         except KeyboardInterrupt:

[/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py](https://localhost:8080/#) in start(***failed resolving arguments***)
    131             asyncio.set_event_loop(self.asyncio_loop)
--> 132             self.asyncio_loop.run_forever()
    133         finally:

[/usr/lib/python3.7/asyncio/base_events.py](https://localhost:8080/#) in run_forever(***failed resolving arguments***)
    540             while True:
--> 541                 self._run_once()
    542                 if self._stopping:

[/usr/lib/python3.7/asyncio/base_events.py](https://localhost:8080/#) in _run_once(***failed resolving arguments***)
   1785             else:
-> 1786                 handle._run()
   1787         handle = None  # Needed to break cycles when an exception occurs.

[/usr/lib/python3.7/asyncio/events.py](https://localhost:8080/#) in _run(***failed resolving arguments***)
     87         try:
---> 88             self._context.run(self._callback, *self._args)
     89         except Exception as exc:

[/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py](https://localhost:8080/#) in _handle_events(***failed resolving arguments***)
    121         fileobj, handler_func = self.handlers[fd]
--> 122         handler_func(fileobj, events)
    123 

[/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py](https://localhost:8080/#) in null_wrapper(***failed resolving arguments***)
    299                 _state.contexts = cap_contexts[0]
--> 300                 return fn(*args, **kwargs)
    301             finally:

[/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py](https://localhost:8080/#) in _handle_events(***failed resolving arguments***)
    451             if zmq_events & zmq.POLLIN and self.receiving():
--> 452                 self._handle_recv()
    453                 if not self.socket:

[/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py](https://localhost:8080/#) in _handle_recv(***failed resolving arguments***)
    480                 callback = self._recv_callback
--> 481                 self._run_callback(callback, msg)
    482 

[/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py](https://localhost:8080/#) in _run_callback(***failed resolving arguments***)
    430             # inside our blanket exception handler rather than outside.
--> 431             callback(*args, **kwargs)
    432         except Exception:

[/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py](https://localhost:8080/#) in null_wrapper(***failed resolving arguments***)
    299                 _state.contexts = cap_contexts[0]
--> 300                 return fn(*args, **kwargs)
    301             finally:

[/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py](https://localhost:8080/#) in dispatcher(***failed resolving arguments***)
    282             def dispatcher(msg):
--> 283                 return self.dispatch_shell(stream, msg)
    284             return dispatcher

[/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py](https://localhost:8080/#) in dispatch_shell(***failed resolving arguments***)
    232             try:
--> 233                 handler(stream, idents, msg)
    234             except Exception:

[/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py](https://localhost:8080/#) in execute_request(***failed resolving arguments***)
    398         reply_content = self.do_execute(code, silent, store_history,
--> 399                                         user_expressions, allow_stdin)
    400 

[/usr/local/lib/python3.7/dist-packages/ipykernel/ipkernel.py](https://localhost:8080/#) in do_execute(***failed resolving arguments***)
    207         try:
--> 208             res = shell.run_cell(code, store_history=store_history, silent=silent)
    209         finally:

[/usr/local/lib/python3.7/dist-packages/ipykernel/zmqshell.py](https://localhost:8080/#) in run_cell(***failed resolving arguments***)
    536         self._last_traceback = None
--> 537         return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
    538 

[/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py](https://localhost:8080/#) in run_cell(***failed resolving arguments***)
   2717                 has_raised = self.run_ast_nodes(code_ast.body, cell_name,
-> 2718                    interactivity=interactivity, compiler=compiler, result=result)
   2719 

[/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py](https://localhost:8080/#) in run_ast_nodes(***failed resolving arguments***)
   2821                 code = compiler(mod, cell_name, "exec")
-> 2822                 if self.run_code(code, result):
   2823                     return True

[/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py](https://localhost:8080/#) in run_code(***failed resolving arguments***)
   2881                 #rprint('Running code', repr(code_obj)) # dbg
-> 2882                 exec(code_obj, self.user_global_ns, self.user_ns)
   2883             finally:

[<ipython-input-12-0ca9b08a175d>](https://localhost:8080/#) in <module>()
      6 with partitioning.axis_rules(rules):
----> 7   variables = jax.jit(model.init)(random.PRNGKey(0), x)
      8 

[<ipython-input-7-2ea0a75cb196>](https://localhost:8080/#) in __call__(***failed resolving arguments***)
     49     sine_dense = partial(SineDense, omega_0=self.omega_0, depth=self.depth)
---> 50     y = sine_dense(is_first=True, axes=self.axes[:2])(x)
     51 

[<ipython-input-7-2ea0a75cb196>](https://localhost:8080/#) in __call__(***failed resolving arguments***)
     28     y = DenseP(self.depth, kernel_init=kernel_init, 
---> 29                bias_init=bias_init(in_features), axes=self.axes)(x)
     30     y = jnp.sin(y * self.omega_0)

[<ipython-input-6-fde0c5daf376>](https://localhost:8080/#) in __call__(***failed resolving arguments***)
     54                              (inputs.shape[-1], self.features),
---> 55                              axes=self.axes)
     56     kernel = jnp.asarray(kernel, self.dtype)

[/usr/local/lib/python3.7/dist-packages/flax/linen/partitioning.py](https://localhost:8080/#) in param_with_axes(***failed resolving arguments***)
    293     module_param = with_sharding_constraint(module_param,
--> 294                                             pjit.PartitionSpec(*axes))
    295     # record logical axis constraint for global axis metadata

[/usr/local/lib/python3.7/dist-packages/flax/linen/partitioning.py](https://localhost:8080/#) in with_sharding_constraint(***failed resolving arguments***)
    206                                 is_leaf=lambda x: isinstance(x, tuple))
--> 207   return _with_sharding_constraint(x, axis_resources)
    208 

[/usr/local/lib/python3.7/dist-packages/flax/linen/partitioning.py](https://localhost:8080/#) in _with_sharding_constraint(***failed resolving arguments***)
    191   else:
--> 192     return pjit.with_sharding_constraint(x, axis_resources)
    193 

[/usr/local/lib/python3.7/dist-packages/jax/experimental/pjit.py](https://localhost:8080/#) in with_sharding_constraint(***failed resolving arguments***)
    948   outs = [sharding_constraint_p.bind(y, axis_resources=r, resource_env=resource_env)
--> 949           for y, r in safe_zip(x_flat, axis_resources_flat)]
    950   return tree_unflatten(tree, outs)

[/usr/local/lib/python3.7/dist-packages/jax/experimental/pjit.py](https://localhost:8080/#) in <listcomp>(***failed resolving arguments***)
    948   outs = [sharding_constraint_p.bind(y, axis_resources=r, resource_env=resource_env)
--> 949           for y, r in safe_zip(x_flat, axis_resources_flat)]
    950   return tree_unflatten(tree, outs)

JaxStackTraceBeforeTransformation: AttributeError: 'ReplicaAxisContext' object has no attribute 'manual_axes'

The preceding stack trace is the source of the JAX operation that, once transformed by JAX, triggered the following exception.

--------------------

The above exception was the direct cause of the following exception:

UnfilteredStackTrace                      Traceback (most recent call last)
[<ipython-input-12-0ca9b08a175d>](https://localhost:8080/#) in <module>()
      6 with partitioning.axis_rules(rules):
----> 7   variables = jax.jit(model.init)(random.PRNGKey(0), x)
      8 

[/usr/local/lib/python3.7/dist-packages/jax/_src/traceback_util.py](https://localhost:8080/#) in reraise_with_filtered_traceback(*args, **kwargs)
    161     try:
--> 162       return fun(*args, **kwargs)
    163     except Exception as e:

[/usr/local/lib/python3.7/dist-packages/jax/_src/api.py](https://localhost:8080/#) in cache_miss(*args, **kwargs)
    434         device=device, backend=backend, name=flat_fun.__name__,
--> 435         donated_invars=donated_invars, inline=inline)
    436     out_pytree_def = out_tree()

[/usr/local/lib/python3.7/dist-packages/jax/core.py](https://localhost:8080/#) in bind(self, fun, *args, **params)
   1708   def bind(self, fun, *args, **params):
-> 1709     return call_bind(self, fun, *args, **params)
   1710 

[/usr/local/lib/python3.7/dist-packages/jax/core.py](https://localhost:8080/#) in call_bind(primitive, fun, *args, **params)
   1720   tracers = map(top_trace.full_raise, args)
-> 1721   outs = top_trace.process_call(primitive, fun, tracers, params)
   1722   return map(full_lower, apply_todos(env_trace_todo(), outs))

[/usr/local/lib/python3.7/dist-packages/jax/core.py](https://localhost:8080/#) in process_call(self, primitive, f, tracers, params)
    613   def process_call(self, primitive, f, tracers, params):
--> 614     return primitive.impl(f, *tracers, **params)
    615   process_map = process_call

[/usr/local/lib/python3.7/dist-packages/jax/_src/dispatch.py](https://localhost:8080/#) in _xla_call_impl(***failed resolving arguments***)
    142   compiled_fun = _xla_callable(fun, device, backend, name, donated_invars,
--> 143                                *unsafe_map(arg_spec, args))
    144   try:

[/usr/local/lib/python3.7/dist-packages/jax/linear_util.py](https://localhost:8080/#) in memoized_fun(fun, *args)
    271     else:
--> 272       ans = call(fun, *args)
    273       cache[key] = (ans, fun.stores)

[/usr/local/lib/python3.7/dist-packages/jax/_src/dispatch.py](https://localhost:8080/#) in _xla_callable_uncached(fun, device, backend, name, donated_invars, *arg_specs)
    169   return lower_xla_callable(fun, device, backend, name, donated_invars,
--> 170                             *arg_specs).compile().unsafe_call
    171 

[/usr/local/lib/python3.7/dist-packages/jax/_src/profiler.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    205     with TraceAnnotation(name, **decorator_kwargs):
--> 206       return func(*args, **kwargs)
    207     return wrapper

[/usr/local/lib/python3.7/dist-packages/jax/_src/dispatch.py](https://localhost:8080/#) in lower_xla_callable(fun, device, backend, name, donated_invars, *arg_specs)
    259         module_name, closed_jaxpr, backend.platform,
--> 260         mlir.ReplicaAxisContext(axis_env), name_stack, donated_invars)
    261   else:

[/usr/local/lib/python3.7/dist-packages/jax/interpreters/mlir.py](https://localhost:8080/#) in lower_jaxpr_to_module(module_name, jaxpr, platform, axis_context, name_stack, donated_args, replicated_args, arg_shardings, result_shardings)
    493         arg_shardings=arg_shardings, result_shardings=result_shardings,
--> 494         input_output_aliases=input_output_aliases)
    495 

[/usr/local/lib/python3.7/dist-packages/jax/interpreters/mlir.py](https://localhost:8080/#) in lower_jaxpr_to_fun(ctx, name, jaxpr, public, replace_units_with_dummy, replace_tokens_with_dummy, replicated_args, arg_shardings, result_shardings, input_output_aliases)
    636                              jaxpr.jaxpr, map(ir_constants, jaxpr.consts),
--> 637                              *args)
    638     outs = []

[/usr/local/lib/python3.7/dist-packages/jax/interpreters/mlir.py](https://localhost:8080/#) in jaxpr_subcomp(ctx, jaxpr, consts, *args)
    722       ans = rule(rule_ctx, *map(_unwrap_singleton_ir_values, in_nodes),
--> 723                  **eqn.params)
    724 

[/usr/local/lib/python3.7/dist-packages/jax/experimental/pjit.py](https://localhost:8080/#) in _sharding_constraint_mhlo_lowering(ctx, x_node, axis_resources, resource_env)
    991               ctx.module_context.axis_context,
--> 992               allow_uneven_axes=True),
    993           unspecified_dims=get_unconstrained_dims(axis_resources))

[/usr/local/lib/python3.7/dist-packages/jax/experimental/pjit.py](https://localhost:8080/#) in get_aval_sharding_proto(aval, axis_resources, mesh, axis_ctx, allow_uneven_axes)
   1039     axis_names = mesh.axis_names
-> 1040     for manual_axis in axis_ctx.manual_axes:
   1041       special_axes[axis_names.index(manual_axis)] = xc.OpSharding.Type.MANUAL

UnfilteredStackTrace: AttributeError: 'ReplicaAxisContext' object has no attribute 'manual_axes'

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
[<ipython-input-12-0ca9b08a175d>](https://localhost:8080/#) in <module>()
      5 # `variables` specifying how to partition each parameter.
      6 with partitioning.axis_rules(rules):
----> 7   variables = jax.jit(model.init)(random.PRNGKey(0), x)
      8 
      9 params = variables['params']

[/usr/local/lib/python3.7/dist-packages/jax/experimental/pjit.py](https://localhost:8080/#) in _sharding_constraint_mhlo_lowering(ctx, x_node, axis_resources, resource_env)
    990               mesh,
    991               ctx.module_context.axis_context,
--> 992               allow_uneven_axes=True),
    993           unspecified_dims=get_unconstrained_dims(axis_resources))
    994   ]

[/usr/local/lib/python3.7/dist-packages/jax/experimental/pjit.py](https://localhost:8080/#) in get_aval_sharding_proto(aval, axis_resources, mesh, axis_ctx, allow_uneven_axes)
   1038   if axis_ctx is not None:
   1039     axis_names = mesh.axis_names
-> 1040     for manual_axis in axis_ctx.manual_axes:
   1041       special_axes[axis_names.index(manual_axis)] = xc.OpSharding.Type.MANUAL
   1042   return sharding_spec.sharding_proto(special_axes=special_axes)

AttributeError: 'ReplicaAxisContext' object has no attribute 'manual_axes'

I have tried different versions of JAX, but it always gives the same error.

Any idea what the problem is and how to fix it?

7 replies

jheek Mar 24, 2022
Maintainer

Setting everything to None will work here although it might eventually run out of memory when the number of params is greater than the memory capacity of one GPU. Eventually you probably want to use jax.eval_shape(pjit(model.init, None, None)) to get the param_spec before doing the actual init.
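
To illustrate the jax.eval_shape part of this suggestion in isolation, here is a minimal sketch. It leaves out pjit entirely, and `init_params` is a hypothetical stand-in for `model.init`, not the Colab's code. The point is only that eval_shape returns the shapes and dtypes of the outputs without allocating the parameters, so the parameter tree can be inspected before the real init.

```python
import jax
import jax.numpy as jnp


def init_params(key, x):
    """Hypothetical stand-in for `model.init`: builds one dense layer."""
    k_kernel, _ = jax.random.split(key)
    kernel = jax.random.normal(k_kernel, (x.shape[-1], 16))
    bias = jnp.zeros((16,))
    return {"params": {"kernel": kernel, "bias": bias}}


x = jnp.ones((4, 8))

# eval_shape traces `init_params` abstractly: it returns a pytree of
# jax.ShapeDtypeStruct leaves and never materializes the parameter arrays.
shapes = jax.eval_shape(init_params, jax.random.PRNGKey(0), x)

kernel_shape = shapes["params"]["kernel"].shape
bias_shape = shapes["params"]["bias"].shape
print(kernel_shape, bias_shape)
```

In the real setup, the resulting tree of shapes would then be used to derive one partition spec per parameter before running the sharded init.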

agemagician Mar 24, 2022
Author

I understand the memory issue, but I don't understand the changes needed to solve it.
Could you please share the re-written code that solves this issue for this block of code in the colab example:

model = Siren(depth, hidden_layers, out_features)
x = jnp.ones((batch_size, in_features))

# We use the `axis_rules` context manager, which will add metadata to 
# `variables` specifying how to partition each parameter.
with partitioning.axis_rules(rules):
    variables = jax.jit(pjit(model.init, in_axis_resources=None,
        out_axis_resources=None))(random.PRNGKey(0), x)

params = variables['params']

# We define a train state using the parameters of our variables.
tx = optax.adam(learning_rate)
state = train_state.TrainState.create(
    apply_fn=model.apply, params=params, tx=tx)

jheek Mar 24, 2022
Maintainer

I haven't experimented with this myself so I'll wait for @marcvanzee to update the Colab fully. But here's the general (untested) idea:

model = Siren(depth, hidden_layers, out_features)
x = jnp.ones((batch_size, in_features))

# We use the `axis_rules` context manager, which will add metadata to 
# `variables` specifying how to partition each parameter.
with partitioning.axis_rules(rules):
    variables_shapes = jax.eval_shape(pjit(model.init, in_axis_resources=None,
        out_axis_resources=None), random.PRNGKey(0), x)
params_axes = partitioning.get_axis_names(variable_shapes['params_axes'])
vars_pspec = jax.tree_map(lambda x: PartitionSpec(*(rd[k] for k in x)), params_axes)
with partitioning.axis_rules(rules):
    variables = jax.jit(pjit(model.init, in_axis_resources=None,
        out_axis_resources=vars_pspec))(random.PRNGKey(0), x)

agemagician Mar 24, 2022
Author

Thanks a lot for providing the code.
I have tested it and it does not work; there are some errors in the code.
Some I was able to fix, such as "variable_shapes", which should be "variables_shapes", but I could not fix "(*(rd[k] for k in x))" because "rd" is not defined anywhere and I don't know what it should be.

I will wait for an updated and tested version from you or from @marcvanzee.

Thanks again for your effort.

jheek Mar 24, 2022
Maintainer

For rd it's just rd = {k:v for k,v in rules}
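
As a rough sketch of what that lookup does, assuming `rules` is a sequence of (logical axis name, mesh axis) pairs: the rule names and the `kernel_axes` tuple below are hypothetical and not taken from the Colab, and a plain tuple stands in for PartitionSpec so the snippet has no dependencies.

```python
# Hypothetical rules: (logical axis name, mesh axis name or None) pairs.
# These names are illustrative and are not the ones from the Colab.
rules = (
    ("batch", "data"),
    ("hidden", "model"),
    ("out", None),
)

# Lookup table from logical axis name to mesh axis.
rd = {k: v for k, v in rules}

# Hypothetical logical axes of one kernel parameter.
kernel_axes = ("hidden", "out")

# The same per-parameter translation the lambda above performs, with a plain
# tuple standing in for PartitionSpec to keep the sketch dependency-free.
kernel_mesh_axes = tuple(rd[k] for k in kernel_axes)
print(kernel_mesh_axes)
```

Note that if a logical name appears more than once in `rules`, the dict keeps only the last pair for that name.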
