-
Hey @srossi93, I was playing around with your code. Since you didn't provide a full snippet, I had to guess how you initialized the model, and I also changed a few things. I currently can't offer any advice, since profiling is hard. Here is my code, in case it's of any use:
```python
#%%
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec
import flax.linen as nn

# 2-D device mesh: one 'model' entry per local device, trivial 'data' axis.
device_mesh = mesh_utils.create_device_mesh((jax.local_device_count(), 1))
mesh = Mesh(devices=device_mesh, axis_names=('model', 'data'))
print(mesh)

#%%
class MLP(nn.Module):
    features: int = 5000

    @nn.compact
    def __call__(self, x):
        y = nn.Dense(
            self.features,
            use_bias=False,
            kernel_init=nn.spmd.with_logical_partitioning(
                nn.initializers.xavier_normal(), ('input', 'hidden')),
        )(x)
        y = nn.spmd.with_logical_constraint(y, ('batch', None))
        y = nn.Dense(
            1,
            use_bias=False,
            kernel_init=nn.spmd.with_logical_partitioning(
                nn.initializers.xavier_normal(), ('hidden', 'output')),
        )(y)
        y = y.reshape(-1)
        y = nn.spmd.with_logical_constraint(y, ('batch',))
        return y

# Lifted vmap: one set of parameters per ensemble member, input replicated.
EnsembleMLP = nn.vmap(
    MLP,
    in_axes=None,
    out_axes=1,
    axis_size=jax.local_device_count(),
    variable_axes={'params': 0},
    metadata_params={nn.PARTITION_NAME: 'ensemble'},
    split_rngs={'params': True},
)
model = EnsembleMLP()

# Map logical axis names to mesh axes.
rules = (('ensemble', 'model'), ('batch', 'data'))

#%% Partition Data
data_spec = PartitionSpec()  # fully replicated input
x = jax.random.normal(jax.random.PRNGKey(0), (8, 20))
rng = jax.random.PRNGKey(0)
x = jax.device_put(x, NamedSharding(mesh, data_spec))
print(x.shape)
jax.debug.visualize_array_sharding(x)

#%% Partition Model
@jax.jit
def create_variables():
    variables = model.init(rng, x)
    spec = nn.get_partition_spec(variables)
    mesh_spec = nn.logical_to_mesh(spec, rules)
    variables = nn.unbox(variables)
    variables = jax.tree_map(
        lambda p, s: jax.lax.with_sharding_constraint(p, NamedSharding(mesh, s)),
        variables, mesh_spec)
    return variables

variables = create_variables()
kernel = variables['params']['Dense_0']['kernel']
kernel = kernel.reshape(kernel.shape[0], -1)
print(kernel.shape)
jax.debug.visualize_array_sharding(kernel)

# %%
@jax.jit
def forward(variables, x):
    return model.apply(variables, x)

y = forward(variables, x)
print(y.shape)
jax.debug.visualize_array_sharding(y)

# %%
```
-
Hi,
So, I'm trying to play around with the sharding API (#2730), but I'm seeing unexpected results regarding the memory consumption of the GPUs.
The simple setup I want to investigate is ensembling a few models, where the sharding happens at the model level (i.e. all model parameters and outputs are sharded across the GPUs, while the input is replicated, see below).
This is my setup. Let's start by creating the device mesh and the `Mesh`:
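(The code blocks that follow are minimal sketches based on the reply above; the mesh shape and axis names are assumptions.)

```python
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh

# Assumed layout: one 'model' entry per local device, a trivial 'data' axis of size 1.
device_mesh = mesh_utils.create_device_mesh((jax.local_device_count(), 1))
mesh = Mesh(devices=device_mesh, axis_names=('model', 'data'))
```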
and a simple MLP
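(A sketch of the MLP, again based on the reply; the logical axis names 'input', 'hidden', 'output', 'batch' are assumptions.)

```python
import flax.linen as nn

class MLP(nn.Module):
    features: int = 5000

    @nn.compact
    def __call__(self, x):
        # Logical axis names on the kernels let them be mapped to mesh axes later.
        y = nn.Dense(
            self.features, use_bias=False,
            kernel_init=nn.with_logical_partitioning(
                nn.initializers.xavier_normal(), ('input', 'hidden')))(x)
        y = nn.with_logical_constraint(y, ('batch', None))
        y = nn.Dense(
            1, use_bias=False,
            kernel_init=nn.with_logical_partitioning(
                nn.initializers.xavier_normal(), ('hidden', 'output')))(y)
        return nn.with_logical_constraint(y.reshape(-1), ('batch',))
```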
Now let's use the lifted vmap to ensemble models
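(A sketch of the lifted vmap; the 'ensemble' axis name and ensemble size are taken from the reply, not from the original post.)

```python
EnsembleMLP = nn.vmap(
    MLP,
    in_axes=None,                    # replicate the input to every ensemble member
    out_axes=1,
    axis_size=jax.local_device_count(),
    variable_axes={'params': 0},     # each member gets its own parameters
    split_rngs={'params': True},
    metadata_params={nn.PARTITION_NAME: 'ensemble'},  # name of the new logical axis
)
model = EnsembleMLP()
```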
and a few extra bits for mapping the logical axes.
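(A sketch of the axis-mapping rules plus a sharded init helper; the rule names and the helper `create_variables` are reconstructed from the reply.)

```python
from jax.sharding import NamedSharding

# 'ensemble' is sharded over the 'model' mesh axis, 'batch' over 'data'.
rules = (('ensemble', 'model'), ('batch', 'data'))

@jax.jit
def create_variables(rng, x):
    variables = model.init(rng, x)
    logical_spec = nn.get_partition_spec(variables)
    mesh_spec = nn.logical_to_mesh(logical_spec, rules)
    variables = nn.unbox(variables)
    return jax.tree_map(
        lambda p, s: jax.lax.with_sharding_constraint(p, NamedSharding(mesh, s)),
        variables, mesh_spec)
```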
Now, if I inspect the various parameters and variables with `jax.debug.visualize_array_sharding`, I see that they are correctly sharded (for example, the output of the model is sharded across the devices). If I understood everything right, I should see reduced memory usage per GPU when I move the ensemble from 1 GPU to 2 GPUs, but this is not really the case (I only go from 2447MB to 2193MB on each).
This becomes clearer when the ensemble doesn't fit on a single device but should fit across two: counterintuitively, in that scenario (p)jit fails with OOM in both cases.
Can you help me understand this? Am I missing something obvious?
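(One way to pin down the per-device numbers, as a sketch; `Device.memory_stats()` is backend-dependent and may be unavailable, e.g. on CPU.)

```python
for d in jax.local_devices():
    stats = d.memory_stats()  # may be None on backends without support
    if stats:
        print(d, stats.get('bytes_in_use'), stats.get('peak_bytes_in_use'))
```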