
Logic Inconsistency in ScatterMoE during Expert Parallel #121

@fabianlim

Description

@willmj I noticed there is some inconsistency in the logic, although the behavior is correct:

  1. When creating the ScatterMoE we use num_experts_per_device. When ep_degree > 1, this results in the router weights having num_experts_per_device outputs.
  2. But the router weights need to be replicated across devices, and this does happen in load_experts_onto_device, because the state_dict sd loaded there always results in the full-sized router (a minimal sketch of this follows the list).
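
To make the sequence concrete, here is a minimal sketch using hypothetical numbers (40 experts in total, ep_degree = 2, hidden size 1536); the real load path goes through load_experts_onto_device, which is stood in for here by a direct parameter assignment:

```python
import torch
import torch.nn as nn

# Hypothetical numbers matching the pdb session below:
# 40 experts in total, ep_degree = 2, hidden size 1536.
num_experts, ep_degree, hidden = 40, 2, 1536
num_experts_per_device = num_experts // ep_degree  # 20

# Step 1: the router is built with num_experts_per_device outputs.
router = nn.Linear(hidden, num_experts_per_device, bias=False)
print(router)  # Linear(in_features=1536, out_features=20, bias=False)

# Step 2: load_experts_onto_device later assigns the full-sized router weight
# from the checkpoint state_dict (stood in for here by a random tensor).
router.weight = nn.Parameter(torch.randn(num_experts, hidden))

# The module metadata is now stale: the repr still reports out_features=20,
# but the parameter holds the full 40-expert router.
print(router)               # Linear(in_features=1536, out_features=20, bias=False)
print(router.weight.shape)  # torch.Size([40, 1536])
```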

So we end up with this inconsistency:

```
(Pdb) mod
Linear(in_features=1536, out_features=20, bias=False)
(Pdb) mod.weight.shape
torch.Size([40, 1536])
```
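
For completeness, a self-contained sketch (hypothetical shapes) of why the forward pass stays correct despite the stale attribute: nn.Linear.forward computes F.linear(x, self.weight, self.bias), so only the weight tensor's shape determines the output size, and only the repr / out_features introspection is misleading:

```python
import torch
import torch.nn as nn

router = nn.Linear(1536, 20, bias=False)             # built with num_experts_per_device
router.weight = nn.Parameter(torch.randn(40, 1536))  # overwritten with the full router

x = torch.randn(4, 1536)
print(router(x).shape)      # torch.Size([4, 40]) -- logits cover all 40 experts
print(router.out_features)  # 20 -- stale attribute, never used in the computation
```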


Labels: bug (Something isn't working), help wanted (Extra attention is needed)
