Open
Labels
bug, help wanted
Description
@willmj I noticed there is some inconsistency in the logic, although the behavior is correct
- When creating the ScatterMoE we use `num_experts_per_device`. In the case `ep_degree > 1`, this results in the `router` weights having `num_experts_per_device` outputs.
- But the `router` weights need to be replicated across devices, and this does happen in `load_experts_onto_device`, because the state dict `sd` loaded here will always result in the full-sized router.
So we end up with this inconsistency, where the module reports `out_features=20` but its weight has 40 rows:
(Pdb) mod
Linear(in_features=1536, out_features=20, bias=False)
(Pdb) mod.weight.shape
torch.Size([40, 1536])
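
A minimal sketch of how this state can arise in plain PyTorch. It is an illustration, not the actual ScatterMoE or `load_experts_onto_device` code: the values `num_experts = 40` and `ep_degree = 2` are assumptions chosen to match the shapes in the pdb output, and direct parameter assignment is assumed as the loading mechanism because it skips the shape check that `load_state_dict` would perform.

```python
import torch
from torch import nn

# Assumed values, chosen only to reproduce the shapes shown above.
num_experts = 40
ep_degree = 2
num_experts_per_device = num_experts // ep_degree  # 20
hidden_size = 1536

# The router is constructed with the per-device expert count.
router = nn.Linear(hidden_size, num_experts_per_device, bias=False)

# Stand-in for the full-sized router weight found in the checkpoint.
full_router_weight = torch.randn(num_experts, hidden_size)

# Assigning the parameter directly performs no shape validation, so the
# module's recorded out_features and its actual weight shape can diverge.
router.weight = nn.Parameter(full_router_weight)

print(router)               # repr is built from the stale out_features attribute
print(router.weight.shape)  # the actual weight is full-sized

# The forward pass uses the weight tensor, not out_features, so routing
# logits still cover all experts.
logits = router(torch.randn(3, hidden_size))
print(logits.shape)
```

This is consistent with the behavior being correct despite the mismatch: the forward pass depends only on the weight tensor, while `out_features` is just metadata set at construction time. Constructing the router with the full expert count would keep the two in agreement.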