MoE performs worse than equivalent dense model? #253
Comments
Great catch, thanks @Muennighoff! I think this is because the MoE defaults from megablocks differ from our default dense model in at least two ways: their FFN uses GeLU without gating (whereas ours uses SwiGLU), and their weight initialization differs from ours. I'm not sure what the easiest way to reconcile these is. Probably we want a custom version of the MLP class here https://github.com/databricks/megablocks/blob/2724ff6775ee7e2a41001a7979c0ec84c417cd84/megablocks/layers/mlp.py#L81-L137 that implements SwiGLU and our init function.
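For anyone following along, a minimal sketch of what such a per-expert MLP could look like (SwiGLU activation plus an explicit init hook), assuming a plain PyTorch module. The class name, gating layout, and init scale below are illustrative placeholders, not the actual megablocks or open_lm code:

```python
# Hypothetical sketch of a gated (SwiGLU) feed-forward block with an explicit
# init function, illustrating the two differences mentioned above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUMLP(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # Gated FFN: one fused projection producing value + gate, one output projection.
        self.w12 = nn.Linear(dim, 2 * hidden_dim, bias=False)
        self.w3 = nn.Linear(hidden_dim, dim, bias=False)
        self.reset_parameters()

    def reset_parameters(self):
        # Placeholder init; a real port would call open_lm's init function here instead.
        std = self.w12.in_features ** -0.5
        nn.init.trunc_normal_(self.w12.weight, std=std, a=-3 * std, b=3 * std)
        nn.init.trunc_normal_(self.w3.weight, std=std, a=-3 * std, b=3 * std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate) * value, instead of a plain GeLU MLP.
        value, gate = self.w12(x).chunk(2, dim=-1)
        return self.w3(F.silu(gate) * value)
```

The point is only that both the activation/gating and the init would need to be swapped in for the MoE experts to match the dense baseline.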
For reference, this is the code I am running.

No MoE:

MoE w/ 8 experts:
Afaict, for the numbers reported in #115 (comment), the "1 expert" model is still an MoE, correct?
I also get the result that the 8-expert MoE is better than the 1-expert one; however, both are worse than a dense model. In the graph below, OpenLM-41M is a 41M dense model, and the above two curves are 8-expert & 1-expert models with 41M active parameters. I would expect the 1-expert model to roughly match the dense one & the 8-expert to be better than both, but maybe I am missing something? @kernelmachine @sagadre
(My setup follows the main README & https://github.com/mlfoundations/open_lm/blob/main/MOE.md)
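To make the "active parameters" expectation concrete: with top-k routing, only k experts' FFN weights are used per token, so a 1-expert MoE should have the same active FFN size as a dense model of matching width. A rough back-of-the-envelope sketch (the layer sizes and the two-matrix FFN here are made-up assumptions, not the actual 41M config):

```python
# Rough illustration of "active" vs. total parameters in an MoE FFN layer.
# The dimensions below are placeholders, not the real 41M model config.
def ffn_params(dim: int, hidden_dim: int) -> int:
    # Parameters of one feed-forward expert (two projection matrices, no bias).
    return 2 * dim * hidden_dim

dim, hidden_dim = 512, 2048
dense = ffn_params(dim, hidden_dim)

for num_experts, top_k in [(1, 1), (8, 1)]:
    total = num_experts * ffn_params(dim, hidden_dim)
    active = top_k * ffn_params(dim, hidden_dim)  # weights actually used per token
    print(f"E={num_experts}, k={top_k}: total={total:,}, "
          f"active per token={active:,} (dense FFN={dense:,})")
```

With E=1 the total and active counts coincide with the dense FFN, which is why one would expect the 1-expert run to roughly track the dense run if the rest of the implementation (activation, gating, init) matched.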