Mixture of Experts #115
Conversation
kernelmachine commented Nov 27, 2023
- Implements top-2 routing with an arbitrary number of experts (see the routing sketch below).
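For readers unfamiliar with top-2 routing, here is a minimal, illustrative sketch of the idea (this is not the PR's code, which uses megablocks' MoE layer; all names below are hypothetical):

```python
import torch
import torch.nn.functional as F


def top2_route(x: torch.Tensor, router_weights: torch.Tensor):
    """Illustrative top-2 routing: pick the 2 highest-scoring experts per token.

    x: (tokens, d_model), router_weights: (d_model, num_experts).
    Returns expert indices and normalized gate weights, both shaped (tokens, 2).
    """
    logits = x @ router_weights                     # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gate_vals, expert_idx = probs.topk(2, dim=-1)   # top-2 experts per token
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # renormalize gates
    return expert_idx, gate_vals
```

Each token's output is then the gate-weighted sum of the two selected experts' outputs, which is what lets the parameter count grow with the number of experts while keeping per-token compute roughly constant.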
Overall this looks good to me! One thing I don't understand is why you're tracking "step" and "is_final_checkpoint". If you've run a small training run with this and it looks good, I think we should merge soon to avoid merge conflicts! From a quick glance, does anyone else notice anything that looks problematic in terms of backwards compatibility?
I think those are maybe from an earlier PR. @kernelmachine you may need to merge with latest main.
Ooh okay, will fix those merge conflicts! Regardless, I am also dealing with some mysterious bug here: I am seeing worse perplexities with more experts, at a budget where I expect the MoE to do better than the dense model: https://docs.google.com/spreadsheets/d/1QrjOA24wDGGXyZgn2TsAI4q4nmdz5ceQXTTF7pt8hT4/edit#gid=0
This branch is ready to merge. Final benchmarking numbers on Stability cover compute budgets, perplexity, tokens/sec/GPU, training parameters, and inference parameters (benchmark tables not reproduced here).
Looks great! Left a bunch of minor comments, but other than that ready to merge.
(See above)
open_lm/model.py
```python
try:
    from megablocks.layers.moe import MoE
    from megablocks.layers.arguments import Arguments as MoEArgs
except ImportError:
    logging.warning("Megablocks not installed. To train MoE, install with pip install megablocks.")
```
Should we have some assert here to make sure they're not using MoE?
we can't check args during imports though, right?
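One hedged way to reconcile these two points is to record whether the import succeeded and defer the check to model construction, once args are available. This is only a sketch, not the PR's actual diff; the flag name `moe_num_experts` and the `build_model` helper are assumptions for illustration:

```python
import logging

try:
    from megablocks.layers.moe import MoE
    from megablocks.layers.arguments import Arguments as MoEArgs

    MEGABLOCKS_AVAILABLE = True
except ImportError:
    MEGABLOCKS_AVAILABLE = False
    logging.warning("Megablocks not installed. To train MoE, install with pip install megablocks.")


def build_model(args):
    # Hypothetical check at construction time, once args are parsed.
    if getattr(args, "moe_num_experts", 0) > 0:
        assert MEGABLOCKS_AVAILABLE, (
            "MoE requested (moe_num_experts > 0) but megablocks is not installed; "
            "install it with `pip install megablocks`."
        )
    ...
```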
```python
try:
    from megablocks.layers.moe import batched_load_balancing_loss, clear_load_balancing_loss
    from megablocks.layers.arguments import Arguments as MoEArgs
except ImportError:
    logging.warning("Megablocks not installed. To train MoE, install with pip install megablocks.")
```
Again, maybe some assert would be good?
This looks great! I left a few comments; in particular, I think some MoE things need to be implemented in the case where accum_freq == 1.
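For context, a rough sketch of what handling the auxiliary loss might look like in the non-accumulating (accum_freq == 1) path, assuming the megablocks helpers imported above behave as their names suggest; `train_step`, `moe_args`, and the loss setup here are hypothetical, not the PR's actual code:

```python
from megablocks.layers.moe import batched_load_balancing_loss, clear_load_balancing_loss


def train_step(model, batch, targets, loss_fn, optimizer, moe_args):
    # Hypothetical accum_freq == 1 path: one forward/backward per optimizer step.
    optimizer.zero_grad()
    logits = model(batch)
    loss = loss_fn(logits, targets)
    # Add the auxiliary load-balancing loss collected by the MoE layers during
    # the forward pass, then clear the accumulator before the next step.
    loss = loss + batched_load_balancing_loss(moe_args)
    clear_load_balancing_loss()
    loss.backward()
    optimizer.step()
    return loss.item()
```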
this is my favorite PR, LGTM
lgtm! Might be good to do an open_lm_160m run with this branch to make sure nothing in the non-MoE codepath is affected.
Do you have thoughts on why your 32-expert model performs worse than the 8-expert model in your experiments above (#115 (comment)), @kernelmachine?