Mixture of Experts #115

Merged
merged 63 commits into main on Dec 18, 2023

Conversation

kernelmachine
Collaborator

  • Implements top-2 routing with arbitrary numbers of experts.
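
For readers unfamiliar with top-2 routing, here is a minimal, illustrative sketch of the idea (not the megablocks/open_lm implementation; the `Top2Router` name and the dimensions are made up for the example): each token's hidden state is scored against every expert, the two highest-scoring experts are selected, and their gate weights are renormalized.

```python
import torch
import torch.nn.functional as F


class Top2Router(torch.nn.Module):
    """Illustrative top-2 router: each token is dispatched to its 2 best experts."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = torch.nn.Linear(dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, dim) -> router probabilities: (num_tokens, num_experts)
        probs = F.softmax(self.gate(x), dim=-1)
        # Keep the two highest-probability experts per token.
        top2_weights, top2_experts = probs.topk(k=2, dim=-1)
        # Renormalize so the two gate weights sum to 1 for each token.
        top2_weights = top2_weights / top2_weights.sum(dim=-1, keepdim=True)
        return top2_weights, top2_experts


# Example: route 4 tokens of width 512 across 8 experts.
router = Top2Router(dim=512, num_experts=8)
weights, experts = router(torch.randn(4, 512))  # shapes: (4, 2) and (4, 2)
```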

@mitchellnw
Contributor

Overall this looks good to me! One thing I don't understand is why you're tracking "step" and "is_final_checkpoint". If you've run a small training run with this and it looks good, I think we should merge soon to avoid merge conflicts!

From a quick glance, does anyone else notice anything that looks problematic in terms of backwards compatibility?

@sagadre
Collaborator

sagadre commented Dec 4, 2023

I think those are maybe from an earlier PR. @kernelmachine you may need to merge with the latest main.

@kernelmachine
Collaborator Author

Ooh okay will fix those merge conflicts!

Regardless, I am also dealing with some mysterious bug here, as I am seeing worse perplexities with more experts at a budget where I expect the MoE to do better than the dense model: https://docs.google.com/spreadsheets/d/1QrjOA24wDGGXyZgn2TsAI4q4nmdz5ceQXTTF7pt8hT4/edit#gid=0

@kernelmachine
Collaborator Author

kernelmachine commented Dec 16, 2023

This branch is ready to merge. Final benchmarking numbers on Stability:

Compute budgets

| Compute type | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| Number of nodes | 1 | 1 | 1 | 2 | 4 |
| Number of tokens | 20.0B | 20.0B | 20.0B | 20.0B | 20.0B |

Perplexity

| Number of experts | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| 1 | 27.61 | 18.68 | 14.87 | 10.54 | 9.39 |
| 8 | 19.85 | 14.66 | 12.26 | 9.82 | 8.84 |
| 32 | 20.55 | 15.28 | 14.62 | | |

Tokens/sec/GPU

| Number of experts | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| 1 | 141.2K | 106.0K | 95.5K | 30.3K | 16.0K |
| 8 | 69.5K | 66.6K | 66.2K | 18.5K | 9.2K |

Training Parameters

| Number of experts | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| 8 | 68.9M | 165.4M | 360.6M | 1.1B | 2.4B |
| 32 | 164.5M | 439.9M | 1.0B | 3.5B | 7.9B |

Inference Parameters

| Number of experts | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| 2 | 45.0M | 96.8M | 190.7M | 509.2M | 1.1B |

@mitchellnw
Contributor

Looks great! Left a bunch of minor comments but other than that ready to merge.

@mitchellnw mitchellnw self-requested a review December 16, 2023 23:29
Contributor

@mitchellnw mitchellnw left a comment


(See above)

open_lm/model.py Outdated
try:
    from megablocks.layers.moe import MoE
    from megablocks.layers.arguments import Arguments as MoEArgs
except ImportError:
    logging.warning(f"Megablocks not installed. To train MoE, install with pip install megablocks.")
Contributor

Should we have some assert here to make sure they're not using MoE?

Collaborator Author

we can't check args during imports though, right?

try:
    from megablocks.layers.moe import batched_load_balancing_loss, clear_load_balancing_loss
    from megablocks.layers.arguments import Arguments as MoEArgs
except ImportError:
    logging.warning(f"Megablocks not installed. To train MoE, install with pip install megablocks.")
Contributor

Again, maybe some assert would be good?
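
One way to reconcile the comments above (no way to check args at import time, but an assert would be nice) is to record whether the import succeeded and assert later, when the model is actually built with MoE arguments. A hedged sketch; the `megablocks_available` flag and the `moe_num_experts` argument name are illustrative assumptions, not the merged code:

```python
import logging

try:
    from megablocks.layers.moe import MoE
    from megablocks.layers.arguments import Arguments as MoEArgs

    megablocks_available = True
except ImportError:
    logging.warning("Megablocks not installed. To train MoE, install with pip install megablocks.")
    megablocks_available = False


def build_layer(args, layer_id):
    # The import-time block only records availability; the assert fires here,
    # once we know the user actually asked for MoE layers.
    if getattr(args, "moe_num_experts", 0) > 0:  # hypothetical arg name
        assert megablocks_available, (
            "MoE requested (moe_num_experts > 0) but megablocks is not installed; "
            "install it with `pip install megablocks`."
        )
        # ... construct the MoE block here ...
    # ... otherwise construct the dense block ...
```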

Collaborator

@achalddave achalddave left a comment

This looks great! I left a few comments; in particular, I think some MoE things need to be implemented in the case where accum_freq == 1.
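
For reference, a rough sketch of what that might look like: fold the megablocks load-balancing loss into the language-modeling loss in one shared helper, so both the accum_freq == 1 path and the gradient-accumulation path pick it up. The helper name and the `moe_loss_weight` default are assumptions for illustration, and the sketch assumes `batched_load_balancing_loss` takes the megablocks `Arguments` object, as in the imports quoted above.

```python
from megablocks.layers.moe import batched_load_balancing_loss, clear_load_balancing_loss


def add_moe_aux_loss(lm_loss, moe_args, moe_loss_weight=0.1):
    """Add the MoE load-balancing loss to the LM loss and reset its buffer.

    Call this wherever the loss is computed (both with and without gradient
    accumulation) so the auxiliary loss is never silently dropped.
    """
    # megablocks accumulates per-layer load-balancing losses during the forward
    # pass; batched_load_balancing_loss gathers them given the MoE arguments.
    aux_loss = batched_load_balancing_loss(moe_args)
    # Clear the saved losses so they do not leak into the next micro-batch.
    clear_load_balancing_loss()
    return lm_loss + moe_loss_weight * aux_loss
```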

Contributor

@Vaishaal Vaishaal left a comment

this is my favorite PR, LGTM

@sagadre
Collaborator

sagadre commented Dec 18, 2023

lgtm! Might be good to do an open_lm_160m run with this branch to make sure nothing in the non-MoE codepath is affected.

Suchin Gururangan and others added 2 commits December 18, 2023 17:42
@kernelmachine
Collaborator Author

Confirmed the same training losses as the main branch.

@kernelmachine kernelmachine merged commit 5610963 into main Dec 18, 2023
2 checks passed
@Muennighoff
Contributor

Do you have thoughts on why your 32-expert model performs worse than 8 experts in your experiments above (#115 (comment)) @kernelmachine?
