Mixture of Experts #115

Merged
merged 63 commits into main on Dec 18, 2023

Conversation

kernelmachine
Collaborator

  • Implements top-2 routing with arbitrary numbers of experts.
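
For readers unfamiliar with top-2 routing, here is a minimal, illustrative sketch of the idea (not the megablocks/open_lm implementation; the `Top2Router` name and the dimensions are made up for the example): each token's hidden state is scored against every expert, the two highest-scoring experts are selected, and their gate weights are renormalized.

```python
import torch
import torch.nn.functional as F


class Top2Router(torch.nn.Module):
    """Illustrative top-2 router: each token is dispatched to its 2 best experts."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = torch.nn.Linear(dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, dim) -> router probabilities: (num_tokens, num_experts)
        probs = F.softmax(self.gate(x), dim=-1)
        # Keep the two highest-probability experts per token.
        top2_weights, top2_experts = probs.topk(k=2, dim=-1)
        # Renormalize so the two gate weights sum to 1 for each token.
        top2_weights = top2_weights / top2_weights.sum(dim=-1, keepdim=True)
        return top2_weights, top2_experts


# Example: route 4 tokens of width 512 across 8 experts.
router = Top2Router(dim=512, num_experts=8)
weights, experts = router(torch.randn(4, 512))  # shapes: (4, 2) and (4, 2)
```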

@mitchellnw
Contributor

Overall this looks good to me! One thing I don't understand is why you're tracking "step" and "is_final_checkpoint". If you've run a small training run with this and it looks good, I think we should merge soon to avoid merge conflicts!

From a quick glance, does anyone else notice anything that looks problematic in terms of backwards compatibility?

@sagadre
Collaborator

sagadre commented Dec 4, 2023

I think those are maybe from an earlier PR. @kernelmachine you may need to merge with the latest main.

@kernelmachine
Collaborator Author

Ooh okay will fix those merge conflicts!

Regardless, I am also dealing with some mysterious bug here, as I am seeing worse perplexities with more experts at a budget where I expect the MoE to do better than the dense model: https://docs.google.com/spreadsheets/d/1QrjOA24wDGGXyZgn2TsAI4q4nmdz5ceQXTTF7pt8hT4/edit#gid=0

@kernelmachine
Collaborator Author

kernelmachine commented Dec 16, 2023

This branch is ready to merge. Final benchmarking numbers on Stability:

Compute budgets

| Compute type | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| Number of nodes | 1 | 1 | 1 | 2 | 4 |
| Number of tokens | 20.0B | 20.0B | 20.0B | 20.0B | 20.0B |

Perplexity

| Number of experts | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| 1 | 27.61 | 18.68 | 14.87 | 10.54 | 9.39 |
| 8 | 19.85 | 14.66 | 12.26 | 9.82 | 8.84 |
| 32 | 20.55 | 15.28 | 14.62 | | |

Tokens/sec/GPU

| Number of experts | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| 1 | 141.2K | 106.0K | 95.5K | 30.3K | 16.0K |
| 8 | 69.5K | 66.6K | 66.2K | 18.5K | 9.2K |

Training Parameters

| Number of experts | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| 8 | 68.9M | 165.4M | 360.6M | 1.1B | 2.4B |
| 32 | 164.5M | 439.9M | 1.0B | 3.5B | 7.9B |

Inference Parameters

| Number of experts | 41M | 87M | 160M | 410M | 830M |
| --- | --- | --- | --- | --- | --- |
| 2 | 45.0M | 96.8M | 190.7M | 509.2M | 1.1B |

@mitchellnw
Contributor

Looks great! Left a bunch of minor comments but other than that ready to merge.

@mitchellnw mitchellnw self-requested a review December 16, 2023 23:29
Contributor

@mitchellnw mitchellnw left a comment


(See above)

open_lm/model.py Outdated
try:
    from megablocks.layers.moe import MoE
    from megablocks.layers.arguments import Arguments as MoEArgs
except ImportError:
    logging.warning(f"Megablocks not installed. To train MoE, install with pip install megablocks.")
Contributor

Should we have some assert here to make sure they're not using MoE?

Collaborator Author

we can't check args during imports though, right?

try:
    from megablocks.layers.moe import batched_load_balancing_loss, clear_load_balancing_loss
    from megablocks.layers.arguments import Arguments as MoEArgs
except ImportError:
    logging.warning(f"Megablocks not installed. To train MoE, install with pip install megablocks.")
Contributor

Again, maybe some assert would be good?
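
One way to reconcile the comments above (no way to check args at import time, but an assert would be nice) is to record whether the import succeeded and assert later, when the model is actually built with MoE arguments. A hedged sketch; the `megablocks_available` flag and the `moe_num_experts` argument name are illustrative assumptions, not the merged code:

```python
import logging

try:
    from megablocks.layers.moe import MoE
    from megablocks.layers.arguments import Arguments as MoEArgs

    megablocks_available = True
except ImportError:
    logging.warning("Megablocks not installed. To train MoE, install with pip install megablocks.")
    megablocks_available = False


def build_layer(args, layer_id):
    # The import-time block only records availability; the assert fires here,
    # once we know the user actually asked for MoE layers.
    if getattr(args, "moe_num_experts", 0) > 0:  # hypothetical arg name
        assert megablocks_available, (
            "MoE requested (moe_num_experts > 0) but megablocks is not installed; "
            "install it with `pip install megablocks`."
        )
        # ... construct the MoE block here ...
    # ... otherwise construct the dense block ...
```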

Collaborator

@achalddave achalddave left a comment

This looks great! I left a few comments; in particular, I think some MoE things need to be implemented in the case where accum_freq == 1.
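
For reference, a rough sketch of what that might look like: fold the megablocks load-balancing loss into the language-modeling loss in one shared helper, so both the accum_freq == 1 path and the gradient-accumulation path pick it up. The helper name and the `moe_loss_weight` default are assumptions for illustration, and the sketch assumes `batched_load_balancing_loss` takes the megablocks `Arguments` object, as in the imports quoted above.

```python
from megablocks.layers.moe import batched_load_balancing_loss, clear_load_balancing_loss


def add_moe_aux_loss(lm_loss, moe_args, moe_loss_weight=0.1):
    """Add the MoE load-balancing loss to the LM loss and reset its buffer.

    Call this wherever the loss is computed (both with and without gradient
    accumulation) so the auxiliary loss is never silently dropped.
    """
    # megablocks accumulates per-layer load-balancing losses during the forward
    # pass; batched_load_balancing_loss gathers them given the MoE arguments.
    aux_loss = batched_load_balancing_loss(moe_args)
    # Clear the saved losses so they do not leak into the next micro-batch.
    clear_load_balancing_loss()
    return lm_loss + moe_loss_weight * aux_loss
```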

Contributor

@Vaishaal Vaishaal left a comment

this is my favorite PR, LGTM

@sagadre
Collaborator

sagadre commented Dec 18, 2023

lgtm! Might be good to do an open_lm_160m run with this branch to make sure nothing in the non-MoE codepath is affected.

Suchin Gururangan and others added 2 commits December 18, 2023 17:42
@kernelmachine
Collaborator Author

Confirmed the same training losses as the main branch.

@kernelmachine kernelmachine merged commit 5610963 into main Dec 18, 2023
2 checks passed
@Muennighoff
Contributor

Do you have thoughts on why your 32-expert model performs worse than 8 experts in your experiments above (#115 (comment)) @kernelmachine?
