improvements from Moonlight #16

ehartford · 2025-02-27T04:13:11Z

MoonshotAI suggested some improvements to Muon in training Moonlight.

https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py

Perhaps they would make a good addition to Muon?

toothacher17 · 2025-02-27T04:55:58Z

Hi @ehartford, thanks for mentioning it! To be fair, I think the core ideas behind training Moonlight are:

1. weight decay
2. adjusting update RMS by matrix shape
3. matching to AdamW RMS

For points 1 and 2, Keller's current implementation should already include them (as we mentioned in the paper) in the nanogpt setting. Point 3 is mostly designed for the large-scale over-training setting and might not be the best choice for the nanogpt speedrun (small model, small number of tokens).
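For readers who want to see how the three ideas fit together, here is a minimal sketch in plain Python. It is not code from this repository or from Moonlight: the function names `rms`, `adamw_matched_scale`, and `muon_step` are hypothetical, and the `0.2 * sqrt(max(rows, cols))` factor is my reading of the Moonlight paper and `toy_train.py`, so treat it as an assumption. The reasoning is that an orthogonalized update of shape `(rows, cols)` has RMS `1 / sqrt(max(rows, cols))`, so multiplying by `sqrt(max(rows, cols))` removes the shape dependence, and the extra `0.2` puts the update RMS in the range typically seen with AdamW.

```python
import math


def rms(matrix):
    """Root-mean-square over all entries of a list-of-lists matrix."""
    count = sum(len(row) for row in matrix)
    return math.sqrt(sum(x * x for row in matrix for x in row) / count)


def adamw_matched_scale(rows, cols, target_rms=0.2):
    """Shape-dependent factor (assumed form): target_rms * sqrt(max(rows, cols))."""
    return target_rms * math.sqrt(max(rows, cols))


def muon_step(weight, ortho_update, lr, weight_decay, target_rms=0.2):
    """One hypothetical step: decoupled weight decay plus an RMS-matched update.

    `ortho_update` is assumed to be already orthogonalized (e.g. by
    Newton-Schulz iterations), which this sketch does not implement.
    """
    rows, cols = len(weight), len(weight[0])
    scale = adamw_matched_scale(rows, cols, target_rms)
    decay = 1.0 - lr * weight_decay  # AdamW-style decoupled weight decay
    return [
        [decay * w - lr * scale * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, ortho_update)
    ]


# Toy 2x4 update with orthonormal rows, standing in for an orthogonalized update.
toy_update = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
toy_scale = adamw_matched_scale(2, 4)
toy_scaled = [[toy_scale * u for u in row] for row in toy_update]

# Raw RMS is 1 / sqrt(max(2, 4)) = 0.5; the scaled RMS is approximately 0.2.
print(rms(toy_update), rms(toy_scaled))
```

With this scaling, the same learning rate and weight decay tuned for AdamW can in principle be reused for the Muon-updated matrices, which is the motivation for point 3 as I understand it.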

lin72h · 2025-03-04T09:26:45Z

@toothacher17 Nice suggestion!

implement some improvements suggested in https://github.com/MoonshotA…

bea0be0

…I/Moonlight/blob/master/examples/toy_train.py
