Question about weight decay in Muon #418
I observed very minor improvements with weight decay, and very slightly better results with the cautious weight decay form. I'll push it to master shortly.
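For readers unfamiliar with the term, here is a minimal sketch of one common reading of "cautious" weight decay: the decay term is applied only to coordinates where the update and the weight have the same sign. This is an assumption about the general idea, not a copy of what was pushed to master; the function name `cautious_weight_decay_step` and the plain-list representation are hypothetical and chosen only for illustration.

```python
def cautious_weight_decay_step(weights, update, lr, weight_decay):
    """One step with sign-masked ("cautious") weight decay on plain floats.

    Hypothetical helper for illustration; not the nanochat implementation.
    """
    new_weights = []
    for w, u in zip(weights, update):
        # Decay only where the update and the weight share a sign, i.e. where
        # the step -lr * u already moves the weight toward zero.
        decay = weight_decay * w if u * w >= 0 else 0.0
        new_weights.append(w - lr * (u + decay))
    return new_weights


weights = [1.0, -2.0]
update = [0.5, 0.5]
cautious = cautious_weight_decay_step(weights, update, lr=0.1, weight_decay=0.1)
print(cautious)
```

In this toy input the first coordinate is decayed (update and weight are both positive), while the second is left undecayed because the update pushes it away from zero.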
-
Hi!
I noticed that in nanochat the Muon optimizer is used without weight decay.
In recent work on Muon (e.g. “Muon is Scalable for LLM Training”), decoupled weight decay is sometimes applied as a post-update shrinkage and reported to improve stability and final quality.
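To make "post-update shrinkage" concrete, here is a minimal sketch under my own assumptions: the optimizer update is applied first, then the weights are scaled by `1 - lr * weight_decay`. The function name `decoupled_weight_decay_step` is hypothetical, the update is a plain list standing in for Muon's orthogonalized momentum, and the exact formulation in the cited work may differ.

```python
def decoupled_weight_decay_step(weights, update, lr, weight_decay):
    """Apply an optimizer update, then shrink the weights toward zero.

    Hypothetical helper for illustration; works on plain lists of floats.
    """
    # Step 1: apply the optimizer's update direction (for Muon this would be
    # the orthogonalized momentum; here it is just a list of floats).
    stepped = [w - lr * u for w, u in zip(weights, update)]
    # Step 2: decoupled decay, a multiplicative shrink that does not feed
    # into the gradient or momentum statistics.
    shrink = 1.0 - lr * weight_decay
    return [w * shrink for w in stepped]


weights = [1.0, -2.0]
update = [0.5, 0.5]
new_weights = decoupled_weight_decay_step(weights, update, lr=0.1, weight_decay=0.1)
print(new_weights)
```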
Is there a specific reason why weight decay is omitted for Muon here?
Is it mainly for simplicity, or did you observe no benefit (or negative effects) in practice?
Thanks!