integration with mup? #2066
Unanswered
nestordemeure asked this question in Q&A
Is there any interest in integrating maximal update parametrization (mup) with Flax? It is a way to alter parameter initialization and the optimizer such that the optimal values of hyperparameters (such as the learning rate) stay stable across model sizes. The work is interesting in that it lets you tune your hyperparameters on a small model and then train your large model with them (furthermore, it stabilizes the parameters across training, which has intriguing applications for optimizer research).

The existing implementation is based on PyTorch: it introduces a parameter initialization function (a code pattern similar to the one already used in Flax) and a modified optimizer.
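To give a flavor of what the initialization side could look like in Flax, here is a minimal, hypothetical sketch (not the actual mup rules, which involve more careful per-layer scaling; all names below are made up):

```python
# Hypothetical sketch of mup-flavored initialization in Flax: hidden kernels
# are drawn with standard deviation proportional to 1 / sqrt(fan_in), and the
# readout is rescaled by 1 / width so the logits keep the same scale as the
# model is widened. Not the official mup rules.
import jax
import jax.numpy as jnp
import flax.linen as nn

def fan_in_init(scale=1.0):
    """Kernel initializer whose variance scales as 1 / fan_in."""
    def init(key, shape, dtype=jnp.float32):
        fan_in = shape[0]  # Dense kernels have shape (in_features, out_features).
        return (scale / jnp.sqrt(fan_in)) * jax.random.normal(key, shape, dtype)
    return init

class MupMLP(nn.Module):
    width: int
    n_out: int

    @nn.compact
    def __call__(self, x):
        x = nn.relu(nn.Dense(self.width, kernel_init=fan_in_init())(x))
        # Extra 1 / width factor on the readout so the output scale stays
        # stable when `width` grows.
        return nn.Dense(self.n_out, kernel_init=fan_in_init())(x) / self.width

model = MupMLP(width=256, n_out=10)
params = model.init(jax.random.PRNGKey(0), jnp.ones((1, 32)))
```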
I believe this could be a good fit for Flax: Flax is already used to train very large models (such as PaLM), it relies on Optax for its optimization (Optax optimizers are composable, so one could introduce a wrapper to make them mup compatible, along the lines of the sketch below), and it already has a separate weight-initialization pattern (whereas mup had to introduce that as a foreign concept in PyTorch). The port might therefore be relatively straightforward.
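On the optimizer side, a wrapper could be assembled from existing composable Optax pieces; a minimal sketch, assuming a placeholder labeling heuristic rather than mup's actual parameter partitioning:

```python
# Hypothetical sketch of a mup-style Optax wrapper: optax.multi_transform
# applies a width-scaled learning rate to matrix-like parameters (mup's rule
# for Adam on hidden weights) and the base learning rate to everything else.
# The label heuristic below is a placeholder, not the actual mup partitioning.
import jax
import optax

def mup_adam(base_lr, width):
    transforms = {
        'matrix': optax.adam(base_lr / width),  # learning rate shrinks with width
        'other': optax.adam(base_lr),
    }
    def label_fn(params):
        # Placeholder heuristic: treat 2D kernels as hidden matrices.
        return jax.tree_util.tree_map(
            lambda p: 'matrix' if p.ndim == 2 else 'other', params)
    return optax.multi_transform(transforms, label_fn)

optimizer = mup_adam(base_lr=1e-3, width=256)
opt_state = optimizer.init(params)  # `params` from the sketch above
```

A real wrapper would need module metadata to distinguish input, hidden, and output weights, rather than a shape-based heuristic.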
I have asked the mup team, and they appear available and motivated to cooperate on a Flax version.

Replies: 1 comment
I think this type of cutting-edge research is really cool, and it is very exciting to see it being applied to Flax as well! I suppose the work will consist of extending their API so that it can handle not just PyTorch modules but Flax modules as well. We are more than happy to advise on specific questions that arise, or on limitations of Flax that pop up, filed either as GitHub Issues or GitHub Discussions. Given that there does not seem to be anything actionable for our team right now, I have converted your Issue to a Discussion. Please let me know if you have any further questions or would like to add anything!