This is the GitHub repo for the paper "Physics of Skill Learning" (TBA).
Research question: Are skills learned in series or in parallel? The answer is probably a mixture of both, but how much of each? When skills are independent, it is natural to expect them to be learned in parallel. However, we observe the Domino effect: skills tend to be learned sequentially, and notably, some skills only start to be learned right after other skills finish learning. For example, when we train on two independent sparse parity tasks with different frequencies, the less frequent task only starts to be learned once the more frequent one is (nearly) learned.
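For a rough, self-contained illustration of this setup, the sketch below trains a small MLP on a mixture of two sparse parity tasks sampled at different frequencies and logs per-task accuracy. All specific choices (bit indices, sampling frequencies, network width, the task-indicator input, training length) are illustrative assumptions rather than the paper's exact configuration; see domino_model.ipynb and the scripts below for the actual experiments.

```python
# Minimal sketch (assumptions, not the paper's exact setup): two independent
# sparse parity tasks are mixed in each batch with different frequencies,
# and we watch per-task accuracy to look for the Domino effect.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_bits = 20                            # length of the random input bit string (assumed)
task_bits = [[0, 1, 2], [10, 11, 12]]  # bits each parity task depends on (assumed)
task_freqs = torch.tensor([0.9, 0.1])  # task 0 is sampled 9x more often than task 1 (assumed)

def sample_batch(batch_size=1024):
    x = torch.randint(0, 2, (batch_size, n_bits)).float()
    task = torch.multinomial(task_freqs, batch_size, replacement=True)
    y = torch.zeros(batch_size, dtype=torch.long)
    for t, bits in enumerate(task_bits):
        mask = task == t
        y[mask] = x[mask][:, bits].sum(dim=1).long() % 2  # parity of the selected bits
    # append a one-hot task indicator so the model knows which task each input belongs to
    inp = torch.cat([x, F.one_hot(task, len(task_bits)).float()], dim=1)
    return inp, y, task

model = nn.Sequential(nn.Linear(n_bits + len(task_bits), 256), nn.ReLU(), nn.Linear(256, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(20001):
    inp, y, task = sample_batch()
    loss = F.cross_entropy(model(inp), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            correct = (model(inp).argmax(dim=1) == y).float()
            accs = [correct[task == t].mean().item() for t in range(len(task_bits))]
        print(f"step {step:5d}  task accuracies: {accs}")
        # In the Domino regime described above, the frequent task is learned first,
        # and the rare task only takes off after the frequent one is (nearly) solved.
```

Running this a few times with different seeds should make the staggered learning curves visible in the printed per-task accuracies; the notebooks below explore the phenomenon more carefully.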
Language models are demonstrating impressive skills in, e.g., coding and mathematics. Many tasks, including language modeling, are complex composite tasks that can be decomposed into many atomic skills. The learning dynamics of skills appear to be complex and intriguing: throughout training, a skill can be completely learned, partially learned, or not learned at all. Even learned skills can display quite diverse learning curves, including sudden jumps (grokking), gradual improvements, or non-monotonic oscillations. Despite the diverse phenomenology observed in real-world experiments, our intuitive understanding of these phenomena is quite limited. Intuitive understanding, or physics-style understanding, has the potential to bridge theory (mathematics-like understanding) and experiments (engineering-like understanding).
To gain some intuition about skill learning, we take the physicists' approach of abstraction and simplification (see the illustration below): when trying to understand a cow in the wild, physicists make assumptions to simplify the subject matter. It is science but also art to determine the appropriate level of abstraction and simplification. As Einstein famously put it, "Everything should be made as simple as possible, but not simpler." In the same spirit, we propose three models that trade off reality against simplicity: the Geometry model, the Resource model, and the Domino model. Each of these models captures some realistic aspects of the rich dynamics of skill learning.
Good physics-like theories are inspired by experimental observations, and should be able to make predictions testable by new experiments. We stick to this philosophy by applying the toy models to many topics in deep learning, including neural scaling laws, optimization, task dependency and modularity. Although these toy models are extremely simple, they are able to characterize key aspects of real-world learning dynamics.
We aim to make the examples minimal and self-contained, so they are provided as Jupyter notebooks:
- Geometry model: `geometry_model.ipynb`
- Resource model: `resource_model.ipynb`
- Domino model: `domino_model.ipynb`
To reproduce the figures in the paper, see the folder `./scripts`. File names start with "Figx_", where x indicates the corresponding figure.