Description
🎯 Goal (What & Why)
Fast-LLM creates gradient and optimizer state buffers for all parameters, even frozen ones. This wastes memory and reduces the speedup expected from freezing weights, and is a blocker for LoRA (#149).
🚀 Execution Plan
Step 1: What is the smallest working version?
- Create a separate buffer for frozen weights that doesn't have gradients. It can be stored in training precision and will need to be restored separately when ZeRO-3 is involved.
- Add checkpoint support for such buffers. Distributed checkpoints will be trivial, but imports/exports will need additional work. Note: freezing weights is not considered part of the architecture, yet will necessarily change the weight layout. This further breaks the architecture/non-architecture split, making issues such as #166 (Add non-architecture Huggingface conversion parameters) and #171 ([Prototype] Make the model config override the pretrained config) more relevant.
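The buffer split in Step 1 can be sketched as follows. This is only an illustration of the idea, not Fast-LLM code: `ParamMeta`, `plan_buffers` and the buffer names are hypothetical, and the optimizer state size assumes an Adam-style optimizer with two moments per trainable element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamMeta:
    """Hypothetical parameter description (not a Fast-LLM class)."""

    name: str
    numel: int
    frozen: bool = False


def plan_buffers(params):
    """Return the element count of each flat buffer.

    Trainable parameters get weight, gradient and optimizer state storage.
    Frozen parameters only get weight storage, in a separate buffer.
    """
    trainable = sum(p.numel for p in params if not p.frozen)
    frozen = sum(p.numel for p in params if p.frozen)
    return {
        "trainable_weights": trainable,
        "frozen_weights": frozen,
        "gradients": trainable,
        # Assumption: Adam-style optimizer, two moments per trainable element.
        "optimizer_state": 2 * trainable,
    }


# Example: one large frozen weight plus two small trainable LoRA matrices.
params = [
    ParamMeta("layer.weight", 1_000_000, frozen=True),
    ParamMeta("layer.lora_a", 8_000),
    ParamMeta("layer.lora_b", 8_000),
]
plan = plan_buffers(params)
```

In this example the gradient and optimizer state buffers cover only the 16,000 trainable elements, instead of all 1,016,000 as they would when buffers are created for every parameter.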
Step 2: What additional optimizations are possible (later, out-of-scope for now)?
- Avoid storing a separate full-precision copy (shard) of the frozen weights when the 16-bit copy is enough. This will prevent excessive state memory usage when using a small number of GPUs (up to ~3x for single-GPU).
- Avoid reconstructing the frozen weights on every training step when it is not needed. This will save a lot of unnecessary communication and potential network overhead with ZeRO-1/2.
- Weight freezing is not considered part of the architecture, yet will necessarily change the weight layout. We'll need additional safety checks to avoid accidental misuse (e.g. loading distributed checkpoints in the wrong format). Note: this further breaks the architecture/non-architecture split, making issues such as #166 (Add non-architecture Huggingface conversion parameters) and #171 ([Prototype] Make the model config override the pretrained config) more relevant.
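The "up to ~3x" figure for the first optimization can be checked with a small back-of-the-envelope calculation. This sketch assumes a 4-byte full-precision shard split evenly across GPUs and a complete 2-byte (16-bit) working copy on every GPU; the function name and parameters are hypothetical.

```python
def frozen_weight_bytes_per_param(num_gpus, keep_full_precision_shard,
                                  full_bytes=4, half_bytes=2):
    """Bytes of weight storage per frozen parameter on each GPU.

    Assumptions: every GPU holds a complete 16-bit working copy, and the
    full-precision copy (if kept) is sharded evenly across the GPUs.
    """
    working_copy = half_bytes
    shard = full_bytes / num_gpus if keep_full_precision_shard else 0.0
    return working_copy + shard


# Single GPU: 6 bytes with the full-precision shard versus 2 bytes without.
ratio_1_gpu = (frozen_weight_bytes_per_param(1, True)
               / frozen_weight_bytes_per_param(1, False))

# 8 GPUs: the shard shrinks to 0.5 bytes, so the overhead is much smaller.
ratio_8_gpus = (frozen_weight_bytes_per_param(8, True)
                / frozen_weight_bytes_per_param(8, False))
```

Under these assumptions the ratio is 3x on a single GPU and drops to 1.25x on 8 GPUs, which is why the saving matters most for small GPU counts.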
📌 Acceptance Criteria (Must-Haves for Completion)
- Things work as described above
🛠️ Project Management
- Assign the project to the Fast-LLM project.
- Set the Estimate field (in days) in the GitHub project.
- Use the Size field to categorize the PR size (Small/Medium/Large).
- Assign an owner when opening the issue.