Support frozen weights #183

# 🎯 **Goal (What & Why)**

Fast-LLM creates gradient and optimizer state buffers for all parameters, even frozen ones. This wastes memory, reduces the speedup expected from freezing weights, and is a blocker for LoRA (#149).

# 🚀 **Execution Plan**

### **Step 1: What is the smallest working version?**
* Create a separate buffer for frozen weights that doesn't have gradients. It can be stored in training precision and will need to be restored separately when ZeRO-3 is involved.
* Add checkpoint support for such buffers. Distributed checkpoints will be trivial, but imports/exports will need additional work. Note: freezing weights is not considered part of the architecture, yet it will necessarily change the weight layout, further breaking the architecture/non-architecture split and making things like #166 and #171 more relevant.
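
To make the buffer split in Step 1 concrete, here is a minimal sketch of how parameters could be partitioned into a trainable buffer (weights plus gradients) and a frozen buffer (weights only). `ParamMeta` and `plan_buffers` are hypothetical names for illustration, not Fast-LLM APIs, and the 2-byte sizes assume 16-bit training precision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamMeta:
    """Hypothetical description of one parameter (not a Fast-LLM class)."""
    name: str
    numel: int
    frozen: bool = False


def plan_buffers(params, weight_bytes=2, grad_bytes=2):
    """Split parameters into a trainable and a frozen buffer and size each.

    Assumption: weights and gradients are stored in 16-bit training
    precision (2 bytes per element). Frozen parameters get no gradient
    buffer at all.
    """
    trainable = [p for p in params if not p.frozen]
    frozen = [p for p in params if p.frozen]
    n_trainable = sum(p.numel for p in trainable)
    n_frozen = sum(p.numel for p in frozen)
    return {
        "trainable": {
            "names": [p.name for p in trainable],
            "weight_bytes": n_trainable * weight_bytes,
            "grad_bytes": n_trainable * grad_bytes,
        },
        "frozen": {
            "names": [p.name for p in frozen],
            "weight_bytes": n_frozen * weight_bytes,
            "grad_bytes": 0,  # no gradient buffer for frozen weights
        },
    }


# LoRA-like example: a large frozen base weight and two small adapters.
plan = plan_buffers([
    ParamMeta("layer.weight", 1000, frozen=True),
    ParamMeta("layer.lora_a", 16),
    ParamMeta("layer.lora_b", 16),
])
```

Because the frozen buffer has a different layout from the trainable one, a checkpoint written with one freezing configuration would not line up with another, which is why the checkpoint work is listed alongside the buffer change.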

### **Step 2: What additional optimizations are possible (later, out-of-scope for now)?**
* Avoid storing a separate full-precision copy (shard) of the frozen weights when the 16-bit copy is enough. This will prevent excessive state memory usage when using a small number of GPUs (up to ~3x for single-GPU).
* Avoid reconstructing the frozen weights on every training step when they don't need to be. This will save a lot of unnecessary communication and potential network overhead with ZeRO-1/2.
* Weight freezing is not considered part of the architecture, yet it will necessarily change the weight layout. We'll need additional safety checks to avoid accidental misuse (e.g. loading distributed checkpoints in the wrong format). Note: this further breaks the architecture/non-architecture split, making things like #166 and #171 more relevant.
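
The "up to ~3x for single-GPU" figure in the first bullet can be checked with a small calculation. This is a rough sketch under stated assumptions: the full-precision shard is 32-bit (4 bytes per parameter) and split evenly across GPUs, while the training-precision copy is 16-bit (2 bytes per parameter) and fully replicated, as with ZeRO-1/2. The function name is hypothetical.

```python
def frozen_weight_bytes_per_param(num_gpus, keep_fp32_shard,
                                  fp32_bytes=4, half_bytes=2):
    """Per-GPU bytes needed to hold one frozen parameter.

    Assumptions (not taken from Fast-LLM code): the 16-bit copy is
    replicated on every GPU, and the 32-bit shard, if kept, is split
    evenly across `num_gpus`.
    """
    shard = fp32_bytes / num_gpus if keep_fp32_shard else 0.0
    return half_bytes + shard


def overhead_ratio(num_gpus):
    """Memory with the fp32 shard divided by memory without it."""
    with_shard = frozen_weight_bytes_per_param(num_gpus, keep_fp32_shard=True)
    without_shard = frozen_weight_bytes_per_param(num_gpus, keep_fp32_shard=False)
    return with_shard / without_shard


single_gpu = overhead_ratio(1)  # (2 + 4) / 2
eight_gpus = overhead_ratio(8)  # (2 + 0.5) / 2
```

Under these assumptions the overhead is largest on a single GPU and shrinks as the shard is spread over more devices, which matches the bullet's focus on small GPU counts.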

# 📌 **Acceptance Criteria** (Must-Haves for Completion)
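* Step 1 works as described above: frozen weights get no gradient or optimizer state buffers, and checkpoints support the new frozen-weight buffers.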

# 🛠️ **Project Management**
- [x] **Assign the project to the Fast-LLM project.**
- [ ] **Set the `Estimate` field (in days) in the GitHub project.**
- [ ] **Use the `Size` field to categorize the PR size (Small/Medium/Large).**
- [ ] **Assign an owner when opening the issue.**  

