Description
🎯 Goal (What & Why)
Currently, a pretrained config overrides an arbitrary part of the user-specified config. This causes several problems:
- We typically override only the architecture parameters, which complicates things when we also want to bring in some non-architecture parameters (#166).
- Some default values are set before the pretrained config is loaded and end up being wrong (#88).
- Values explicitly set in the config are silently ignored, which is confusing: the actual hidden size may end up being 4096 when the config explicitly says 2048 (see the sketch below).
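To make that last failure mode concrete, here is a minimal sketch with plain Python dicts; the parameter names and the merge direction are assumptions for illustration, not Fast-LLM's actual config classes.

```python
# Hypothetical illustration of the current behaviour (plain dicts, not Fast-LLM classes):
# the pretrained config wins, so the user's explicit value is silently dropped.
user_config = {"hidden_size": 2048, "window_size": 512}
pretrained_config = {"hidden_size": 4096, "num_layers": 32}

merged = {**user_config, **pretrained_config}  # pretrained values override the user's
assert merged["hidden_size"] == 4096  # the config said 2048, but it is silently ignored
```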
I suggest flipping things around so that the user-specified model config overrides the pretrained config. This should give us the behaviour we want in most cases (sketched after the list):
- Pretrained config, no base model config: All architecture parameters are imported, and so are relevant non-architecture parameters (e.g. `window_size`). Other non-architecture parameters take the Fast-LLM default.
- Pretrained config, base model config with non-architecture parameters: Parameters explicitly specified in the base model config are taken; others are handled as above.
- Pretrained config, base model config with architecture parameters: We probably want to enforce matching values, and raise an error for any mismatch. (This would be an improvement because right now wrong values are silently ignored.)
- No pretrained config: Same as before.
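As a rough sketch of the proposed precedence, the snippet below uses plain dicts, a hypothetical `merge_configs` helper, and an assumed architecture/non-architecture split; it is illustrative only, not the actual implementation.

```python
# Assumed split between architecture and non-architecture keys, for illustration only.
ARCHITECTURE_KEYS = {"hidden_size", "num_layers"}

def merge_configs(pretrained: dict, user: dict) -> dict:
    # Architecture parameters explicitly set by the user must match the pretrained ones.
    for key in ARCHITECTURE_KEYS & user.keys() & pretrained.keys():
        if user[key] != pretrained[key]:
            raise ValueError(
                f"Architecture mismatch for {key}: {user[key]} (config) vs {pretrained[key]} (pretrained)"
            )
    # Non-architecture parameters explicitly set by the user take precedence;
    # everything else comes from the pretrained config (or the Fast-LLM default).
    return {**pretrained, **user}

merged = merge_configs({"hidden_size": 4096, "window_size": 1024}, {"window_size": 512})
assert merged == {"hidden_size": 4096, "window_size": 512}
```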
🚀 Execution Plan
We can use Fast-LLM's override mechanism as in #168.
However, we'll also need to adapt the update mechanism to get the behaviour we want for nested configs (see the sketch at the end of this section).
It could also be difficult to achieve backward compatibility.
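For the nested case, a recursive merge along these lines could give the intended behaviour, where only the leaves the user actually set are overridden and nested sub-configs are merged rather than replaced wholesale. This is a sketch over plain dicts; the real update mechanism lives in Fast-LLM's config classes and would need to follow #168.

```python
def deep_update(base: dict, override: dict) -> dict:
    """Merge `override` into `base`, recursing into nested dicts (illustrative only)."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_update(result[key], value)  # merge nested sub-configs
        else:
            result[key] = value  # user-specified leaf wins
    return result

pretrained = {"transformer": {"hidden_size": 4096, "window_size": 1024}}
user = {"transformer": {"window_size": 512}}
assert deep_update(pretrained, user) == {"transformer": {"hidden_size": 4096, "window_size": 512}}
```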
📌 Acceptance Criteria (Must-Haves for Completion)
- The override behaviour matches the cases described above: values explicitly specified in the model config take precedence over the pretrained config, mismatched architecture parameters raise an error, and nothing is silently ignored.
🛠️ Project Management
- Assign the issue to the Fast-LLM project.
- Set the `Estimate` field (in days) in the GitHub project.
- Use the `Size` field to categorize the PR size (Small/Medium/Large).
- Assign an owner when opening the issue.