Closed
🐞 Describe the Bug
Error when resuming an experiment from distributed format, with a different set of frozen weights.
```
2025-05-07 19:22:37,146 [Rank 05] Traceback (most recent call last):
  File "/app/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/app/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/app/fast_llm/engine/training/config.py", line 423, in runnable
    trainer.run()
  File "/app/fast_llm/engine/training/trainer.py", line 172, in run
    self._run_training()
  File "/app/fast_llm/engine/training/trainer.py", line 175, in _run_training
    self._prepare_training_state()
  File "/app/fast_llm/engine/training/trainer.py", line 456, in _prepare_training_state
    self._load_checkpoint(self._config.training.checkpoint, last_iteration)
  File "/app/fast_llm/engine/training/trainer.py", line 525, in _load_checkpoint
    metadata = self._multi_stage.load_checkpoint(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/fast_llm/engine/multi_stage/fast_llm_model.py", line 40, in load_checkpoint
    metadata = converter.load(config)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/fast_llm/engine/checkpoint/distributed.py", line 122, in load
    self_fsdp.copy_shard_overlaps(
  File "/app/fast_llm/engine/multi_stage/fsdp.py", line 455, in copy_shard_overlaps
    shard[begin:end][overlap_mask] = loaded_shards[shard_name][overlap_index_map_masked]
                                     ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index is out of bounds for dimension with size 0
```
🔄 Steps to Reproduce
Steps to reproduce the behavior (fast-llm version: 286f9d3a2d5daec175a20bea11613405a5c53b71, on `main`):
1. Pretraining run with `mlp_lr_scale=0.0`
2. Load that pretrained model in distributed format, with `mlp_lr_scale=1.0`
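For concreteness, a sketch of the config change between the two runs (the exact YAML path to `mlp_lr_scale` is an assumption; the point is that only this value differs between run 1 and run 2):

```yaml
# Run 1: pretrain with the MLP weights frozen
# (placement of mlp_lr_scale under transformer is assumed)
model:
  base_model:
    transformer:
      mlp_lr_scale: 0.0   # MLPs frozen during pretraining

# Run 2: identical config, resuming from run 1's distributed checkpoint,
# except:
#       mlp_lr_scale: 1.0   # MLPs now trainable -> crash on load
```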
🎯 Expected Behavior
No crash
📝 Additional Context
Originally, the bug I observed was on this branch: #243
In a first run, I set the lr-scale of the embedding/output weights to zero, then un-froze the output weights in a subsequent run. There was no crash, but the loss was very high at the beginning of training.
Resuming from the Hugging Face format instead of the distributed one worked fine.
In an attempt to reproduce this issue on main, I got the traceback above.
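A minimal sketch of what the failing line in `copy_shard_overlaps` appears to hit, under the assumption that a parameter frozen in the saving run has a size-0 shard in the checkpoint, while the resuming run (which treats it as trainable) builds a non-empty index map into it. Shown here with NumPy rather than the actual FSDP shard objects; all names are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical stand-in for the loaded checkpoint shard of a parameter
# that was frozen (mlp_lr_scale=0.0) when the checkpoint was written:
# nothing was saved for it, so the shard is empty.
loaded_shard = np.empty(0)

# The resuming run (mlp_lr_scale=1.0) considers the parameter trainable
# and builds a non-empty overlap index map into the loaded shard.
overlap_index_map = np.array([0, 1, 2])

try:
    _ = loaded_shard[overlap_index_map]
except IndexError as exc:
    # Same failure mode as the traceback:
    # "index is out of bounds for dimension with size 0"
    print(f"IndexError: {exc}")
```

This suggests the shard-overlap copy would need to either skip parameters absent from the checkpoint or validate shard sizes before indexing.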