[bug] Resuming experiment in distributed format with frozen weights #256

Closed
Description

@RaymondLi0

🐞 Describe the Bug

An error occurs when resuming an experiment from the distributed checkpoint format with a different set of frozen weights.

2025-05-07 19:22:37,146 [Rank 05] Traceback (most recent call last):
  File "/app/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/app/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/app/fast_llm/engine/training/config.py", line 423, in runnable
    trainer.run()
  File "/app/fast_llm/engine/training/trainer.py", line 172, in run
    self._run_training()
  File "/app/fast_llm/engine/training/trainer.py", line 175, in _run_training
    self._prepare_training_state()
  File "/app/fast_llm/engine/training/trainer.py", line 456, in _prepare_training_state
    self._load_checkpoint(self._config.training.checkpoint, last_iteration)
  File "/app/fast_llm/engine/training/trainer.py", line 525, in _load_checkpoint
    metadata = self._multi_stage.load_checkpoint(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/fast_llm/engine/multi_stage/fast_llm_model.py", line 40, in load_checkpoint
    metadata = converter.load(config)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/fast_llm/engine/checkpoint/distributed.py", line 122, in load
    self_fsdp.copy_shard_overlaps(
  File "/app/fast_llm/engine/multi_stage/fsdp.py", line 455, in copy_shard_overlaps
    shard[begin:end][overlap_mask] = loaded_shards[shard_name][overlap_index_map_masked]
                                     ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index is out of bounds for dimension with size 0
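For context, the failing assignment in copy_shard_overlaps reduces to indexing an empty tensor with a non-empty index map. Below is a minimal sketch of that pattern; the names mirror the traceback, and the zero-size loaded shard is my assumption about what the frozen MLP weights leave behind in the distributed checkpoint:

```python
import torch

# Hypothetical stand-ins for the tensors in fsdp.copy_shard_overlaps.
shard = torch.zeros(8)                                  # destination shard for the now-unfrozen weights
overlap_mask = torch.tensor([True] * 3 + [False] * 5)   # positions to copy into
overlap_index_map_masked = torch.tensor([0, 1, 2])      # indices into the loaded shard

# Assumption: the checkpoint written with mlp_lr_scale=0.0 stores a size-0
# shard for the frozen MLP weights.
loaded_shard = torch.empty(0)

# Raises: IndexError: index 0 is out of bounds for dimension 0 with size 0
shard[0:8][overlap_mask] = loaded_shard[overlap_index_map_masked]
```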

🔄 Steps to Reproduce

Steps to reproduce the behavior:

fast-llm version: 286f9d3a2d5daec175a20bea11613405a5c53b71 (main)

1 - Pretraining run with mlp_lr_scale=0.0
2 - Load that pretrained model in distributed format, with mlp_lr_scale=1.0

🎯 Expected Behavior

No crash: the checkpoint should load successfully and training should resume.

📝 Additional Context

Originally, the bug I observed was on this branch: #243
In a first run, I set the lr-scale of the embedding/output weights to zero, then un-froze the output weights in a subsequent run. There was no crash, but the loss was very high at the beginning of training.
Resuming from the Hugging Face format instead of the distributed format worked fine.
While attempting to reproduce this issue on main, I got the traceback above.
