Closed
🐞 Describe the Bug
Error when resuming an experiment from distributed format, with a different set of frozen weights.
```
2025-05-07 19:22:37,146 [Rank 05] Traceback (most recent call last):
  File "/app/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/app/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/app/fast_llm/engine/training/config.py", line 423, in runnable
    trainer.run()
  File "/app/fast_llm/engine/training/trainer.py", line 172, in run
    self._run_training()
  File "/app/fast_llm/engine/training/trainer.py", line 175, in _run_training
    self._prepare_training_state()
  File "/app/fast_llm/engine/training/trainer.py", line 456, in _prepare_training_state
    self._load_checkpoint(self._config.training.checkpoint, last_iteration)
  File "/app/fast_llm/engine/training/trainer.py", line 525, in _load_checkpoint
    metadata = self._multi_stage.load_checkpoint(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/fast_llm/engine/multi_stage/fast_llm_model.py", line 40, in load_checkpoint
    metadata = converter.load(config)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/fast_llm/engine/checkpoint/distributed.py", line 122, in load
    self_fsdp.copy_shard_overlaps(
  File "/app/fast_llm/engine/multi_stage/fsdp.py", line 455, in copy_shard_overlaps
    shard[begin:end][overlap_mask] = loaded_shards[shard_name][overlap_index_map_masked]
                                     ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index is out of bounds for dimension with size 0
```
🔄 Steps to Reproduce
Steps to reproduce the behavior (fast-llm version: 286f9d3a2d5daec175a20bea11613405a5c53b71, on `main`):
1. Pretraining run with `mlp_lr_scale=0.0`
2. Load that pretrained model in distributed format, with `mlp_lr_scale=1.0`
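For concreteness, a sketch of the config change between the two runs (the exact YAML path to `mlp_lr_scale` is an assumption; the point is that only this value differs between run 1 and run 2):

```yaml
# Run 1: pretrain with the MLP weights frozen
# (placement of mlp_lr_scale under transformer is assumed)
model:
  base_model:
    transformer:
      mlp_lr_scale: 0.0   # MLPs frozen during pretraining

# Run 2: identical config, resuming from run 1's distributed checkpoint,
# except:
#       mlp_lr_scale: 1.0   # MLPs now trainable -> crash on load
```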
🎯 Expected Behavior
No crash
📝 Additional Context
Originally, the bug I observed was on this branch: #243
In a first run, I set the lr-scale of the embedding/output weights to zero, then un-froze the output weights in a subsequent run. There was no crash, but the loss was very high at the beginning of training.
Resuming from the Hugging Face format instead of the distributed one worked fine.
In an attempt to reproduce this issue on main, I got the traceback above.
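A minimal sketch of what the failing line in `copy_shard_overlaps` appears to hit, under the assumption that a parameter frozen in the saving run has a size-0 shard in the checkpoint, while the resuming run (which treats it as trainable) builds a non-empty index map into it. Shown here with NumPy rather than the actual FSDP shard objects; all names are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical stand-in for the loaded checkpoint shard of a parameter
# that was frozen (mlp_lr_scale=0.0) when the checkpoint was written:
# nothing was saved for it, so the shard is empty.
loaded_shard = np.empty(0)

# The resuming run (mlp_lr_scale=1.0) considers the parameter trainable
# and builds a non-empty overlap index map into the loaded shard.
overlap_index_map = np.array([0, 1, 2])

try:
    _ = loaded_shard[overlap_index_map]
except IndexError as exc:
    # Same failure mode as the traceback:
    # "index is out of bounds for dimension with size 0"
    print(f"IndexError: {exc}")
```

This suggests the shard-overlap copy would need to either skip parameters absent from the checkpoint or validate shard sizes before indexing.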