
Problem when finetuning with all heads #1773

@SunXT-0719

Description


Hi fairchem team!
I'm working on finetuning uma-s-1p1 with all of its pretrained heads. Following PR #1766, I deleted the head section from my finetune config, uma_sm_finetune_template.yaml,
from:

    model:
      _target_: fairchem.core.units.mlip_unit.mlip_unit.initialize_finetuning_model
      checkpoint_location:
        _target_: fairchem.core.calculate.pretrained_mlip.pretrained_checkpoint_path_from_name
        model_name: ${base_model_name}
      overrides:
        backbone:
          otf_graph: true
          max_neighbors: ${max_neighbors}
          regress_stress: ${data.regress_stress}
          always_use_pbc: false
        pass_through_head_outputs: ${data.pass_through_head_outputs}
      heads: ${data.heads}

to:

    model:
      _target_: fairchem.core.units.mlip_unit.mlip_unit.initialize_finetuning_model
      checkpoint_location:
        _target_: fairchem.core.calculate.pretrained_mlip.pretrained_checkpoint_path_from_name
        model_name: ${base_model_name}
      overrides:
        backbone:
          otf_graph: true
          max_neighbors: ${max_neighbors}
          regress_stress: ${data.regress_stress}
          always_use_pbc: false
        pass_through_head_outputs: ${data.pass_through_head_outputs}

I did some basic debugging to get the program running, but the loss at step 0 is still abnormally high:

INFO:root:{'train/loss': 8990.620638182878, 'train/lr': 1e-05, 'train/step': 0, 'train/epoch': 0.0, 'train/samples_per_second(approx)': 6.1664387292354395, 'train/atoms_per_second(approx)': 197.7114417561113, 'train/num_atoms_on_rank': 1026, 'train/num_samples_on_rank': 32}
/data/sunxuetin/anaconda3/envs/UMA/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:332: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  _warn_get_lr_called_within_step(self)
INFO:root:{'train/loss': 11867.852604975855, 'train/lr': 1e-05, 'train/step': 0, 'train/epoch': 0.0, 'train/samples_per_second(approx)': 6.168348108451524, 'train/atoms_per_second(approx)': 197.5799003488379, 'train/num_atoms_on_rank': 1025, 'train/num_samples_on_rank': 32}
/data/sunxuetin/anaconda3/envs/UMA/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:332: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  _warn_get_lr_called_within_step(self)
INFO:root:{'train/loss': 10284.808734129449, 'train/lr': 1e-05, 'train/step': 0, 'train/epoch': 0.0, 'train/samples_per_second(approx)': 6.165504733162675, 'train/atoms_per_second(approx)': 197.68149550702827, 'train/num_atoms_on_rank': 1026, 'train/num_samples_on_rank': 32}
/data/sunxuetin/anaconda3/envs/UMA/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:332: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  _warn_get_lr_called_within_step(self)
INFO:root:{'train/loss': 10899.674122548213, 'train/lr': 1e-05, 'train/step': 0, 'train/epoch': 0.0, 'train/samples_per_second(approx)': 6.131760968243467, 'train/atoms_per_second(approx)': 195.83311592327573, 'train/num_atoms_on_rank': 1022, 'train/num_samples_on_rank': 32}

which is the same as the loss with re-initialized heads.

Do I need to do anything else when editing the config to make sure the heads are successfully loaded?
Thanks for your reply!
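One generic way to check whether head weights were actually restored (independent of fairchem's config machinery) is to load the checkpoint's state dict with `strict=False` and compare the head tensors against the checkpoint. This is a minimal PyTorch sketch with made-up module names, not fairchem's actual model classes:

```python
import torch
import torch.nn as nn

# Toy stand-in for a backbone + head model (names are illustrative only).
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.energy_head = nn.Linear(4, 1)

torch.manual_seed(0)
pretrained = TinyModel()
ckpt = {"state_dict": pretrained.state_dict()}  # a "pretrained" checkpoint

torch.manual_seed(1)
finetune = TinyModel()  # freshly initialized, so weights start out different

# strict=False reports which keys were NOT found in the checkpoint;
# a populated missing-keys list would mean the heads fell back to random init.
missing, unexpected = finetune.load_state_dict(ckpt["state_dict"], strict=False)

# If loading worked, the head tensors match the checkpoint bit-for-bit.
loaded = torch.equal(finetune.energy_head.weight, pretrained.energy_head.weight)
print("head weights loaded:", loaded, "| missing keys:", missing)
```

If the head parameters differ from the checkpoint right after `initialize_finetuning_model` returns, that would explain the step-0 loss matching the re-initialized-heads run.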
