Stuck at calling runners.run() #1827

@Student204161

Description

What would you like to report?

I am trying to train eSCAIP on the OC22 dataset, which I have converted to the ase.db format, but I run into problems.

Using my organisation's HPC, which schedules jobs with SLURM, my attempt at training from scratch on the OC22 dataset gets stuck for 2+ hours with only this in the log:

```
/home/energy/s204161/newT/envs/escaip_env/lib/python3.11/site-packages/hydra/plugins/config_source.py:125: UserWarning: Support for .yml files is deprecated. Use .yaml extension for Hydra config files
  deprecation_warning(
INFO:root:Setting up distributed backend...
INFO:root:Calling runner.run() ...
```

I have tried using only a small subset of the data and waiting longer, but it didn't help. I suspect the job deadlocks somewhere and nothing ever happens, but I am not sure.
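To find out where it is stuck, one thing I could try is registering a stack dumper at the top of the script that eventually calls runner.run() (assuming I can edit that entry point; this is plain Python stdlib, not a fairchem API):

```python
# Sketch: dump every thread's stack so a hang can be located.
# Assumption: this runs at the top of the training entry point;
# faulthandler and signal are Python standard library modules.
import faulthandler
import signal
import sys

# Dump all thread tracebacks to stderr when the process receives SIGUSR1,
# e.g. from a shell on the compute node:  kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True, file=sys.stderr)

# Or dump automatically if the process is still alive after 30 minutes,
# without killing it:
faulthandler.dump_traceback_later(30 * 60, exit=False)
```

Since `--error` points at `job_err_debug.log`, the tracebacks should land there, and they would show whether the main process is blocked in a collective, a DataLoader worker join, or somewhere else.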

In case someone has helpful advice or recognises this, below are my SLURM job script, cluster config, and main config.

My SLURM job script:

```bash
#!/bin/bash
#SBATCH --job-name=OC22
#SBATCH --mail-type=NONE
#SBATCH --partition=h200
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:1
#SBATCH --time=8:00:00
#SBATCH --output=./job_out_debug.log
#SBATCH --error=./job_err_debug.log

module load Python/3.11.3-GCCcore-12.3.0
source /home/energy/s204161/newT/envs/escaip_env/bin/activate

fairchem -c oc22_escaip_M1.yml
#HYDRA_FULL_ERROR=1 fairchem -c oc22_escaip_M1.yml
```
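Since the log stops right after "Setting up distributed backend...", I could also turn on PyTorch/NCCL debug logging before launching, to see which initialisation step never completes. These env var names are real PyTorch/NCCL knobs; whether they surface anything for this particular hang is my assumption:

```shell
# Hypothetical additions to the sbatch script, just before the launch line.
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # verbose c10d init/collective logging
export TORCH_CPP_LOG_LEVEL=INFO         # surface low-level c10d messages
export NCCL_DEBUG=INFO                  # NCCL rendezvous/transport logging
export HYDRA_FULL_ERROR=1               # full Hydra stack traces on failure

# then launch as before:
# fairchem -c oc22_escaip_M1.yml
```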

The cluster config:

```yaml
run_dir: /home/energy/s204161/newT/src/scripts/escaip/M1/runs
data_root_dir: /home/energy/s204161/newT/datasets
mode: LOCAL
device: CUDA
ranks_per_node: 1
dataloader_workers: 8
timeout_hr: 72
debug: True
mem_gb: 128
cpus_per_task: 24 #or maybe 1?
partition: scavenge
additional_parameters: null
```
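One quick check I could do here is disabling DataLoader worker processes, which are a common source of silent hangs with multiprocessing. My assumption is that this key feeds `num_workers` via the `${cluster.dataloader_workers}` interpolation in the main config:

```yaml
# Temporary change in the cluster config to rule out a worker deadlock:
dataloader_workers: 0   # data loading runs in the main process; slower, but
                        # if the hang disappears, the workers were the cause
```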

The main config:

```yaml
defaults:
  - cluster: nilf_cluster
  - backbone: H640L10
  - dataset: oc22_mini
  - element_refs: oc22_refs
  - tasks: oc22_direct
  - _self_

job:
  device_type: ${cluster.device}
  scheduler:
    mode: ${cluster.mode}
    ranks_per_node: ${cluster.ranks_per_node}
    num_nodes: 1 # we only have 1 gpu so this has to be 1, maybe 2 in future.
    slurm:
      mem_gb: ${cluster.mem_gb}
      timeout_hr: ${cluster.timeout_hr}
      partition: ${cluster.partition}
      cpus_per_task: ${cluster.cpus_per_task}
      additional_parameters: ${cluster.additional_parameters}
  debug: ${cluster.debug}
  run_dir: ${cluster.run_dir}
  run_name: escaip_oc22_M1_direct_24n
  logger: null
    # _target_: fairchem.core.common.logger.WandBSingletonLogger.init_wandb
    # _partial_: true
    # entity: khalil
    # project: oc22_M1

# cpu_graph: True
max_neighbors: 30
max_neighbors_pad_size: 45
cutoff_radius: 6
epochs: 3
steps: null # 80B atoms, 128 ranks, max atoms 350 (mean atoms 300)
max_atoms: 300
normalizer_rmsd: 1.423
direct_forces_coef: 60
oc22_energy_coef: 20

regress_stress: False
direct_forces: True
use_pbc: True
#oc22_forces_key: forces

dataset_list: ["oc22"]

train_dataset:
  _target_: fairchem.core.datasets.mt_concat_dataset.create_concat_dataset
  dataset_configs:
    # omc: ${dataset.omc_train}
    # omol: ${dataset.omol_train}
    # odac: ${dataset.odac_train}
    # omat: ${dataset.omat_train}
    oc22: ${dataset.oc22_train}
  combined_dataset_config:
    sampling:
      type: explicit
      ratios:
        # omol.train: 4.0
        oc22.train: 1.0
        # omc.train: 2.0
        # odac.train: 1.0
        # omat.train: 2.0

val_dataset:
  _target_: fairchem.core.datasets.mt_concat_dataset.create_concat_dataset
  dataset_configs:
    # omc: ${dataset.omc_val}
    # omol: ${dataset.omol_val}
    # odac: ${dataset.odac_val}
    # omat: ${dataset.omat_val}
    oc22: ${dataset.oc22_val}
  combined_dataset_config: { sampling: {type: temperature, temperature: 1.0} }

train_dataloader:
  _target_: fairchem.core.components.common.dataloader_builder.get_dataloader
  dataset: ${train_dataset}
  batch_sampler_fn:
    _target_: fairchem.core.datasets.samplers.max_atom_distributed_sampler.MaxAtomDistributedBatchSampler
    _partial_: True
    max_atoms: ${max_atoms}
    shuffle: True
    seed: 0
  num_workers: ${cluster.dataloader_workers}
  collate_fn:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.mt_collater_adapter
    tasks: ${tasks}

eval_dataloader:
  _target_: fairchem.core.components.common.dataloader_builder.get_dataloader
  dataset: ${val_dataset}
  batch_sampler_fn:
    _target_: fairchem.core.datasets.samplers.max_atom_distributed_sampler.MaxAtomDistributedBatchSampler
    _partial_: True
    max_atoms: ${max_atoms}
    shuffle: False
    seed: 0
  num_workers: ${cluster.dataloader_workers}
  collate_fn:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.mt_collater_adapter
    tasks: ${tasks}

heads:
  oc22_energy:
    module: fairchem.core.models.escaip.EScAIP.EScAIPEnergyHead
  oc22_forces:
    module: fairchem.core.models.escaip.EScAIP.EScAIPDirectForceHead

runner:
  _target_: fairchem.core.components.train.train_runner.TrainEvalRunner
  train_dataloader: ${train_dataloader}
  eval_dataloader: ${eval_dataloader}
  train_eval_unit:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.MLIPTrainEvalUnit
    job_config: ${job}
    tasks: ${tasks}
    model:
      _target_: fairchem.core.models.base.HydraModel
      backbone: ${backbone}
      heads: ${heads}
    optimizer_fn:
      _target_: torch.optim.AdamW
      _partial_: true
      lr: 8e-4
      weight_decay: 1e-3
    cosine_lr_scheduler_fn:
      _target_: fairchem.core.units.mlip_unit.mlip_unit._get_consine_lr_scheduler
      _partial_: true
      warmup_factor: 0.2
      warmup_epochs: 0.2
      lr_min_factor: 0.01
      epochs: ${epochs}
      steps: ${steps}
    print_every: 10
    clip_grad_norm: 100
  max_epochs: ${epochs}
  max_steps: ${steps}
  evaluate_every_n_steps: 5000
  callbacks:
    - _target_: fairchem.core.common.profiler_utils.ProfilerCallback
      job_config: ${job}
    - _target_: fairchem.core.components.train.train_runner.TrainCheckpointCallback
      checkpoint_every_n_steps: 2000
      max_saved_checkpoints: 5
```

