Description
What would you like to report?
I am trying to train eSCAIP on the OC22 dataset, which I have converted to the ase.db format, but I run into problems.
On my organisation's HPC cluster, which schedules jobs with SLURM, my attempt at training from scratch on OC22 gets stuck for 2+ hours with only this in the log:
```
/home/energy/s204161/newT/envs/escaip_env/lib/python3.11/site-packages/hydra/plugins/config_source.py:125: UserWarning: Support for .yml files is deprecated. Use .yaml extension for Hydra config files
  deprecation_warning(
INFO:root:Setting up distributed backend...
INFO:root:Calling runner.run() ...
```
I have tried using only a small subset of the data and waiting longer, but it didn't help. I suspect the job deadlocks somewhere after the runner starts, since nothing further is ever written, but I am not sure.
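To see where the process is actually stuck, I plan to dump Python stack traces with the standard-library `faulthandler` module. A minimal sketch (the signal choice and the 30-minute timeout are just values I picked, not anything from fairchem):

```python
# Sketch: make a hung run report where it is stuck. Add this at the very top
# of the training entry point. faulthandler is in the Python standard library.
import faulthandler
import signal
import sys

# Dump the stack of every thread to stderr when the process receives SIGUSR1,
# e.g. via `scancel --signal=USR1 <jobid>` or `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Also dump automatically (and repeatedly) while the process is still alive
# after 30 minutes, which should catch a silent deadlock like the one above.
faulthandler.dump_traceback_later(timeout=1800, repeat=True)
```

Setting the environment variable `PYTHONFAULTHANDLER=1` in the job script enables the crash-time part of this without touching any code.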
In case someone has helpful advice or has seen this before, below are my SLURM job script, cluster config, and main config.
My SLURM job script:

```bash
#!/bin/bash
#SBATCH --job-name=OC22
#SBATCH --mail-type=NONE
#SBATCH --partition=h200
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:1
#SBATCH --time=8:00:00
#SBATCH --output=./job_out_debug.log
#SBATCH --error=./job_err_debug.log
module load Python/3.11.3-GCCcore-12.3.0
source /home/energy/s204161/newT/envs/escaip_env/bin/activate
fairchem -c oc22_escaip_M1.yml
#HYDRA_FULL_ERROR=1 fairchem -c oc22_escaip_M1.yml
```
The cluster config:

```yaml
run_dir: /home/energy/s204161/newT/src/scripts/escaip/M1/runs
data_root_dir: /home/energy/s204161/newT/datasets
mode: LOCAL
device: CUDA
ranks_per_node: 1
dataloader_workers: 8
timeout_hr: 72
debug: True
mem_gb: 128
cpus_per_task: 24 # or maybe 1?
partition: scavenge
additional_parameters: null
```
The main config:

```yaml
defaults:
  - cluster: nilf_cluster
  - backbone: H640L10
  - dataset: oc22_mini
  - element_refs: oc22_refs
  - tasks: oc22_direct
  - _self_

job:
  device_type: ${cluster.device}
  scheduler:
    mode: ${cluster.mode}
    ranks_per_node: ${cluster.ranks_per_node}
    num_nodes: 1 # we only have 1 gpu so this has to be 1, maybe 2 in future.
    slurm:
      mem_gb: ${cluster.mem_gb}
      timeout_hr: ${cluster.timeout_hr}
      partition: ${cluster.partition}
      cpus_per_task: ${cluster.cpus_per_task}
      additional_parameters: ${cluster.additional_parameters}
  debug: ${cluster.debug}
  run_dir: ${cluster.run_dir}
  run_name: escaip_oc22_M1_direct_24n
  logger: null
  #  _target_: fairchem.core.common.logger.WandBSingletonLogger.init_wandb
  #  _partial_: true
  #  entity: khalil
  #  project: oc22_M1

# cpu_graph: True
max_neighbors: 30
max_neighbors_pad_size: 45
cutoff_radius: 6
epochs: 3
steps: null # 80B atoms, 128 ranks, max atoms 350 (mean atoms 300)
max_atoms: 300
normalizer_rmsd: 1.423
direct_forces_coef: 60
oc22_energy_coef: 20
regress_stress: False
direct_forces: True
use_pbc: True
#oc22_forces_key: forces
dataset_list: ["oc22"]

train_dataset:
  _target_: fairchem.core.datasets.mt_concat_dataset.create_concat_dataset
  dataset_configs:
    # omc: ${dataset.omc_train}
    # omol: ${dataset.omol_train}
    # odac: ${dataset.odac_train}
    # omat: ${dataset.omat_train}
    oc22: ${dataset.oc22_train}
  combined_dataset_config:
    sampling:
      type: explicit
      ratios:
        # omol.train: 4.0
        oc22.train: 1.0
        # omc.train: 2.0
        # odac.train: 1.0
        # omat.train: 2.0

val_dataset:
  _target_: fairchem.core.datasets.mt_concat_dataset.create_concat_dataset
  dataset_configs:
    # omc: ${dataset.omc_val}
    # omol: ${dataset.omol_val}
    # odac: ${dataset.odac_val}
    # omat: ${dataset.omat_val}
    oc22: ${dataset.oc22_val}
  combined_dataset_config: { sampling: {type: temperature, temperature: 1.0} }

train_dataloader:
  _target_: fairchem.core.components.common.dataloader_builder.get_dataloader
  dataset: ${train_dataset}
  batch_sampler_fn:
    _target_: fairchem.core.datasets.samplers.max_atom_distributed_sampler.MaxAtomDistributedBatchSampler
    _partial_: True
    max_atoms: ${max_atoms}
    shuffle: True
    seed: 0
  num_workers: ${cluster.dataloader_workers}
  collate_fn:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.mt_collater_adapter
    tasks: ${tasks}

eval_dataloader:
  _target_: fairchem.core.components.common.dataloader_builder.get_dataloader
  dataset: ${val_dataset}
  batch_sampler_fn:
    _target_: fairchem.core.datasets.samplers.max_atom_distributed_sampler.MaxAtomDistributedBatchSampler
    _partial_: True
    max_atoms: ${max_atoms}
    shuffle: False
    seed: 0
  num_workers: ${cluster.dataloader_workers}
  collate_fn:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.mt_collater_adapter
    tasks: ${tasks}

heads:
  oc22_energy:
    module: fairchem.core.models.escaip.EScAIP.EScAIPEnergyHead
  oc22_forces:
    module: fairchem.core.models.escaip.EScAIP.EScAIPDirectForceHead

runner:
  _target_: fairchem.core.components.train.train_runner.TrainEvalRunner
  train_dataloader: ${train_dataloader}
  eval_dataloader: ${eval_dataloader}
  train_eval_unit:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.MLIPTrainEvalUnit
    job_config: ${job}
    tasks: ${tasks}
    model:
      _target_: fairchem.core.models.base.HydraModel
      backbone: ${backbone}
      heads: ${heads}
    optimizer_fn:
      _target_: torch.optim.AdamW
      _partial_: true
      lr: 8e-4
      weight_decay: 1e-3
    cosine_lr_scheduler_fn:
      _target_: fairchem.core.units.mlip_unit.mlip_unit._get_consine_lr_scheduler
      _partial_: true
      warmup_factor: 0.2
      warmup_epochs: 0.2
      lr_min_factor: 0.01
      epochs: ${epochs}
      steps: ${steps}
    print_every: 10
    clip_grad_norm: 100
  max_epochs: ${epochs}
  max_steps: ${steps}
  evaluate_every_n_steps: 5000
  callbacks:
    - _target_: fairchem.core.common.profiler_utils.ProfilerCallback
      job_config: ${job}
    - _target_: fairchem.core.components.train.train_runner.TrainCheckpointCallback
      checkpoint_every_n_steps: 2000
      max_saved_checkpoints: 5
```
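To rule out the converted files themselves, I also check that each .db opens and actually contains rows before pointing the trainer at it. This sketch assumes the databases use ASE's default SQLite backend (whose row table is named `systems`); the `count_systems` helper is mine, not part of fairchem:

```python
# Sketch: verify an ase.db file (default SQLite backend) opens and has rows.
# Pure standard library, no ase or fairchem import needed.
import sqlite3

def count_systems(db_path: str) -> int:
    """Return the number of rows in the `systems` table of an ASE SQLite .db file."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute("SELECT COUNT(*) FROM systems").fetchone()[0]
    finally:
        con.close()
```

If this raises (missing table, unreadable file), the problem is in my conversion; if it returns a sensible count, the hang is more likely in the dataloader or distributed setup.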