remove every_n_train_steps from ModelCheckpoint #414

sichu2023 · 2024-11-07T22:03:20Z

Summary

Drop every_n_train_steps from ModelCheckpoint in favor of val_check_interval on Trainer.

Details

In the current NeMo version, passing both every_n_train_steps to ModelCheckpoint and val_check_interval to Trainer saves the first checkpoint before validation has run. As a result, the val_loss embedded in the first checkpoint's name is incorrect.

This PR also switches the monitored metric from reduced_train_loss to val_loss.
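The ordering problem can be illustrated with a toy simulation. This is only a sketch of the event order described above, not NeMo or Lightning code: the `simulate` function, its parameters' semantics, and the made-up loss values are all assumptions for illustration. In particular, it assumes the step-triggered save always fires before validation on the same step, whereas the behavior reported here concerns the first checkpoint only.

```python
from typing import Optional


def simulate(
    max_steps: int,
    val_check_interval: int,
    every_n_train_steps: Optional[int] = None,
) -> list[tuple[int, Optional[float]]]:
    """Toy model of save/validate ordering (hypothetical, not library code).

    Returns a list of (step, val_loss_visible_at_save_time) pairs.
    """
    logged_val_loss: Optional[float] = None  # last val_loss that was logged
    checkpoints: list[tuple[int, Optional[float]]] = []
    for step in range(1, max_steps + 1):
        # Step-triggered save: assumed to fire before validation on this step.
        if every_n_train_steps and step % every_n_train_steps == 0:
            checkpoints.append((step, logged_val_loss))
        if step % val_check_interval == 0:
            logged_val_loss = 1.0 / step  # made-up validation loss
            # Validation-triggered save: used only when no step trigger is set.
            if not every_n_train_steps:
                checkpoints.append((step, logged_val_loss))
    return checkpoints


# Both options set: the first checkpoint is written before any val_loss exists.
print(simulate(4, 2, every_n_train_steps=2))
# Only val_check_interval set: each checkpoint sees the val_loss just computed.
print(simulate(4, 2))
```

In this model, the first call records the first checkpoint with no val_loss available, while the second records it with the loss computed at the same step.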

sichu2023 · 2024-11-07T22:37:41Z

Dropped, since this change is only compatible with a newer NeMo version. The stop-and-go tests error out as follows:

ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[LearningRateCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[GlobalStepStateCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[ConsumedSamplesCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[OptimizerStateCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[TrainInputCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[TrainOutputCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[TrainLossCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_train_val_init_consumed_samples - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency_with_uneven_validation_sizes[ValidInputCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency_with_uneven_validation_sizes[ValidOutputCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency_with_uneven_validation_sizes[ValidLossCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.

remove every_n_train_steps from ModelCheckpoint

b1abb50

sichu2023 requested review from jstjohn, malcolmgreaves, skothenhill-nv, farhadrgh, dorotat-nv and pstjohn as code owners November 7, 2024 22:03

sichu2023 mentioned this pull request Nov 7, 2024

Update NeMo/Megatron #302

Closed

sichu2023 marked this pull request as draft November 7, 2024 22:36

sichu2023 closed this Nov 7, 2024

pstjohn deleted the sichu/every_n_train_steps branch January 17, 2025 18:59
