remove every_n_train_steps from ModelCheckpoint #414

Closed
wants to merge 1 commit into from


sichu2023
Collaborator

Summary

Drop every_n_train_steps from all ModelCheckpoint configurations and rely on val_check_interval in Trainer instead.

Details

In the current NeMo version, passing both every_n_train_steps to ModelCheckpoint and val_check_interval to Trainer saves the first checkpoint before validation has run. As a result, the val_loss recorded in the first checkpoint's name is incorrect.

This change also monitors val_loss instead of reduced_train_loss.
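
The ordering problem described above can be illustrated with a toy loop. This is only a sketch and not NeMo or Lightning code: the `simulate` function, the hook order (step-triggered save before validation), and the loss values are all assumptions made for illustration.

```python
def simulate(num_steps, val_check_interval, every_n_train_steps=None):
    """Toy training loop (hypothetical, not the NeMo API).

    Returns a list of (step, val_loss_used_in_checkpoint_name) tuples.
    """
    metrics = {}  # latest logged metrics, as a checkpoint callback would see them
    saved = []
    for step in range(1, num_steps + 1):
        # Assumption: a step-triggered save fires at the end of the training
        # batch, i.e. before validation for the same step has run.
        if every_n_train_steps and step % every_n_train_steps == 0:
            saved.append((step, metrics.get("val_loss")))
        if step % val_check_interval == 0:
            metrics["val_loss"] = 1.0 / step  # made-up validation loss
            # Without every_n_train_steps, the save follows validation.
            if not every_n_train_steps:
                saved.append((step, metrics["val_loss"]))
    return saved


# Both options set: the first checkpoint has no val_loss yet, later ones are stale.
with_both = simulate(num_steps=4, val_check_interval=2, every_n_train_steps=2)

# Only val_check_interval set: each checkpoint name carries the fresh val_loss.
val_only = simulate(num_steps=4, val_check_interval=2)
```

In this sketch, `with_both` records a missing val_loss for the first checkpoint, while `val_only` always records the value from the validation run that just finished.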

@sichu2023
Collaborator Author

Dropped since it is only compatible with a newer NeMo version. The stop-and-go tests fail with:

ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[LearningRateCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[GlobalStepStateCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[ConsumedSamplesCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[OptimizerStateCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[TrainInputCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[TrainOutputCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[TrainLossCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_train_val_init_consumed_samples - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency_with_uneven_validation_sizes[ValidInputCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency_with_uneven_validation_sizes[ValidOutputCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.
ERROR sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency_with_uneven_validation_sizes[ValidLossCallback] - nemo.utils.exp_manager.NotFoundError: There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/tmpuyrh4jvz/TestESM2StopAndGo/checkpoints. Cannot resume.

@sichu2023 sichu2023 closed this Nov 7, 2024
@pstjohn pstjohn deleted the sichu/every_n_train_steps branch January 17, 2025 18:59