Add test for checkpoint loading after save #145

Closed
achalddave opened this issue Dec 11, 2023 · 3 comments

@achalddave
Collaborator

We should add a test that:

  1. trains a (small) model for a couple of steps
  2. saves it to disk
  3. calls main() again with a path to the checkpoint on disk
  4. trains a few steps

The test should cover single-process, DDP, and FSDP training.
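The four steps above can be sketched as a save-then-resume test. This is only an illustration of the pattern: the `main()` below is a toy stand-in written for this example, not open_lm's entry point, and the `--steps`, `--checkpoint-out`, and `--resume` flags are hypothetical names. It covers the single-process case only; DDP and FSDP variants would additionally need a multi-process launch, which is not shown.

```python
import argparse
import json
import tempfile
from pathlib import Path


def main(argv):
    """Toy stand-in for a training entry point (NOT open_lm's real main)."""
    parser = argparse.ArgumentParser()
    # All three flags are hypothetical names chosen for this sketch.
    parser.add_argument("--steps", type=int, required=True)
    parser.add_argument("--checkpoint-out", type=Path, required=True)
    parser.add_argument("--resume", type=Path, default=None)
    args = parser.parse_args(argv)

    # Start fresh, or restore the weight and step counter from disk.
    state = {"weight": 0.0, "global_step": 0}
    if args.resume is not None:
        state = json.loads(args.resume.read_text())

    # "Train": one gradient-descent step on (w - 1)^2 per iteration.
    for _ in range(args.steps):
        grad = 2.0 * (state["weight"] - 1.0)
        state["weight"] -= 0.1 * grad
        state["global_step"] += 1

    args.checkpoint_out.write_text(json.dumps(state))
    return state


def test_resume_after_save():
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "first.json"
        second = Path(tmp) / "second.json"
        reference = Path(tmp) / "reference.json"

        # Steps 1-2: train for a couple of steps and save to disk.
        before = main(["--steps", "2", "--checkpoint-out", str(first)])
        assert first.exists()

        # Steps 3-4: call main() again with the checkpoint path, train more.
        after = main(
            ["--steps", "3", "--checkpoint-out", str(second), "--resume", str(first)]
        )

        # The resumed run continues the step count instead of restarting.
        assert after["global_step"] == before["global_step"] + 3

        # It should also match an uninterrupted run of the same total length.
        straight = main(["--steps", "5", "--checkpoint-out", str(reference)])
        assert abs(after["weight"] - straight["weight"]) < 1e-12
        return before, after
```

Comparing the resumed run against an uninterrupted run of the same length is a design choice in this sketch: it catches a checkpoint that loads without error but silently drops state, which a plain "does it crash" check would miss.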

@jmercat
Collaborator

jmercat commented Dec 11, 2023

I cannot assign this to myself, but I can do it.

@Vaishaal
Contributor

This is the error I am getting right now:


```
Traceback (most recent call last):
  File "/miniconda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 798, in <module>
    main(sys.argv[1:])
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 502, in main
    start_epoch, global_step = load_model(args, model)
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 111, in load_model
    model.module.load_state_dict(sd)
  File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transformer:
	Missing key(s) in state_dict: "_flat_param".
	Unexpected key(s) in state_dict: "tok_embeddings.weight", "norm.weight", "norm.bias", "output.weight".
```
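To make the error message easier to read, here is a small sketch of the key comparison that a strict state-dict load reports. The helper `diff_state_dict_keys` is a hypothetical name introduced for this illustration, not a PyTorch or open_lm function; the key names are copied from the traceback. `_flat_param` is the name PyTorch FSDP uses for its flattened parameter, so one plausible reading (not confirmed in this thread) is that a checkpoint with unflattened parameter names is being loaded into a module that only exposes the flattened one.

```python
def diff_state_dict_keys(module_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, in the style of a strict load.

    Hypothetical helper for illustration only.
    """
    module_keys = list(module_keys)
    checkpoint_keys = list(checkpoint_keys)
    module_set = set(module_keys)
    checkpoint_set = set(checkpoint_keys)
    # Keys the module wants but the checkpoint does not provide.
    missing = [k for k in module_keys if k not in checkpoint_set]
    # Keys the checkpoint provides but the module does not know about.
    unexpected = [k for k in checkpoint_keys if k not in module_set]
    return missing, unexpected


# Key names taken from the traceback above.
module_keys = ["_flat_param"]
checkpoint_keys = [
    "tok_embeddings.weight",
    "norm.weight",
    "norm.bias",
    "output.weight",
]

missing, unexpected = diff_state_dict_keys(module_keys, checkpoint_keys)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

Because the two key sets do not overlap at all, the mismatch looks like a difference in parameter layout rather than a few renamed layers.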

@achalddave
Collaborator Author

Closing this issue since we added a test in #148. @Vaishaal, feel free to open a new issue if you can reproduce the checkpoint loading bug you are seeing.
