Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix checkpoint loading and resume for fsdp #169

Merged
merged 2 commits into from
Dec 20, 2023
Merged

Conversation

achalddave
Copy link
Collaborator

@achalddave achalddave commented Dec 19, 2023

This partially reverts #138. If we use FSDP, we should call load_model directly, instead of model.module.load_model. I tested training and resuming a model using main.py and it works with this commit.

Note that the tests in #148 pass despite this bug because of another bug that @kernelmachine discovered incidentally: Our tests are not properly using distributed functionality, because

def is_using_distributed():
if "WORLD_SIZE" in os.environ:
return int(os.environ["WORLD_SIZE"]) > 1
if "SLURM_NTASKS" in os.environ:
return int(os.environ["SLURM_NTASKS"]) > 1
return False
only checks if world_size > 1 (instead of >= 1). This will be tracked in a follow up issue.

@achalddave achalddave requested a review from Vaishaal December 19, 2023 21:45
@sagadre
Copy link
Collaborator

sagadre commented Dec 20, 2023

can we actually get a version bump on this?

@sagadre sagadre merged commit a79aa35 into main Dec 20, 2023
2 checks passed
@sagadre sagadre deleted the fix-checkpoint-load branch December 20, 2023 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants