Fix checkpoint loading and resume for fsdp #169

achalddave · 2023-12-19T21:45:32Z

This partially reverts #138. If we use FSDP, we should call load_model directly, instead of model.module.load_model. I tested training and resuming a model using main.py and it works with this commit.

Note that the tests in #148 pass despite this bug because of another bug that @kernelmachine discovered incidentally: our tests do not properly exercise distributed functionality, because

open_lm/open_lm/distributed.py

Lines 20 to 25 in 6570b81

    
    def is_using_distributed():
        if "WORLD_SIZE" in os.environ:
            return int(os.environ["WORLD_SIZE"]) > 1
        if "SLURM_NTASKS" in os.environ:
            return int(os.environ["SLURM_NTASKS"]) > 1
        return False

only checks whether the world size is > 1 (instead of >= 1), so a single-process test run is never treated as distributed. This will be tracked in a follow-up issue.
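To make the effect of that check concrete, here is a minimal sketch that reproduces the logic of `is_using_distributed` with an injectable environment mapping. The `env` parameter is an assumption added purely for illustration so the function can be called without touching `os.environ`; it is not part of open_lm.

```python
import os


def is_using_distributed(env=None):
    # Same checks as the snippet quoted from distributed.py.
    # `env` is a hypothetical parameter, added only for this illustration.
    env = os.environ if env is None else env
    if "WORLD_SIZE" in env:
        return int(env["WORLD_SIZE"]) > 1
    if "SLURM_NTASKS" in env:
        return int(env["SLURM_NTASKS"]) > 1
    return False


# A single-process run, as in a typical test setup: reported as not distributed.
single = is_using_distributed({"WORLD_SIZE": "1"})
# Two or more processes: reported as distributed.
multi = is_using_distributed({"WORLD_SIZE": "2"})
# The SLURM variable is treated the same way.
slurm_single = is_using_distributed({"SLURM_NTASKS": "1"})
# Neither variable set: not distributed.
unset = is_using_distributed({})

print(single, multi, slurm_single, unset)  # False True False False
```

With a world size of 1 the function returns `False`, which is presumably why single-process tests skip the distributed code path where the loading bug lives.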

sagadre · 2023-12-20T16:20:20Z

can we actually get a version bump on this?

Fix checkpoint loading / resume for fsdp

b1b13fa

achalddave requested a review from Vaishaal December 19, 2023 21:45

sagadre approved these changes Dec 20, 2023

Update version

32dd5c8

sagadre merged commit a79aa35 into main Dec 20, 2023
2 checks passed

sagadre deleted the fix-checkpoint-load branch December 20, 2023 19:32
