LocalSGD / DiLoCo support #39
Comments
One additional point here is when we allow rejoining/recovering. Our current implementation is quite rigid, but with LocalSGD we may want more control over when we detect failing workers, as well as when we allow them to recover, to avoid blocking.
For the quorum we have a few options:
Excited to see this land - have you guys been considering Async DiLoCo as well?
@dongreenberg I briefly read through the paper (https://arxiv.org/pdf/2401.09135) but don't have any plans to add it into torchft. We'd love to have it but just haven't had time to experiment with it.
How hard do you think it would be to implement? Are the async updates pretty incompatible with the current implementation?
This is a tracking issue for adding LocalSGD support to torchft. There's been interest in it, and it's something we'd like to support.
This should be fairly straightforward, as we can use the Manager + quorum in an outer loop and then allreduce a copy of the weights only periodically.
Something like:
DiLoCo should be a small modification of this algorithm: use a separate optimizer instead of just averaging the weights.
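A minimal sketch of the periodic averaging described above and of the DiLoCo variant, simulated in plain Python so it runs without torch or torchft. Everything here is an assumption for illustration: the workers are simulated in-process, `allreduce_mean` stands in for a real allreduce, each worker minimizes a toy quadratic loss, and the outer optimizer is plain SGD chosen for brevity. None of these names are torchft APIs.

```python
from typing import List

Params = List[float]


def allreduce_mean(replicas: List[Params]) -> Params:
    """Stand-in for an allreduce-average across workers (in-process here)."""
    n = len(replicas)
    return [sum(vals) / n for vals in zip(*replicas)]


def local_sgd_round(params: Params, targets: List[Params],
                    lr: float, sync_every: int) -> Params:
    """One LocalSGD round: local steps on every worker, then one average.

    Each simulated worker minimizes 0.5 * (w - target)^2 on its own target,
    so the gradient with respect to w is simply (w - target).
    """
    replicas = []
    for target in targets:
        local = list(params)  # every worker starts from the shared weights
        for _ in range(sync_every):
            local = [w - lr * (w - t) for w, t in zip(local, target)]
        replicas.append(local)
    # The only communication in the round happens here.
    return allreduce_mean(replicas)


def diloco_round(params: Params, targets: List[Params], lr: float,
                 sync_every: int, outer_lr: float) -> Params:
    """Same inner loop, but the averaged change feeds an outer optimizer.

    The pseudo-gradient is (old weights - averaged weights). The outer
    optimizer is plain SGD here, an assumption made to keep the sketch short.
    """
    averaged = local_sgd_round(params, targets, lr, sync_every)
    pseudo_grad = [old - new for old, new in zip(params, averaged)]
    return [old - outer_lr * g for old, g in zip(params, pseudo_grad)]
```

With `outer_lr=1.0` and plain SGD as the outer optimizer, `diloco_round` reduces to `local_sgd_round`, which is one way to see why DiLoCo is a small modification: only the step that consumes the averaged weights changes.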
For efficiency, we should probably use the DDP reducer on the parameters directly and copy the underlying Storage to make a backup copy.
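One plausible use of the backup copy is rolling back to the last synced weights when a sync round fails. A rough sketch of that idea follows, again in plain Python with a list standing in for parameter storage. `BackupParams`, `finish_round`, and the `sync_ok` flag are hypothetical names for illustration, not torchft or PyTorch APIs.

```python
from typing import List


class BackupParams:
    """Hypothetical helper that keeps a copy of the last synced weights."""

    def __init__(self, params: List[float]) -> None:
        self.params = params          # live weights, updated in place
        self._backup = list(params)   # independent copy, not an alias

    def save(self) -> None:
        """Overwrite the backup with the current live weights."""
        self._backup[:] = self.params

    def restore(self) -> None:
        """Overwrite the live weights with the backup, in place."""
        self.params[:] = self._backup


def finish_round(state: BackupParams, averaged: List[float],
                 sync_ok: bool) -> None:
    """Commit the averaged weights on success, otherwise roll back."""
    if sync_ok:
        state.params[:] = averaged
        state.save()
    else:
        # Discard local progress since the last successful sync.
        state.restore()
```

Copying in place (`[:]`) rather than rebinding keeps any other references to the live weights valid, which is the same reason a real implementation would copy into existing storage instead of allocating new tensors each round.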
References: