-
Notifications
You must be signed in to change notification settings - Fork 37
avoid stream synchronization in manager #216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1f93550
to
77b6330
Compare
@@ -323,6 +326,7 @@ def allreduce( | |||
return fut | |||
|
|||
self.wait_quorum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this wait_quorum()
if self.num_participants()
already calls it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think good to leave it here for readability
Summary: - for diloco the model parameters, in the way they are saved by the test can be different across replicas - only the global parameters can be the same - fix the test to validate the global parameters are the same instead of the local model parameters Test Plan: ``` $ pytest -v ./torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_diloco_recovery_0 ```
Summary: - use a recovery event to synchronize on instead of the recovery stream - fix calling `work.wait()` in manager - avoid calling `quorum.wait` inside of a callback Test Plan: <img width="1483" alt="Screenshot 2025-06-16 at 1 08 04 AM" src="https://github.com/user-attachments/assets/47de0853-1878-40b9-ae8e-f9fc55972917" />
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Summary:
work.wait()
in managerquorum.wait
inside of a callbackTest Plan:

Stack created with Sapling. Best reviewed with ReviewStack.