Distinguishing between replica rank and group rank across the project (#181) #187
Distinguishes between replica rank and group rank across the project, as per #181.
Kept the API the same everywhere except for the DistributedSampler. I found the terminology especially confusing there, and since it is not used in torchTitan, I changed the API.
All tests have passed. However, I am getting the following linter error. It doesn't seem like a problem with my code.
@warren# lintrunner -a
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
>>> General linter failure:
Advice (pyre) command-failed
Failed due to JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Successfully applied all patches.
(/srv/apps/danny/miniconda3/envs/warren/torchtitan) root@sz-k8s-master:/srv/apps/warren/fork/torchft#
Changes
Globally
Changed recovery_src_rank, recovery_dst_ranks, max_rank to _replica_rank.
In Manager.py:
In Data.py
Changed API along with internal reference:
In Rust Code
ShouldCommitRequest: rank -> group_rank
self._client._quorum: rank -> group_rank
Did not change CheckpointMetadataRequest's proto interface since it seems less tied to the specific group_rank/replica rank distinction made here.
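For readers new to the terminology, here is a minimal sketch of the distinction, assuming replica rank identifies a replica group and group rank identifies a worker inside that group. The RankInfo class, the global_rank helper, and the group-major layout are hypothetical illustrations and are not taken from the torchft code in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    """Hypothetical container for illustration only; not a torchft type."""

    replica_rank: int  # which replica group this worker belongs to
    group_rank: int  # position of the worker inside its replica group


def global_rank(info: RankInfo, group_world_size: int) -> int:
    """Flatten the two ranks into one index.

    Assumed layout: all workers of replica group 0 come first,
    then all workers of replica group 1, and so on.
    """
    if info.replica_rank < 0:
        raise ValueError("replica_rank must be non-negative")
    if not 0 <= info.group_rank < group_world_size:
        raise ValueError("group_rank must be in [0, group_world_size)")
    return info.replica_rank * group_world_size + info.group_rank


# Two workers with the same group_rank but different replica_rank
# end up with different flattened indices.
a = global_rank(RankInfo(replica_rank=0, group_rank=2), group_world_size=4)
b = global_rank(RankInfo(replica_rank=1, group_rank=2), group_world_size=4)
```

Keeping the two values in separately named fields avoids the ambiguity of a bare "rank" argument, which is the motivation for the renames listed above.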