Skip to content

Distinguishing between replica rank and group rank across the project (#181) #187

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 6, 2025

Conversation

WarrenZhu050413
Copy link
Contributor

Distinguishes between replica rank and group rank across the projects as per #181.

Kept the API the same for all except for the DistributedSampler. I found the terminology especially confusing there. And since it is not used in torchTitan I changed the API.

All tests have passed. However, I am getting the following linter error. It doesn't seem like a problem with my code.

@warren# lintrunner -a
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.

>>> General linter failure:

  Advice (pyre) command-failed
    Failed due to JSONDecodeError:
    Expecting value: line 1 column 1 (char 0)
Successfully applied all patches.
(/srv/apps/danny/miniconda3/envs/warren/torchtitan) root@sz-k8s-master:/srv/apps/warren/fork/torchft# 

Changes

Globally

Changed recovery_src_rank, recovery_dst_ranks, max_rank to _replica_rank.

In Manager.py:

rank -> group_rank
self._rank -> self._group_rank
self._participating_rank -> self._participating_replica_rank
self._participating_world_size -> self._participating_replica_world_size 
max_world_size -> max_replica_world_size world_size -> group_world_size 
self._world_size_mode -> self._replica_world_size_mode

In Data.py

Changed API along with internal reference:

rank -> group_rank
replica_group -> replica_rank

In Rust Code

ShouldCommitRequest: rank -> group_rank
self._client._quorum: rank -> group_rank
Did not change CheckpointMetadataRequest's proto interface since it should be it seems less tied to the specific group_rank/replica rank distinction made here.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 5, 2025
@d4l3k
Copy link
Member

d4l3k commented May 5, 2025

@WarrenZhu050413 does running pyre manually work?

Did you run lintrunner init?

Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@WarrenZhu050413
Copy link
Contributor Author

@WarrenZhu050413 does running pyre manually work?

Did you run lintrunner init?

I did do both.

Thanks for helping out with the Lint!

@d4l3k d4l3k merged commit 652a009 into pytorch:main May 6, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants