[Platform][Dist] Make torch distributed process group extendable #18763
Conversation
Signed-off-by: Mengqing Cao <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
vllm/distributed/utils.py
pg: ProcessGroup = ProcessGroup(
    prefix_store,
    group_rank,
    group_size,
)
Why not pass `ProcessGroup` itself into `stateless_init_device_torch_dist_pg`?
This allows the platform to customize the specific details when creating the `pg`. For example, in vllm-ascend, due to torch version (2.5) compatibility issues, it is necessary to pass in the arg `options` when creating the `pg`.
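For illustration, a rough sketch of the kind of per-platform hook being described here; the hook name comes from this thread, but the signature and the platform class are assumptions, not the merged API:

```python
from torch.distributed import PrefixStore, ProcessGroup


class AscendPlatform:  # hypothetical, for illustration only
    @classmethod
    def stateless_init_device_torch_dist_pg(cls, backend: str,
                                            prefix_store: PrefixStore,
                                            group_rank: int,
                                            group_size: int) -> ProcessGroup:
        # torch 2.5 still expects `options` when constructing a ProcessGroup,
        # so the platform builds and returns the complete pg itself.
        options = ProcessGroup.Options(backend=backend)
        return ProcessGroup(prefix_store, group_rank, group_size, options)
```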
Perhaps we can factor this out into another method like `init_process_group`.
Also, is gloo compatible with all the backends?
> Perhaps we can factor this out into another method like `init_process_group`.

Do you mean refactoring this compatibility handling in `init_process_group`?
> Also, is gloo compatible with all the backends?

Actually, gloo is a communication backend for the CPU, so I think it won't conflict with devices like GPU and NPU. I keep it here to make sure all platforms can create a process group with the gloo backend. But you remind me that I've overlooked the torch compatibility here.
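For reference, a hedged sketch of how a CPU/gloo backend can be attached to a statelessly created ProcessGroup; these are private torch APIs, and the exact usage in vllm/distributed/utils.py may differ:

```python
import datetime

import torch
from torch.distributed import PrefixStore, ProcessGroup
from torch.distributed.distributed_c10d import ProcessGroupGloo


def attach_gloo_backend(pg: ProcessGroup, prefix_store: PrefixStore,
                        group_rank: int, group_size: int,
                        timeout: datetime.timedelta) -> None:
    """Register a gloo backend for CPU tensors on an existing ProcessGroup."""
    backend_class = ProcessGroupGloo(prefix_store, group_rank, group_size,
                                     timeout=timeout)
    # gloo handles CPU-side communication regardless of the accelerator in use
    pg._register_backend(torch.device("cpu"),
                         ProcessGroup.BackendType.GLOO,
                         backend_class)
```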
> Do you mean refactoring this compatibility handling in `init_process_group`?

Yes. It would also keep other backends compatible with gloo in case they need custom initialization of the process group.
Got it, thanks! I'll fix it tomorrow :-)
> Yes. It would also keep other backends compatible with gloo in case they need custom initialization of the process group.

This has been done now, PTAL.
Signed-off-by: Mengqing Cao <[email protected]>
if is_torch_equal_or_newer("2.6"):
    pg = ProcessGroup(
        prefix_store,
        group_rank,
        group_size,
    )
else:
    # torch < 2.6 still requires the `options` argument when constructing
    # a ProcessGroup
    options = ProcessGroup.Options(backend=backend)
    pg = ProcessGroup(
        prefix_store,
        group_rank,
        group_size,
        options,
    )
Is this code not applicable to all platforms?
It is, but there is still a difference at `ProcessGroup._set_default_backend`. Thus I think it's cleaner to create and return a complete process group. This also makes the extracted function more self-contained, rather than just doing some preparation for the process group.
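To make the difference concrete, a hedged sketch of the torch >= 2.6 step that follows the bare construction shown above (a private API, so the real code may differ):

```python
from torch.distributed import ProcessGroup


def set_default_backend(pg: ProcessGroup, backend: str) -> None:
    """torch >= 2.6: attach the default backend type after constructing pg."""
    backend_type = {
        "gloo": ProcessGroup.BackendType.GLOO,
        "nccl": ProcessGroup.BackendType.NCCL,
    }.get(backend, ProcessGroup.BackendType.CUSTOM)
    pg._set_default_backend(backend_type)
```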
Alright, thanks for the explanation!
Thanks for your review 👍
…m-project#18763)
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: amit <[email protected]>
part of #11162
Starting from torch 2.6.0, the arg `options` in `ProcessGroup.__init__` has been removed, and `ProcessGroup._set_default_backend` has been introduced. However, torch 2.5.1 is still maintained in vllm-ascend. Thus we keep compatibility with torch < 2.6 in this PR and make `stateless_init_torch_distributed_process_group` extendable by the platform module.

Main changes:
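A hedged sketch of the overall shape this gives the generic helper; apart from `stateless_init_device_torch_dist_pg`, which is discussed in the review, the signatures here are assumptions rather than the merged code:

```python
import datetime

from torch.distributed import PrefixStore, TCPStore

from vllm.platforms import current_platform


def stateless_init_torch_distributed_process_group(
        host: str, port: int, rank: int, world_size: int, backend: str):
    """Simplified sketch: build the store, then let the platform build the pg."""
    store = TCPStore(host, port, world_size, is_master=(rank == 0),
                     timeout=datetime.timedelta(seconds=300))
    prefix_store = PrefixStore("stateless/", store)
    # Device-specific ProcessGroup construction is delegated to the platform,
    # so out-of-tree platforms (e.g. vllm-ascend on torch 2.5) can adapt it.
    return current_platform.stateless_init_device_torch_dist_pg(
        backend, prefix_store, rank, world_size)
```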