[Platform][Dist] Make torch distributed process group extendable #18763

MengqingCao · 2025-05-27T12:51:00Z

Starting from torch 2.6.0, the options argument of ProcessGroup.__init__ has been removed, and ProcessGroup._set_default_backend has been introduced. However, torch 2.5.1 is still maintained in vllm-ascend. Thus this PR keeps compatibility with torch < 2.6 and makes stateless_init_torch_distributed_process_group extendable by the platform module.

Main changes:

Make the torch distributed process group extendable
Make process group initialization for the gloo backend compatible with torch < 2.6
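
To illustrate the kind of version gate these changes rely on, here is a minimal, stdlib-only sketch of comparing a torch version string against a threshold. The names release_tuple and version_at_least are hypothetical and are not vLLM's helper; the sketch also assumes that pre-release and local tags (such as "+cpu") can simply be ignored.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release segments, e.g. '2.5.1+cpu' -> (2, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def version_at_least(installed: str, target: str) -> bool:
    """Return True if `installed` is at or above `target`.

    Assumption of this sketch: anything after the numeric release
    (local tags, dev/alpha suffixes) is ignored.
    """
    a, b = release_tuple(installed), release_tuple(target)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.6' and '2.6.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

With a gate like this, version_at_least("2.5.1", "2.6") is false, so the torch < 2.6 code path that still passes options would be taken.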

Signed-off-by: Mengqing Cao <[email protected]>

github-actions · 2025-05-27T12:51:09Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

DarkLight1337 · 2025-05-27T14:22:28Z

vllm/distributed/utils.py

+        pg: ProcessGroup = ProcessGroup(
+            prefix_store,
+            group_rank,
+            group_size,
+        )


Why not pass ProcessGroup itself into stateless_init_device_torch_dist_pg?

This lets the platform customize the details of creating the pg. For example, in vllm-ascend, torch 2.5 compatibility requires passing the options argument when creating the pg.

Perhaps we can factor this out into another method like init_process_group

Also, is gloo compatible with all the backends?

Perhaps we can factor this out into another method like init_process_group

Do you mean refactoring this compatibility solution into init_process_group?

Also, is gloo compatible with all the backends?

Actually, gloo is a communication backend for CPU, so I don't think it conflicts with chips such as GPUs and NPUs. I keep it here to make sure all platforms can create a process group with the gloo backend. But you reminded me that I lost the torch compatibility here.

Do you mean refactoring this compatibility solution into init_process_group?

Yes. It would also make other backends compatible with gloo in case they need custom initialization of the process group.

Got it, thanks! I'll fix it tomorrow :-)

Yes. It would also make other backends compatible with gloo in case they need custom initialization of the process group.

This has been done now, PTAL
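
A rough sketch of the split discussed in this thread: gloo goes through one shared initializer, and any other backend is delegated to the platform. Only stateless_init_device_torch_dist_pg is a name taken from this thread; stateless_init_process_group, init_gloo_process_group, Platform, FakeNpuPlatform, and the dict return values are stand-ins assumed for illustration, and no torch objects are involved.

```python
from typing import Any, Dict, Type


def init_gloo_process_group(prefix_store: Any, group_rank: int,
                            group_size: int, timeout: float) -> Dict[str, Any]:
    """Hypothetical shared gloo path; returns a dict instead of a real ProcessGroup."""
    return {"backend": "gloo", "rank": group_rank, "size": group_size}


class Platform:
    """Stand-in for a platform interface that may override device backends."""

    @classmethod
    def stateless_init_device_torch_dist_pg(cls, backend: str, prefix_store: Any,
                                            group_rank: int, group_size: int,
                                            timeout: float) -> Dict[str, Any]:
        raise NotImplementedError(
            f"{cls.__name__} does not support backend {backend!r}")


class FakeNpuPlatform(Platform):
    """Illustrative platform that builds its own device process group."""

    @classmethod
    def stateless_init_device_torch_dist_pg(cls, backend, prefix_store,
                                            group_rank, group_size, timeout):
        # A real platform would apply its own torch-version handling here.
        return {"backend": backend, "rank": group_rank, "size": group_size}


def stateless_init_process_group(backend: str, prefix_store: Any,
                                 group_rank: int, group_size: int,
                                 timeout: float,
                                 platform: Type[Platform]) -> Dict[str, Any]:
    """Dispatch: gloo is handled in common code, everything else by the platform."""
    if backend == "gloo":
        return init_gloo_process_group(prefix_store, group_rank,
                                       group_size, timeout)
    return platform.stateless_init_device_torch_dist_pg(
        backend, prefix_store, group_rank, group_size, timeout)
```

Under this layout every platform gets gloo without extra work, while a platform such as the illustrative FakeNpuPlatform only overrides the device-specific path.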


DarkLight1337 · 2025-05-28T08:03:16Z

vllm/distributed/utils.py

+    if is_torch_equal_or_newer("2.6"):
+        pg = ProcessGroup(
+            prefix_store,
+            group_rank,
+            group_size,
+        )
+    else:
+        options = ProcessGroup.Options(backend=backend)
+        pg = ProcessGroup(
+            prefix_store,
+            group_rank,
+            group_size,
+            options,
+        )


Is this code not applicable to all platforms?

It is, but there is still a difference at ProcessGroup._set_default_backend. So I think it's cleaner to create and return a complete process group. This also makes the extracted function more complete, rather than only doing some preparation for the process group.
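
The design point in this reply can be sketched without torch: when one function returns a fully configured group, the version-dependent steps stay inside it and callers see the same result on either code path. FakeProcessGroup, build_complete_group, and the new_torch_api flag are hypothetical stand-ins, not torch or vLLM APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeProcessGroup:
    """Stand-in for a process group; it only records what was configured."""
    rank: int
    size: int
    default_backend: Optional[str] = None
    backends: Dict[str, str] = field(default_factory=dict)


def build_complete_group(rank: int, size: int, backend: str, device: str,
                         new_torch_api: bool) -> FakeProcessGroup:
    """Create, configure, and return a ready-to-use group in one place."""
    if new_torch_api:
        # Newer style: construct first, then set the default backend explicitly.
        pg = FakeProcessGroup(rank, size)
        pg.default_backend = backend
    else:
        # Older style: the backend is supplied at construction time.
        pg = FakeProcessGroup(rank, size, default_backend=backend)
    # Registering the backend for the device is common to both paths.
    pg.backends[device] = backend
    return pg
```

Both branches yield an equal FakeProcessGroup, which is the property a caller would rely on if the factory owns the whole setup.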

DarkLight1337

Alright, thanks for the explanation!

MengqingCao · 2025-05-28T08:27:14Z

Alright, thanks for the explanation!

Thanks for your review 👍


[Platform][Dist] Make torch distributed process group extandable

af5ff32

Signed-off-by: Mengqing Cao <[email protected]>

DarkLight1337 changed the title ~~[Platform][Dist] Make torch distributed process group extandable~~ [Platform][Dist] Make torch distributed process group extendable May 27, 2025

DarkLight1337 reviewed May 27, 2025


njhill requested a review from youkaichao May 27, 2025 20:25

make pg of gloo compatible with torch<2.6

c136cad

Signed-off-by: Mengqing Cao <[email protected]>

DarkLight1337 reviewed May 28, 2025


DarkLight1337 approved these changes May 28, 2025


DarkLight1337 enabled auto-merge (squash) May 28, 2025 08:21

github-actions bot added the ready label May 28, 2025

DarkLight1337 merged commit d781930 into vllm-project:main May 28, 2025
72 of 74 checks passed

MengqingCao deleted the dist branch May 29, 2025 01:05

MengqingCao mentioned this pull request May 29, 2025

[bugfix] some bugs maybe fail to run vllm-project/vllm-ascend#896

Open

amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025

[Platform][Dist] Make torch distributed process group extendable (vll…

641ef87

…m-project#18763) Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: amit <[email protected]>
