Skip dist all2all related case #1675


Closed · Chao1Han wants to merge 1 commit from the xccl/skip_all2all branch

Conversation

Chao1Han (Contributor)

No description provided.

Copilot AI review requested due to automatic review settings (May 16, 2025 07:36)

Copilot AI left a comment


Pull Request Overview

This PR disables several distributed all-to-all tests by commenting them out, effectively skipping them during execution; a rough sketch of the decorator-based alternative follows the list below.

  • Commented out the test for alltoall operations with xpufree race
  • Commented out multiple tests for all-to-all single operations and variants
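For context, here is a rough approximation of what a conditional-skip decorator in the spirit of skip_but_pass_in_sandcastle_if provides. This sketch is an assumption, not PyTorch's actual implementation (the stdlib equivalent is unittest.skipIf): when the condition holds, the test raises a skip instead of being deleted from the file, so the runner still reports it.

import functools
import unittest

# Hypothetical stand-in for a conditional skip helper: when `condition`
# is true, the wrapped test raises unittest.SkipTest instead of running.
def skip_if(condition, reason):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator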

Comment on lines +197 to +210
# @requires_xccl()
# @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
# def test_alltoall_ops_with_xpufree_race(self):
#     pg = self.pg
#     opts = c10d.AllToAllOptions()
#     local_device = f"xpu:{self.rank_to_GPU[self.rank][0]}"
#     torch.xpu.set_device(local_device)
#     input = torch.rand(1000, 1000, device=local_device)
#     output = torch.rand(1000, 1000, device=local_device)
#     race_tensors = []
#     # create some tensors to race with alltoall collective
#     for _ in range(10):
#         tmp = []
#         for i in range(5):

Copilot AI commented on May 16, 2025


[nitpick] Disabling 'test_alltoall_ops_with_xpufree_race' by commenting it out may lead to code clutter; consider using a skip decorator, or removing the test if it is no longer required.

Suggested change
# @requires_xccl()
# @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
# def test_alltoall_ops_with_xpufree_race(self):
#     pg = self.pg
#     opts = c10d.AllToAllOptions()
#     local_device = f"xpu:{self.rank_to_GPU[self.rank][0]}"
#     torch.xpu.set_device(local_device)
#     input = torch.rand(1000, 1000, device=local_device)
#     output = torch.rand(1000, 1000, device=local_device)
#     race_tensors = []
#     # create some tensors to race with alltoall collective
#     for _ in range(10):
#         tmp = []
#         for i in range(5):
@unittest.skip("Skipping test_alltoall_ops_with_xpufree_race due to known issues with XPU free race conditions.")
@requires_xccl()
@skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
def test_alltoall_ops_with_xpufree_race(self):
    pg = self.pg
    opts = c10d.AllToAllOptions()
    local_device = f"xpu:{self.rank_to_GPU[self.rank][0]}"
    torch.xpu.set_device(local_device)
    input = torch.rand(1000, 1000, device=local_device)
    output = torch.rand(1000, 1000, device=local_device)
    race_tensors = []
    # create some tensors to race with alltoall collective
    for _ in range(10):
        tmp = []
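As a standalone illustration of the suggested pattern (hypothetical test names, not taken from this PR): unittest.skip marks a test as skipped and keeps it visible in the runner's report, whereas a commented-out test silently disappears.

import unittest

class AllToAllTests(unittest.TestCase):
    @unittest.skip("known XPU free race; tracked separately")
    def test_alltoall_ops_with_xpufree_race(self):
        self.fail("never executes; the runner reports this test as skipped")

    def test_placeholder(self):
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()  # reports: OK (skipped=1)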


Comment on lines +843 to +858
# @requires_xccl()
# @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
# def test_all_to_all_single(self):
#     device = self.rank_to_GPU[self.rank][0]
#     row = self.world_size * (self.rank + 1) * (self.world_size + 1) / 2
#     x = torch.ones(int(row), 5, device=device) * (self.rank + 1)
#     x.requires_grad = True
#     y = torch.empty_like(x)
#     split_sizes = [(i + 1) * (self.rank + 1) for i in range(self.world_size)]
#     y = torch.distributed.nn.all_to_all_single(
#         y, x, output_split_sizes=split_sizes, input_split_sizes=split_sizes
#     )
#     expected = []
#     for idx, tensor in enumerate(torch.split(x, split_sizes)):
#         expected.append(torch.full_like(tensor, (idx + 1)))
#     expected = torch.cat(expected)

Copilot AI commented on May 16, 2025


[nitpick] The block of tests for various all-to-all operations has been commented out rather than formally skipped; consider refactoring with proper skip annotations or removing this code to improve maintainability.

Suggested change
# @requires_xccl()
# @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
# def test_all_to_all_single(self):
#     device = self.rank_to_GPU[self.rank][0]
#     row = self.world_size * (self.rank + 1) * (self.world_size + 1) / 2
#     x = torch.ones(int(row), 5, device=device) * (self.rank + 1)
#     x.requires_grad = True
#     y = torch.empty_like(x)
#     split_sizes = [(i + 1) * (self.rank + 1) for i in range(self.world_size)]
#     y = torch.distributed.nn.all_to_all_single(
#         y, x, output_split_sizes=split_sizes, input_split_sizes=split_sizes
#     )
#     expected = []
#     for idx, tensor in enumerate(torch.split(x, split_sizes)):
#         expected.append(torch.full_like(tensor, (idx + 1)))
#     expected = torch.cat(expected)
@requires_xccl()
@skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
def test_all_to_all_single(self):
    device = self.rank_to_GPU[self.rank][0]
    row = self.world_size * (self.rank + 1) * (self.world_size + 1) / 2
    x = torch.ones(int(row), 5, device=device) * (self.rank + 1)
    x.requires_grad = True
    y = torch.empty_like(x)
    split_sizes = [(i + 1) * (self.rank + 1) for i in range(self.world_size)]
    y = torch.distributed.nn.all_to_all_single(
        y, x, output_split_sizes=split_sizes, input_split_sizes=split_sizes
    )
    expected = []
    for idx, tensor in enumerate(torch.split(x, split_sizes)):
        expected.append(torch.full_like(tensor, (idx + 1)))
    expected = torch.cat(expected)
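For what it's worth, the split-size arithmetic in the disabled test can be checked without any GPUs: each rank's row count world_size * (rank + 1) * (world_size + 1) / 2 equals the sum of its split sizes [(i + 1) * (rank + 1) for i in range(world_size)], because sum(1..world_size) = world_size * (world_size + 1) / 2. A minimal CPU-only check (an illustration, not part of the PR):

# Verify that each rank's row count equals the sum of its per-destination
# split sizes, for a few world sizes.
def split_sizes_for(rank, world_size):
    return [(i + 1) * (rank + 1) for i in range(world_size)]

for world_size in (2, 4, 8):
    for rank in range(world_size):
        row = world_size * (rank + 1) * (world_size + 1) / 2
        assert int(row) == sum(split_sizes_for(rank, world_size))
print("row count matches sum(split_sizes) for every rank")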


@Chao1Han Chao1Han closed this May 22, 2025
@Chao1Han Chao1Han deleted the xccl/skip_all2all branch May 22, 2025 08:17