Skip dist all2all related case #1675


Closed · Chao1Han wants to merge 1 commit from the xccl/skip_all2all branch

Conversation

Chao1Han (Contributor)

No description provided.

Copilot AI review requested due to automatic review settings (May 16, 2025 07:36)

Copilot AI left a comment


Pull Request Overview

This PR disables several distributed all-to-all tests by commenting them out, effectively skipping them during execution; a rough sketch of the decorator-based alternative follows the list below.

  • Commented out the test for alltoall operations with xpufree race
  • Commented out multiple tests for all-to-all single operations and variants
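For context, here is a rough approximation of what a conditional-skip decorator in the spirit of skip_but_pass_in_sandcastle_if provides. This sketch is an assumption, not PyTorch's actual implementation (the stdlib equivalent is unittest.skipIf): when the condition holds, the test raises a skip instead of being deleted from the file, so the runner still reports it.

import functools
import unittest

# Hypothetical stand-in for a conditional skip helper: when `condition`
# is true, the wrapped test raises unittest.SkipTest instead of running.
def skip_if(condition, reason):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator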

Comment on lines +197 to +210
# @requires_xccl()
# @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
# def test_alltoall_ops_with_xpufree_race(self):
#     pg = self.pg
#     opts = c10d.AllToAllOptions()
#     local_device = f"xpu:{self.rank_to_GPU[self.rank][0]}"
#     torch.xpu.set_device(local_device)
#     input = torch.rand(1000, 1000, device=local_device)
#     output = torch.rand(1000, 1000, device=local_device)
#     race_tensors = []
#     # create some tensors to race with alltoall collective
#     for _ in range(10):
#         tmp = []
#         for i in range(5):

Copilot AI commented on May 16, 2025


[nitpick] Disabling 'test_alltoall_ops_with_xpufree_race' by commenting it out may lead to code clutter; consider using a skip decorator, or removing the test if it is no longer required.

Suggested change
# @requires_xccl()
# @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
# def test_alltoall_ops_with_xpufree_race(self):
#     pg = self.pg
#     opts = c10d.AllToAllOptions()
#     local_device = f"xpu:{self.rank_to_GPU[self.rank][0]}"
#     torch.xpu.set_device(local_device)
#     input = torch.rand(1000, 1000, device=local_device)
#     output = torch.rand(1000, 1000, device=local_device)
#     race_tensors = []
#     # create some tensors to race with alltoall collective
#     for _ in range(10):
#         tmp = []
#         for i in range(5):
@unittest.skip("Skipping test_alltoall_ops_with_xpufree_race due to known issues with XPU free race conditions.")
@requires_xccl()
@skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
def test_alltoall_ops_with_xpufree_race(self):
    pg = self.pg
    opts = c10d.AllToAllOptions()
    local_device = f"xpu:{self.rank_to_GPU[self.rank][0]}"
    torch.xpu.set_device(local_device)
    input = torch.rand(1000, 1000, device=local_device)
    output = torch.rand(1000, 1000, device=local_device)
    race_tensors = []
    # create some tensors to race with alltoall collective
    for _ in range(10):
        tmp = []
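As a standalone illustration of the suggested pattern (hypothetical test names, not taken from this PR): unittest.skip marks a test as skipped and keeps it visible in the runner's report, whereas a commented-out test silently disappears.

import unittest

class AllToAllTests(unittest.TestCase):
    @unittest.skip("known XPU free race; tracked separately")
    def test_alltoall_ops_with_xpufree_race(self):
        self.fail("never executes; the runner reports this test as skipped")

    def test_placeholder(self):
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()  # reports: OK (skipped=1)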


Comment on lines +843 to +858
# @requires_xccl()
# @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
# def test_all_to_all_single(self):
#     device = self.rank_to_GPU[self.rank][0]
#     row = self.world_size * (self.rank + 1) * (self.world_size + 1) / 2
#     x = torch.ones(int(row), 5, device=device) * (self.rank + 1)
#     x.requires_grad = True
#     y = torch.empty_like(x)
#     split_sizes = [(i + 1) * (self.rank + 1) for i in range(self.world_size)]
#     y = torch.distributed.nn.all_to_all_single(
#         y, x, output_split_sizes=split_sizes, input_split_sizes=split_sizes
#     )
#     expected = []
#     for idx, tensor in enumerate(torch.split(x, split_sizes)):
#         expected.append(torch.full_like(tensor, (idx + 1)))
#     expected = torch.cat(expected)

Copilot AI commented on May 16, 2025


[nitpick] The block of tests for various all-to-all operations has been commented out rather than formally skipped; consider refactoring with proper skip annotations or removing this code to improve maintainability.

Suggested change
# @requires_xccl()
# @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
# def test_all_to_all_single(self):
#     device = self.rank_to_GPU[self.rank][0]
#     row = self.world_size * (self.rank + 1) * (self.world_size + 1) / 2
#     x = torch.ones(int(row), 5, device=device) * (self.rank + 1)
#     x.requires_grad = True
#     y = torch.empty_like(x)
#     split_sizes = [(i + 1) * (self.rank + 1) for i in range(self.world_size)]
#     y = torch.distributed.nn.all_to_all_single(
#         y, x, output_split_sizes=split_sizes, input_split_sizes=split_sizes
#     )
#     expected = []
#     for idx, tensor in enumerate(torch.split(x, split_sizes)):
#         expected.append(torch.full_like(tensor, (idx + 1)))
#     expected = torch.cat(expected)
@requires_xccl()
@skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "XCCL test requires 2+ GPUs")
def test_all_to_all_single(self):
    device = self.rank_to_GPU[self.rank][0]
    row = self.world_size * (self.rank + 1) * (self.world_size + 1) / 2
    x = torch.ones(int(row), 5, device=device) * (self.rank + 1)
    x.requires_grad = True
    y = torch.empty_like(x)
    split_sizes = [(i + 1) * (self.rank + 1) for i in range(self.world_size)]
    y = torch.distributed.nn.all_to_all_single(
        y, x, output_split_sizes=split_sizes, input_split_sizes=split_sizes
    )
    expected = []
    for idx, tensor in enumerate(torch.split(x, split_sizes)):
        expected.append(torch.full_like(tensor, (idx + 1)))
    expected = torch.cat(expected)
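For what it's worth, the split-size arithmetic in the disabled test can be checked without any GPUs: each rank's row count world_size * (rank + 1) * (world_size + 1) / 2 equals the sum of its split sizes [(i + 1) * (rank + 1) for i in range(world_size)], because sum(1..world_size) = world_size * (world_size + 1) / 2. A minimal CPU-only check (an illustration, not part of the PR):

# Verify that each rank's row count equals the sum of its per-destination
# split sizes, for a few world sizes.
def split_sizes_for(rank, world_size):
    return [(i + 1) * (rank + 1) for i in range(world_size)]

for world_size in (2, 4, 8):
    for rank in range(world_size):
        row = world_size * (rank + 1) * (world_size + 1) / 2
        assert int(row) == sum(split_sizes_for(rank, world_size))
print("row count matches sum(split_sizes) for every rank")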


@Chao1Han Chao1Han closed this May 22, 2025
@Chao1Han Chao1Han deleted the xccl/skip_all2all branch May 22, 2025 08:17