[distributed] Accuracy issue in _composable compile related UT #1668

Open
@PenghuiCheng

Description

🐛 Describe the bug

PyTorch: https://github.com/daisyden/pytorch/tree/distributed_2.8
Torch-xpu-ops: https://github.com/intel/torch-xpu-ops/tree/daisyden/distributed_2.8
oneAPI: 2025.1.1

Failing cases in test/distributed/_composable/test_replicate_with_compiler.py:
test_compile_backward_only
test_compile_bf16
test_compile_fp16
test_compile_gpu
test_compile_gpu_ac
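
For anyone trying to reproduce all five failures, here is a small sketch that builds one repro command per case, following the pattern of the repro line printed in the log. It assumes every listed case lives in the `ReplicateTest` class; the log only confirms that for `test_compile_backward_only`. The names `repro_commands`, `TEST_CLASS`, and `FAILING_CASES` are made up for illustration.

```python
# Hypothetical helper: prints one repro command per failing case.
TEST_FILE = "test/distributed/_composable/test_replicate_with_compiler.py"

# Assumption: all cases belong to ReplicateTest. The log confirms this
# only for test_compile_backward_only.
TEST_CLASS = "ReplicateTest"

FAILING_CASES = [
    "test_compile_backward_only",
    "test_compile_bf16",
    "test_compile_fp16",
    "test_compile_gpu",
    "test_compile_gpu_ac",
]


def repro_commands(cases=FAILING_CASES):
    """Return the command lines, to be run from the base repo dir."""
    return [f"python {TEST_FILE} {TEST_CLASS}.{name}" for name in cases]


if __name__ == "__main__":
    for cmd in repro_commands():
        print(cmd)
```

If a case turns out to live in a different test class, only `TEST_CLASS` (or a per-case mapping) would need to change.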

logs:
Traceback (most recent call last):
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 637, in wrapper
self._join_processes(fn)
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 877, in _join_processes
self._check_return_codes(elapsed_time)
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 926, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 766, in run_test
getattr(self, test_name)()
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 639, in wrapper
fn()
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 3155, in wrapper
method(*args, **kwargs)
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 209, in wrapper
return func(*args, **kwargs)
File "/home/sdp/penghuic/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 250, in test_compile_backward_only
self._test_compile(no_sync=False, no_compile_forward=True, device=device_type)
File "/home/sdp/penghuic/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 165, in _test_compile
self.assertEqual(p1.grad, p2.grad)
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 4103, in assertEqual
raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 1925 / 4000000 (0.0%)
Greatest absolute difference: 0.0006995201110839844 at index (1882, 1160) (up to 1e-05 allowed)
Greatest relative difference: 0.00012343529670033604 at index (1882, 1504) (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
python test/distributed/_composable/test_replicate_with_compiler.py ReplicateTest.test_compile_backward_only

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 766, in run_test
getattr(self, test_name)()
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 639, in wrapper
fn()
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 3155, in wrapper
method(*args, **kwargs)
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 209, in wrapper
return func(*args, **kwargs)
File "/home/sdp/penghuic/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 250, in test_compile_backward_only
self._test_compile(no_sync=False, no_compile_forward=True, device=device_type)
File "/home/sdp/penghuic/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 165, in _test_compile
self.assertEqual(p1.grad, p2.grad)
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 4103, in assertEqual
raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 1925 / 4000000 (0.0%)
Greatest absolute difference: 0.0006995201110839844 at index (1882, 1160) (up to 1e-05 allowed)
Greatest relative difference: 0.00012343529670033604 at index (1882, 1504) (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
python test/distributed/_composable/test_replicate_with_compiler.py ReplicateTest.test_compile_backward_only

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
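
To help read the numbers in the assertion message, here is a minimal pure-Python sketch of the closeness criterion documented for `torch.testing.assert_close`, `|actual - expected| <= atol + rtol * |expected|`, using the tolerances printed in the log (`atol=1e-05`, `rtol=1.3e-06`). The helper names `is_close` and `min_expected_magnitude` are hypothetical and not part of PyTorch; the real check runs elementwise on tensors.

```python
# Tolerances as printed in the assertion message above.
ATOL = 1e-5
RTOL = 1.3e-6


def is_close(actual, expected, rtol=RTOL, atol=ATOL):
    """Scalar version of |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def min_expected_magnitude(abs_diff, rtol=RTOL, atol=ATOL):
    """Smallest |expected| for which abs_diff would still pass the check."""
    return max(0.0, (abs_diff - atol) / rtol)


# Values copied from the log.
greatest_abs_diff = 0.0006995201110839844
mismatched, total = 1925, 4_000_000

# The log rounds the mismatch ratio to "0.0%"; the unrounded share is ~0.048%.
mismatch_percent = 100.0 * mismatched / total

if __name__ == "__main__":
    print(f"mismatched elements: {mismatch_percent:.4f}%")
    print(f"|expected| needed to tolerate the largest abs diff: "
          f"{min_expected_magnitude(greatest_abs_diff):.1f}")
```

The sketch only illustrates how the two tolerances combine. It says nothing about whether the mismatch comes from the compiled backward pass, the collective, or a kernel on XPU.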

Versions

env.txt

Labels

bug (Something isn't working), module: distributed (For distributed feature issue)