Description
🐛 Describe the bug
Several distributed pipelining tests fail on XPU with `AssertionError: Tensor-likes are not close!`: the computed results differ from the reference values by more than the default tolerances used by torch.testing.assert_close.

Failing test cases:
../../../../test/distributed/pipelining/test_backward.py | test_stage_backward_weight_multiple_iters_xpu
../../../../test/distributed/pipelining/test_backward.py | test_stage_backward_weight_xpu
../../../../test/distributed/pipelining/test_backward.py | test_stage_backward_xpu
../../../../test/distributed/pipelining/test_microbatch.py | test_chunk_spec_xpu
Log (from test_stage_backward_weight_multiple_iters_xpu):
____________________________________________________________________________ StageBackwardTestsXPU.test_stage_backward_weight_multiple_iters_xpu _____________________________________________________________________________
Traceback (most recent call last):
File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 3142, in wrapper
method(*args, **kwargs)
File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
result = test(self, **param_kwargs)
File "/home/sdp/penghuic/pytorch/test/distributed/pipelining/test_backward.py", line 181, in test_stage_backward_weight_multiple_iters
torch.testing.assert_close(p.grad, ref_p.grad)
File "/home/sdp/penghuic/pytorch/torch/testing/_comparison.py", line 1587, in assert_close
raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 3532 / 262144 (1.3%)
Greatest absolute difference: 4.38690185546875e-05 at index (355, 35) (up to 1e-05 allowed)
Greatest relative difference: 0.02471482940018177 at index (474, 307) (up to 1.3e-06 allowed)
To execute this test, run the following from the base repo dir:
python test/distributed/pipelining/test_backward.py StageBackwardTestsXPU.test_stage_backward_weight_multiple_iters_xpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
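For context, a minimal sketch of how torch.testing.assert_close applies its default float32 tolerances (rtol=1.3e-6, atol=1e-5), which the XPU results exceed in the log above. The tensor values below are made up; only the shape and the size of the deviation mirror the reported mismatch:

```python
import torch

# Hypothetical tensors with the same element count as the failing comparison (512*512 = 262144).
ref = torch.full((512, 512), 1.0)
out = ref.clone()
out[355, 35] += 4.4e-5  # comparable to the greatest absolute difference reported above

try:
    # Default float32 tolerances (rtol=1.3e-6, atol=1e-5) reject this deviation.
    torch.testing.assert_close(out, ref)
except AssertionError as err:
    print(err)

# Diagnostic only, not a proposed fix: loosening the tolerances makes the check pass.
torch.testing.assert_close(out, ref, atol=1e-4, rtol=1e-3)
```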