Description
🐛 Describe the bug
test.distributed.tensor.parallel.test_parallelize_api.TensorParallelAPITests | test_linear_row_wise_parallel | failed | AssertionError: Tensor-likes are not close!
Use the PyTorch and torch-xpu-ops versions listed at https://wiki.ith.intel.com/pages/viewpage.action?spaceKey=mlpcdlval&title=distributed-ww17, then run:

```bash
cd <pytorch>/test/distributed/tensor/parallel
pytest -v test_parallelize_api.py -k test_linear_row_wise_parallel
```
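For context, here is a rough, hedged approximation of what this test exercises (module size, backend name, and layout choices are assumptions, not the exact test code): it row-wise parallelizes an nn.Linear over a 1-D XPU device mesh and compares it against a single-device copy.

```python
# Hedged standalone sketch (not the actual test code): assumes an XPU-enabled
# PyTorch build, the oneCCL-based "xccl" backend, and a launch such as
#   torchrun --nproc-per-node 4 repro_rowwise.py
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate
from torch.distributed.tensor.parallel import RowwiseParallel, parallelize_module


def main() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.xpu.set_device(rank)
    dist.init_process_group("xccl", rank=rank, world_size=world_size)
    mesh = init_device_mesh("xpu", (world_size,))

    # Same seed on every rank so the baseline and the parallelized copy start
    # from identical weights; 16 -> 10 only mirrors the 160-element weight in the log.
    torch.manual_seed(0)
    model = torch.nn.Linear(16, 10, device="xpu")
    model_tp = torch.nn.Linear(16, 10, device="xpu")
    model_tp.load_state_dict(model.state_dict())

    # Row-wise tensor parallelism: the weight is sharded along its input dim.
    # input_layouts=Replicate() is chosen here so a full input can be fed directly;
    # the real test's layout configuration may differ.
    model_tp = parallelize_module(model_tp, mesh, RowwiseParallel(input_layouts=Replicate()))

    inp = torch.rand(8, 16, device="xpu")
    # The test then compares outputs and parameters between model and model_tp;
    # the reported failure is in the parameter comparison.
    torch.testing.assert_close(model(inp), model_tp(inp))

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The pytest output from the actual test follows.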
```
../../../../test/distributed/tensor/parallel/test_parallelize_api.py::TensorParallelAPITests::test_linear_row_wise_parallel 2025:04:25-18:39:30:(2380657) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:04:25-18:39:30:(2380657) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:04:25-18:39:30:(2380656) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:04:25-18:39:30:(2380656) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:04:25-18:39:30:(2380654) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:04:25-18:39:30:(2380654) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:04:25-18:39:30:(2380655) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:04:25-18:39:30:(2380655) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_comparison.py:330: UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
Consider using tensor.detach() first. (Triggered internally at /home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/aten/src/ATen/native/Scalar.cpp:22.)
abs_diff=max_abs_diff.item(),
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] Caught exception:
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] Traceback (most recent call last):
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 761, in run_test
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] getattr(self, test_name)()
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 634, in wrapper
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] fn()
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3155, in wrapper
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] method(*args, **kwargs)
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 410, in wrapper
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] raise e
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 407, in wrapper
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] func(self, *args, **kwargs) # type: ignore[misc]
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/tensor/parallel/test_parallelize_api.py", line 154, in test_linear_row_wise_parallel
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] self._compare_module(model, model_tp, inp_size, rowwise=True)
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/tensor/parallel/test_parallelize_api.py", line 77, in _compare_module
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] self._compare_params(local_module, dist_module, rank0_only)
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/tensor/parallel/test_parallelize_api.py", line 61, in _compare_params
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] self.assertEqual(
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4096, in assertEqual
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] raise error_metas.pop()[0].to_error( # type: ignore[index]
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] AssertionError: Tensor-likes are not close!
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768]
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] Mismatched elements: 80 / 160 (50.0%)
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] Greatest absolute difference: 0.4322415590286255 at index (2, 11) (up to 1e-05 allowed)
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] Greatest relative difference: 27.006444931030273 at index (2, 9) (up to 1.3e-06 allowed)
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] weight not equal between dist and non-dist
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768]
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] To execute this test, run the following from the base repo dir:
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] PYTORCH_TEST_WITH_SLOW=1 python test/distributed/tensor/parallel/test_parallelize_api.py TensorParallelAPITests.test_linear_row_wise_parallel
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768]
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
E0425 18:39:32.203000 2380656 site-packages/torch/testing/_internal/common_distributed.py:768] exiting process 2 with exit code: 10
```
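The assertion itself comes from the test's parameter check ("weight not equal between dist and non-dist"): roughly, the row-wise sharded DTensor weight is gathered back to a full replicated tensor and compared against the plain single-device weight. A hedged sketch of that step (the helper name and arguments are placeholders, not the test's own helpers):

```python
import torch
from torch.distributed.tensor import DTensor, Replicate


def compare_weight(local_linear: torch.nn.Linear, tp_linear: torch.nn.Module) -> None:
    """Compare a tensor-parallel Linear's weight against the non-distributed copy."""
    w = tp_linear.weight
    if isinstance(w, DTensor):
        # Replicate the shards across the mesh so every rank holds the full weight.
        w = w.redistribute(device_mesh=w.device_mesh, placements=[Replicate()]).to_local()
    # On XPU this is the comparison that reports ~50% mismatched elements.
    torch.testing.assert_close(local_linear.weight.detach(), w)
```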
Platform
Data Center GPU Max 1100 OpenCL 3.0 NEO [25.05.32567]
libigc2 2.7.11-1099~22.04
```
         GPU 0/0  GPU 1/0  GPU 2/0  GPU 3/0  GPU 4/0  GPU 5/0  GPU 6/0  GPU 7/0  CPU Affinity
GPU 0/0  S        XL8      XL8      XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 1/0  XL8      S        XL8      XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 2/0  XL8      XL8      S        XL8      SYS      SYS      SYS      SYS      0-47,96-143
GPU 3/0  XL8      XL8      XL8      S        SYS      SYS      SYS      SYS      0-47,96-143
GPU 4/0  SYS      SYS      SYS      SYS      S        XL8      XL8      XL8      48-95,144-191
GPU 5/0  SYS      SYS      SYS      SYS      XL8      S        XL8      XL8      48-95,144-191
GPU 6/0  SYS      SYS      SYS      SYS      XL8      XL8      S        XL8      48-95,144-191
GPU 7/0  SYS      SYS      SYS      SYS      XL8      XL8      XL8      S        48-95,144-191
```
ZE_AFFINITY_MASK=0,1,2,3
Versions
See https://wiki.ith.intel.com/display/mlpcdlval/distributed-ww17