Description
🐛 Describe the bug
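test_eager_async_allreduce_inductor_wait fails with "RuntimeError: Work ran time out after 18928 milliseconds." The error is raised by the work.wait(datetime.timedelta(seconds=10)) call in all_reduce_wait.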
Steps to reproduce:
pytest -vs test_inductor_collectives.py -k test_eager_async_allreduce_inductor_wait
Error message:
Traceback (most recent call last):
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 643, in wrapper
    self._join_processes(fn)
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 907, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 947, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 791, in run_test
    getattr(self, test_name)()
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 645, in wrapper
    fn()
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 3148, in wrapper
    method(*args, **kwargs)
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 205, in wrapper
    return func(*args, **kwargs)
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 1869, in wrapper
    return fn(*args, **kwargs)
  File "/home/sdp/penghuic/pytorch/test/distributed/test_inductor_collectives.py", line 310, in test_eager_async_allreduce_inductor_wait
    work, y, out_ref = _run_loop_collective_wait(
  File "/home/sdp/penghuic/pytorch/test/distributed/test_inductor_collectives.py", line 302, in _run_loop_collective_wait
    out = wait_fn(work, y)
  File "/home/sdp/penghuic/pytorch/test/distributed/test_inductor_collectives.py", line 276, in all_reduce_wait
    work.wait(datetime.timedelta(seconds=10))
RuntimeError: Work ran time out after 18928 milliseconds.
To execute this test, run the following from the base repo dir:
python test/distributed/test_inductor_collectives.py TestCollectivesMultiProc.test_eager_async_allreduce_inductor_wait
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0