[distributed] RuntimeError: Work ran time out after 18928 milliseconds. #1728

Open
@PenghuiCheng

Description

🐛 Describe the bug

test_eager_async_allreduce_inductor_wait in test_inductor_collectives.py fails with:
RuntimeError: Work ran time out after 18928 milliseconds.

Steps to reproduce:
pytest -vs test_inductor_collectives.py -k test_eager_async_allreduce_inductor_wait

Error message:
Traceback (most recent call last):
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 643, in wrapper
    self._join_processes(fn)
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 907, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 947, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 791, in run_test
    getattr(self, test_name)()
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 645, in wrapper
    fn()
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 3148, in wrapper
    method(*args, **kwargs)
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_distributed.py", line 205, in wrapper
    return func(*args, **kwargs)
  File "/home/sdp/penghuic/pytorch/torch/testing/_internal/common_utils.py", line 1869, in wrapper
    return fn(*args, **kwargs)
  File "/home/sdp/penghuic/pytorch/test/distributed/test_inductor_collectives.py", line 310, in test_eager_async_allreduce_inductor_wait
    work, y, out_ref = _run_loop_collective_wait(
  File "/home/sdp/penghuic/pytorch/test/distributed/test_inductor_collectives.py", line 302, in _run_loop_collective_wait
    out = wait_fn(work, y)
  File "/home/sdp/penghuic/pytorch/test/distributed/test_inductor_collectives.py", line 276, in all_reduce_wait
    work.wait(datetime.timedelta(seconds=10))
RuntimeError: Work ran time out after 18928 milliseconds.
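
For triage, it may help to compare the elapsed time in the final error line with the timeout passed to work.wait in the last frame (datetime.timedelta(seconds=10)). The following is a minimal standard-library sketch of that comparison. The helper name parse_elapsed_ms and its regular expression are illustrative assumptions and are not part of PyTorch.

```python
import datetime
import re

# Error text and timeout value copied from the traceback in this report.
ERROR_MESSAGE = "RuntimeError: Work ran time out after 18928 milliseconds."
REQUESTED_TIMEOUT = datetime.timedelta(seconds=10)


def parse_elapsed_ms(message: str) -> int:
    """Hypothetical helper: pull the millisecond count out of the message."""
    match = re.search(r"after (\d+) milliseconds", message)
    if match is None:
        raise ValueError(f"unrecognized timeout message: {message!r}")
    return int(match.group(1))


# Convert the reported value to a timedelta so both sides share one unit.
elapsed = datetime.timedelta(milliseconds=parse_elapsed_ms(ERROR_MESSAGE))
overshoot = elapsed - REQUESTED_TIMEOUT

print(f"requested: {REQUESTED_TIMEOUT.total_seconds():.3f} s")
print(f"reported:  {elapsed.total_seconds():.3f} s")
print(f"overshoot: {overshoot.total_seconds():.3f} s")
```

The reported time is roughly 8.9 s longer than the requested 10 s. This log alone does not establish whether that gap comes from when the timeout check ran or from how the elapsed time is measured.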

To execute this test, run the following from the base repo dir:
python test/distributed/test_inductor_collectives.py TestCollectivesMultiProc.test_eager_async_allreduce_inductor_wait

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Versions

env_weekly.txt

Labels

bug (Something isn't working)
module: distributed (For distributed feature issue)
