Hi there!
I first encountered this issue when running PBT on a multi-node DDP setup (4 GPUs per node, each node being a population member), but I could not consistently reproduce it.
Now, however, I have managed to reproduce the same behavior with an ASHA scheduler: as soon as the ASHA scheduler terminates a trial, the remaining trials simply hang in the RUNNING state and never finish.
```
== Status ==
Current time: 2023-03-17 10:12:33 (running for 00:00:41.50)
Memory usage on this node: 154.0/250.9 GiB
Using AsyncHyperBand: num_stopped=1
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: -1.25
Resources requested: 3.0/4 CPUs, 0/0 GPUs, 0.0/64.44 GiB heap, 0.0/31.61 GiB objects
Result logdir: /dcfidalgo/ray_results/train_func_2023-03-17_10-11-51
Number of trials: 3/3 (1 RUNNING, 2 TERMINATED)
+------------------------+------------+---------------------+------------+--------+------------------+------------+
| Trial name             | status     | loc                 |   val_loss |   iter |   total time (s) |   val_loss |
|------------------------+------------+---------------------+------------+--------+------------------+------------|
| train_func_c1436_00002 | RUNNING    | 10.181.103.72:74356 |          3 |        |                  |            |
| train_func_c1436_00000 | TERMINATED | 10.181.103.72:74356 |          1 |      1 |          6.91809 |          1 |
| train_func_c1436_00001 | TERMINATED | 10.181.103.72:74356 |          2 |      1 |          6.20699 |          2 |
+------------------------+------------+---------------------+------------+--------+------------------+------------+
```
I could trace the issue back to a hanging ray.get call when trying to fetch self._master_addr here. But I simply cannot figure out what the underlying cause is...
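For illustration, ray.get blocks indefinitely when the remote task it waits on never completes. A minimal sketch of that pattern, using hypothetical actor and method names rather than the actual ray_lightning internals; the timeout is only there to make the hang visible:

```python
# Sketch of the hang pattern (hypothetical names, not the ray_lightning code).
import time

import ray
from ray.exceptions import GetTimeoutError

ray.init(num_cpus=2)


@ray.remote
class Worker:
    def get_master_addr(self):
        # Simulate a worker that never answers, e.g. because its trial
        # was torn down before it could reply.
        time.sleep(10_000)
        return "10.181.103.72"


worker = Worker.remote()

# Without a timeout this ray.get blocks forever; adding one turns the
# silent hang into a visible error, which helps when debugging.
try:
    addr = ray.get(worker.get_master_addr.remote(), timeout=5)
except GetTimeoutError:
    print("ray.get timed out waiting for the master address")
```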
A minimal script to reproduce the issue:
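The attached script itself is not included in this excerpt; below is a sketch of what such a reproduction might look like, assuming ray_lightning's RayStrategy together with Ray Tune's ASHAScheduler (model, metric names, and hyperparameters are illustrative):

```python
# Sketch of a minimal reproduction; not the original script.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray_lightning import RayStrategy
from ray_lightning.tune import TuneReportCallback, get_tune_resources


class BoringModel(pl.LightningModule):
    """Tiny model so each trial finishes in a few seconds."""

    def __init__(self, lr: float):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.lr = lr

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)


def train_func(config):
    data = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    loader = DataLoader(data, batch_size=32)
    trainer = pl.Trainer(
        max_epochs=4,
        enable_progress_bar=False,
        strategy=RayStrategy(num_workers=2, use_gpu=False),
        # Reports "val_loss" back to Tune after every validation epoch.
        callbacks=[TuneReportCallback({"val_loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(BoringModel(lr=config["lr"]), loader, loader)


if __name__ == "__main__":
    tune.run(
        train_func,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        num_samples=3,
        metric="val_loss",
        mode="min",
        # ASHA stops underperforming trials early; this early termination
        # is what precedes the hang described above.
        scheduler=ASHAScheduler(max_t=4, grace_period=1),
        # 1 CPU for the Tune trial itself plus 2 for the Ray workers.
        resources_per_trial=get_tune_resources(num_workers=2),
    )
```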
If you remove the scheduler, the above script terminates without issues.
A corresponding conda env:
Is anyone else experiencing the same issue? Any kind of help would be very much appreciated! 😃
Have a great day!