
Trials hang when using a scheduler #253

Open
dcfidalgo opened this issue Mar 17, 2023 · 0 comments · May be fixed by #254

Hi there!
I first encountered this issue when trying to run PBT on a multi-node DDP setup (4 GPUs per node, each node being one population member), but I could not reproduce it consistently.
Now I have managed to reproduce the same behavior with an ASHA scheduler: as soon as ASHA terminates a trial, the subsequent trials simply hang in the RUNNING state and never finish.

== Status ==
Current time: 2023-03-17 10:12:33 (running for 00:00:41.50)
Memory usage on this node: 154.0/250.9 GiB 
Using AsyncHyperBand: num_stopped=1
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: -1.25
Resources requested: 3.0/4 CPUs, 0/0 GPUs, 0.0/64.44 GiB heap, 0.0/31.61 GiB objects
Result logdir: /dcfidalgo/ray_results/train_func_2023-03-17_10-11-51
Number of trials: 3/3 (1 RUNNING, 2 TERMINATED)
+------------------------+------------+---------------------+------------+--------+------------------+------------+
| Trial name             | status     | loc                 |   val_loss |   iter |   total time (s) |   val_loss |
|------------------------+------------+---------------------+------------+--------+------------------+------------|
| train_func_c1436_00002 | RUNNING    | 10.181.103.72:74356 |          3 |        |                  |            |
| train_func_c1436_00000 | TERMINATED | 10.181.103.72:74356 |          1 |      1 |          6.91809 |          1 |
| train_func_c1436_00001 | TERMINATED | 10.181.103.72:74356 |          2 |      1 |          6.20699 |          2 |
+------------------------+------------+---------------------+------------+--------+------------------+------------+

I could trace the issue back to a hanging ray.get call when trying to fetch self._master_addr here, but I simply cannot figure out what the underlying cause is ...
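
For debugging, a minimal sketch of how one could wrap that ray.get in a timeout so the hang surfaces as an exception instead of blocking forever (fetch_with_timeout and master_addr_ref are placeholder names, not actual ray_lightning internals):

import ray
from ray.exceptions import GetTimeoutError

def fetch_with_timeout(object_ref, timeout_s: float = 30.0):
    # Wrap ray.get with a timeout so a stuck remote call raises
    # instead of hanging the trial forever.
    try:
        return ray.get(object_ref, timeout=timeout_s)
    except GetTimeoutError:
        raise RuntimeError(
            f"ray.get did not return within {timeout_s}s -- the worker "
            "actor holding the master address seems to be stuck or gone")

# Usage would look like: master_addr = fetch_with_timeout(master_addr_ref),
# where master_addr_ref stands for the remote call that returns self._master_addr.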

A minimal script to reproduce the issue:

import torch
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler

from ray_lightning import RayStrategy
from ray_lightning.tests.utils import BoringModel, get_trainer
from ray_lightning.tune import TuneReportCallback, get_tune_resources


# BoringModel variant that always reports a fixed val_loss, giving the
# scheduler a metric to rank (and terminate) trials by.
class AnotherBoringModel(BoringModel):
    def __init__(self, val_loss: float):
        super().__init__()
        self._val_loss = torch.tensor(val_loss)

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self._val_loss)
        return {"x": self._val_loss}


address_info = ray.init(num_cpus=4)


strategy = RayStrategy(num_workers=2, use_gpu=False)
callbacks = [TuneReportCallback(on="validation_end")]


def train_func(config):
    model = AnotherBoringModel(config["val_loss"])
    trainer = get_trainer(
        "./",
        callbacks=callbacks,
        strategy=strategy,
        checkpoint_callback=False,
        max_epochs=1)
    trainer.fit(model)


# With the ASHA scheduler attached, the run hangs as soon as a trial is
# terminated early; the remaining trials stay in RUNNING forever.
tune.run(
    train_func,
    config={"val_loss": tune.grid_search([1., 2., 3.])},
    resources_per_trial=get_tune_resources(
        num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
    num_samples=1,
    scheduler=AsyncHyperBandScheduler(metric="val_loss", mode="min")
)

If you remove the scheduler, the above script terminates without issues.
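
For completeness, a sketch of the scheduler-free variant of the tune.run call (identical to the one above, just without the scheduler argument, so Tune falls back to its default FIFO trial execution):

# Same experiment as above, but without the ASHA scheduler: all three
# trials run to completion and the script exits cleanly.
tune.run(
    train_func,
    config={"val_loss": tune.grid_search([1., 2., 3.])},
    resources_per_trial=get_tune_resources(
        num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
    num_samples=1,
)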

A corresponding conda env:

name: schedulerbug
channels:
  - pytorch
dependencies:
  - python=3.9
  - pytorch==1.11.0
  - cpuonly
  - pip
  - pip:
    - pytorch-lightning==1.6.4
    - ray[tune]==2.3.0
    - git+https://github.com/ray-project/ray_lightning.git@main

Is anyone else experiencing the same issue? Any kind of help would be very much appreciated! 😃
Have a great day!

dcfidalgo added a commit to dcfidalgo/ray_lightning that referenced this issue Mar 17, 2023