Run specific code only once (which generates randomized values) before starting DDP #9134
-
Hi. I have a function that generates a set of random values (hyperparameters) which are then used to create my model. I want to run this function only once, use its output to create my model, and then start DDP training on that model. However, with the current setup, when I start DDP the randomize function gets called again, so I end up with 2 GPU processes, each having initialized the model with a different set of hyperparameters (the random values from the two calls are not the same). If I add ...
-
If that isn't too costly to generate, I'd recommend generating them on every process and then using DDP broadcasting to overwrite the values with the ones from the main process (src=0).
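In case it helps, here is a minimal sketch of that pattern using torch.distributed.broadcast_object_list. The generate_hparams helper and MyModel are hypothetical stand-ins for your own code, and the call assumes the process group is already initialized (as it is once DDP training has started):

import random

import torch.distributed as dist


def generate_hparams():
    # hypothetical stand-in for the function that draws randomized hyperparameters
    return {"num_layers": random.randint(10, 100), "lr": 10 ** random.uniform(-5, -2)}


# every process draws its own candidate values (cheap to regenerate) ...
hparams = [generate_hparams()]

# ... and then overwrites them with the values drawn on the main process (src=0),
# so all ranks end up building the model from identical hyperparameters
dist.broadcast_object_list(hparams, src=0)
hparams = hparams[0]

model = MyModel(**hparams)  # hypothetical model class taking these hyperparameters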
-
Hey @Gateway2745,
Here is an example where I am broadcasting the current checkpoint tmpdir to all processes: https://github.com/PyTorchLightning/pytorch-lightning/blob/522df2b89b35c050b14bb5e9c2ba2c3d1d20ea67/tests/core/test_metric_result_integration.py#L468
Best, tchaton
-
Hello @tchaton. Thank you for the helpful example. I have a quick question: since I would need to broadcast my hyperparameters (stored in a Python dictionary) before I even create my model, where would be the best place to do this? I guess I would need to do it before ...
-
Hey @Gateway2745,

You could do this:

from unittest import mock

import optuna
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.cli import LightningCLI

config_path = ...


class MyModel(LightningModule):
    def __init__(self, num_layers):
        ...


def objective(trial):
    # suggest an integer so the CLI can parse it into the model's num_layers argument
    num_layers = trial.suggest_int("num_layers", 10, 100)
    with mock.patch("sys.argv", ["any.py", "--config", str(config_path), "--trainer.accelerator", "ddp_spawn", "--trainer.gpus", "2", "--model.num_layers", str(num_layers)]):
        cli = LightningCLI(MyModel, MyDataModule)
    return cli.trainer.checkpoint_callback.best_model_score


study = optuna.create_study()
study.optimize(objective, n_trials=100)
study.best_params
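As far as I understand, this sidesteps the original problem because the hyperparameters are fixed once in the parent process (inside objective, via sys.argv) before the Trainer spawns the ddp_spawn workers, so every GPU process builds the model with the same num_layers and no explicit broadcasting is needed.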