Trainer with ddp accelerator creates too many GPU processes #4828
Replies: 14 comments
-
Hi! Thanks for your contribution, great first issue!
-
Reduce the number of workers in the dataloader.
-
I have set it to 0, but it doesn't work.
-
Are you experiencing any performance degradation? I'm not familiar with gpustat; can you look at the process tree in htop?
-
Are you showing threads in htop as well (the "Hide userland process threads" option)? Some minimal reproducible code would be helpful!
-
@shenmishajing I think num_workers no longer controls the exact number of threads created per GPU in PyTorch. One post which highlights this: https://discuss.pytorch.org/t/total-number-of-processes-and-threads-created-using-nn-distributed-parallel/71043. But let me confirm whether this is a PyTorch/ddp backend thing rather than a Lightning issue.
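(Side note, not from the thread: a tiny sketch of what num_workers does control, namely the number of CPU worker processes that load batches, not the number of threads or contexts created per GPU. The toy dataset below is made up.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3), torch.randn(64, 1))
# num_workers=0 loads data in the main process; num_workers=N forks N extra
# CPU processes for data loading only -- it does not change how many CUDA
# threads or contexts each DDP rank creates.
loader = DataLoader(dataset, batch_size=8, num_workers=0)
for x, y in loader:
    pass  # batches arrive on the CPU; GPU usage is unaffected by num_workers
```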
-
Yeah, so the many processes you were seeing are actually threads, and like @ananyahjha93 said, I'm pretty sure that is PyTorch behavior. That's an issue with TorchGraph and not Lightning; it has to do with the way certain functions are implemented, which makes them un-picklable. I would stick with ddp for now if you're using TorchGraph, and perhaps post an issue there.
-
When I use the Trainer with the ddp accelerator, I get output like this:
It looks like every process can see all the GPUs. Would it fix this issue if we set the CUDA_VISIBLE_DEVICES variable in every process before we run the training code?
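(A sketch of that idea, with my own assumptions: LOCAL_RANK is provided by the launcher, and the variable is set before the process does any CUDA work.)

```python
import os

# Assumption: the launcher (e.g. torch.distributed.launch) exports LOCAL_RANK.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Must be set before this process initializes CUDA; once a CUDA context
# exists, changing CUDA_VISIBLE_DEVICES has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

import torch  # noqa: E402

print(torch.cuda.device_count())  # should now report a single visible device
```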
-
Same problem. Besides, if we set, say, gpus=[3, 4], the first two GPUs will also be used but not really do any computation (GPU-Util 0%). And I get similar output:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Is this an issue with PyTorch or PL?
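(Not an official fix, just a sketch of the usual plain-PyTorch workaround: call torch.cuda.set_device for the rank's own GPU before any tensors are created or the process group is initialized, so no rank opens a context on GPU 0. The gpu_ids mapping and the env:// settings mentioned below are assumptions.)

```python
import torch
import torch.distributed as dist


def setup_ddp(local_rank: int, gpu_ids=(3, 4)):
    # Pin this process to its own GPU first, so CUDA work never lands on GPU 0.
    device = torch.device(f"cuda:{gpu_ids[local_rank]}")
    torch.cuda.set_device(device)
    # env:// init assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set.
    dist.init_process_group(backend="nccl")
    return device
```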
-
A solution for this:
But, if we use cmd:
-
Was this bug in the ddp backend solved?
-
Obviously, it is not fixed yet. An alternative solution is to rewrite the ddp backend following the logic of the launch script in PyTorch.
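(A hedged sketch of that "follow the launch script" idea: spawn one child process per selected GPU, the way torch.distributed.launch does, and restrict each child with CUDA_VISIBLE_DEVICES so it cannot open contexts on other GPUs. The script name train.py and the port are placeholders.)

```python
import os
import subprocess
import sys

gpus = [0, 1, 2, 3, 4, 5]  # the six GPUs mentioned in the original question
procs = []
for local_rank, gpu in enumerate(gpus):
    env = os.environ.copy()
    env.update({
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": str(len(gpus)),
        "RANK": str(local_rank),
        "LOCAL_RANK": str(local_rank),
        # Each child sees exactly one device, so it can only create one context.
        "CUDA_VISIBLE_DEVICES": str(gpu),
    })
    procs.append(subprocess.Popen([sys.executable, "train.py"], env=env))

for p in procs:
    p.wait()
```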
-
❓ Questions and Help
What is your question?
When I use the ddp accelerator with multiple GPUs on a single machine, I get this output from the command gpustat:
In that output, xx is my user name and aaa is another user. I only use the first 6 GPUs and leave the last GPU for the other user. But my code creates too many GPU processes on the first 6 GPUs, and it even creates the same number of processes on the last GPU.
My code looks like:
It looks like every CPU process creates a GPU process on every visible GPU, although each of them will only use one GPU. How can I force them to create only one GPU process, on the correct GPU? I want the GPU processes to look like:
What's your environment?
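(For context, a minimal sketch of the kind of setup being described; this is not the original poster's code. Assumptions: a toy LightningModule, the ddp accelerator, the first six GPUs, num_workers=0, and the Trainer API from the Lightning versions this thread is about.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(3, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 3), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32, num_workers=0)
    # gpus=6 selects GPUs 0-5; accelerator="ddp" launches one process per GPU.
    trainer = pl.Trainer(gpus=6, accelerator="ddp", max_epochs=1)
    trainer.fit(ToyModel(), loader)
```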