How to gather results on multiple GPUs while testing? ddp #1974

Use torch.distributed.all_gather to gather and merge the outputs from all GPUs.
You should also remove the redundant examples, because the DistributedSampler used for DDP pads the dataset with extra samples so that it splits evenly across GPUs (https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler).
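To see the padding concretely, here is a minimal sketch; the 10-sample dataset and the 4-process split are made-up numbers for illustration. DistributedSampler rounds the per-rank sample count up, so the gathered outputs contain a few repeated samples:

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical example: 10 samples shared across 4 processes.
dataset = TensorDataset(torch.arange(10))
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=False)

# Each rank gets ceil(10 / 4) == 3 samples, so the 4 ranks together yield
# 12 samples: 2 of them are duplicates that must be removed after
# all_gather before computing metrics.
print(len(sampler))  # 3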

Here is the workaround snippet used in my own project.

import torch
import torch.distributed as dist


def gather_distributed(*tensors):
    """Gather each tensor from every process and concatenate along dim 0."""
    output_tensors = []
    for tensor in tensors:
        # One buffer per rank; all_gather fills them in rank order.
        tensor_list = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(tensor_list, tensor)
        output_tensors.append(torch.cat(tensor_list))
    return output_tensors


def deduplicate_and_sort(index, *tensors):
    reverse_…
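
For context, a rough sketch of how gather_distributed could be wired into a LightningModule test loop. The LitClassifier class, the index-carrying batches, and the test_acc metric name are assumptions for illustration, not part of the original answer:

import torch
import torch.distributed as dist
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 10)

    def forward(self, x):
        return self.net(x)

    def test_step(self, batch, batch_idx):
        # Assumes the dataset also yields each sample's index so that
        # duplicates padded in by the DistributedSampler can be dropped later.
        x, y, idx = batch
        preds = self(x).argmax(dim=-1)
        return {"idx": idx, "preds": preds, "target": y}

    def test_epoch_end(self, outputs):
        idx = torch.cat([o["idx"] for o in outputs])
        preds = torch.cat([o["preds"] for o in outputs])
        target = torch.cat([o["target"] for o in outputs])
        if dist.is_available() and dist.is_initialized():
            # Merge the per-GPU results so every rank sees the full test set.
            idx, preds, target = gather_distributed(idx, preds, target)
        # Sort by dataset index and keep only the first occurrence of each one.
        order = torch.argsort(idx)
        idx, preds, target = idx[order], preds[order], target[order]
        keep = torch.ones_like(idx, dtype=torch.bool)
        keep[1:] = idx[1:] != idx[:-1]
        acc = (preds[keep] == target[keep]).float().mean()
        self.log("test_acc", acc)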

Answer selected by Borda

This discussion was converted from issue #1974 on December 23, 2020 19:23.