How to gather results on multiple GPUs while testing? ddp #1974

Use torch.distributed.all_gather to gather and merge the outputs from all GPUs.
You should also remove the redundant examples, because the DistributedSampler used for DDP pads the dataset with extra samples so that it splits evenly across GPUs (https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler).
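To see the padding concretely, here is a minimal sketch; the 10-sample dataset and the 4-process split are made-up numbers for illustration. DistributedSampler rounds the per-rank sample count up, so the gathered outputs contain a few repeated samples:

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical example: 10 samples shared across 4 processes.
dataset = TensorDataset(torch.arange(10))
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=False)

# Each rank gets ceil(10 / 4) == 3 samples, so the 4 ranks together yield
# 12 samples: 2 of them are duplicates that must be removed after
# all_gather before computing metrics.
print(len(sampler))  # 3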

Here is the workaround snippet used in my own project.

import torch
import torch.distributed as dist


def gather_distributed(*tensors):
    """Gather each tensor from every process and concatenate along dim 0."""
    output_tensors = []
    for tensor in tensors:
        # One buffer per rank; all_gather fills them in rank order.
        tensor_list = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(tensor_list, tensor)
        output_tensors.append(torch.cat(tensor_list))
    return output_tensors


def deduplicate_and_sort(index, *tensors):
    reverse_…
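
For context, a rough sketch of how gather_distributed could be wired into a LightningModule test loop. The LitClassifier class, the index-carrying batches, and the test_acc metric name are assumptions for illustration, not part of the original answer:

import torch
import torch.distributed as dist
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 10)

    def forward(self, x):
        return self.net(x)

    def test_step(self, batch, batch_idx):
        # Assumes the dataset also yields each sample's index so that
        # duplicates padded in by the DistributedSampler can be dropped later.
        x, y, idx = batch
        preds = self(x).argmax(dim=-1)
        return {"idx": idx, "preds": preds, "target": y}

    def test_epoch_end(self, outputs):
        idx = torch.cat([o["idx"] for o in outputs])
        preds = torch.cat([o["preds"] for o in outputs])
        target = torch.cat([o["target"] for o in outputs])
        if dist.is_available() and dist.is_initialized():
            # Merge the per-GPU results so every rank sees the full test set.
            idx, preds, target = gather_distributed(idx, preds, target)
        # Sort by dataset index and keep only the first occurrence of each one.
        order = torch.argsort(idx)
        idx, preds, target = idx[order], preds[order], target[order]
        keep = torch.ones_like(idx, dtype=torch.bool)
        keep[1:] = idx[1:] != idx[:-1]
        acc = (preds[keep] == target[keep]).float().mean()
        self.log("test_acc", acc)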

Answer selected by Borda

This discussion was converted from issue #1974 on December 23, 2020 19:23.