Combine outputs in test epochs when using DDP #11086

WouterDurnez · 2021-12-15T16:52:45Z

I'm training a model across two GPUs on patient data, where each sample carries a patient id. In my test step, I return a dictionary that contains the id as well as all the metrics. At the end of the test epoch I store these (a list with one dict per id), so I can statistically evaluate model performance later on.

I'm experiencing a problem with the test step, however.

from time import time

# Test step
def test_step(self, batch, batch_idx):

    # Unpack inputs, targets and patient ids
    x, y, id = batch["input"], batch["target"], batch["id"]

    # Infer and time inference
    start = time()
    y_hat = self.test_inference(x, self, **self.test_inference_params)
    end = time()

    # Unwrap the id for single-sample batches, otherwise keep all ids as a tuple
    id = id[0] if len(id) == 1 else tuple(id)

    # Output dict with id and duration of inference
    output = {"id": id, "time": end - start}

    # Add other metrics to output dict
    for m, pars in zip(self.metrics, self.metrics_params):

        metric_value = m(y_hat, y, **pars)

        if hasattr(metric_value, "item"):
            metric_value = metric_value.item()

        output[f"test_{m.__name__}"] = metric_value

    return output

# Test epoch end (= test end)
def test_epoch_end(self, outputs):

    # Store outputs (under DDP, only this process's share)
    self.test_results = outputs  # self.all_gather(outputs)

I hadn't considered this before (as I'm used to training on a single GPU), but the test_results attribute now only contains half of the outputs (one half per process). So when my main script reaches this section, only half the output is effectively stored:

log("Evaluating model.")
trainer.test(model=model,
             dataloaders=brats.val_dataloader())
results = model.test_results

# Save test results
log("Saving results.")
np.save(file=join(result_dir, f'{model_name}_v{version}_fold{fold_index}.npy'), arr=results)

I have read about the self.all_gather method, but I'm not sure it suits my needs. I want to merge the lists, not reduce anything. Also, they're not Tensors, but dicts. How can I store all dicts across both DDP processes?

Answered by rohitgr7

Dec 16, 2021

all_gather is different from all_reduce: it doesn't apply any math operation, it only collects.
Roughly:

all_gather -> collect outputs from all devices
all_reduce -> in general, collect outputs from all devices and reduce them (apply a math op such as sum or mean)

Is all_gather not working for you?
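
To make the distinction concrete, here is a minimal pure-Python sketch that only simulates the two collectives on a list with one value per rank. The helper names `simulate_all_gather` and `simulate_all_reduce` are made up for illustration and are not part of PyTorch or Lightning.

```python
from functools import reduce
import operator


def simulate_all_gather(per_rank_values):
    """Every rank ends up with the full list of values, one entry per rank."""
    gathered = list(per_rank_values)
    return [list(gathered) for _ in per_rank_values]


def simulate_all_reduce(per_rank_values, op=operator.add):
    """Every rank ends up with a single value: all contributions combined by `op`."""
    reduced = reduce(op, per_rank_values)
    return [reduced for _ in per_rank_values]


per_rank = [3.0, 5.0]  # rank 0 holds 3.0, rank 1 holds 5.0
gathered = simulate_all_gather(per_rank)
reduced = simulate_all_reduce(per_rank)
print(gathered)  # [[3.0, 5.0], [3.0, 5.0]]
print(reduced)   # [8.0, 8.0]
```

After a gather, each rank still sees every individual value, which is what merging per-patient results needs. After a reduce, the individual values are gone.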

9 replies

awaelchli Dec 18, 2021

Can you convert the contents to tensors (using torch.tensor(x, device=self.device)) and try again with self.all_gather?
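
One possible way to follow this suggestion is sketched below, assuming every value in the output dicts is numeric (so ids are already integers) and every process holds the same number of outputs. The helpers `outputs_to_columns`, `columns_to_outputs` and `gather_columns` are hypothetical names, and the assumed `(world_size, n_local)` result shape of `self.all_gather` should be checked against your Lightning version.

```python
def outputs_to_columns(outputs):
    """Turn a list of per-sample dicts into one list of numbers per key."""
    keys = list(outputs[0].keys())
    return {key: [out[key] for out in outputs] for key in keys}


def columns_to_outputs(columns):
    """Inverse of outputs_to_columns: rebuild one dict per sample."""
    keys = list(columns.keys())
    n_samples = len(columns[keys[0]])
    return [{key: columns[key][i] for key in keys} for i in range(n_samples)]


def gather_columns(module, columns):
    """Gather each column across processes; every process must call this."""
    import torch  # imported lazily so the helpers above work without PyTorch

    gathered = {}
    for key, values in columns.items():
        local = torch.tensor(values, device=module.device)
        # Assumption: under DDP the result has shape (world_size, n_local),
        # so flattening yields all values from all processes.
        gathered[key] = module.all_gather(local).reshape(-1).tolist()
    return gathered


# Hypothetical usage inside test_epoch_end:
#     columns = gather_columns(self, outputs_to_columns(outputs))
#     self.test_results = columns_to_outputs(columns)
```

Python floats become float32 tensors by default, so metric values may come back with slightly reduced precision.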

WouterDurnez Dec 18, 2021
Author

I'll give that a go, thanks. I'll update when I have results.

WouterDurnez Dec 23, 2021
Author

Seems like it all worked! Thanks @awaelchli and @rohitgr7!

icoz69 Mar 30, 2022

Hi, it seems that test_epoch_end is also run on each GPU node, right? Does that mean each node gathers the outputs from all nodes and does the same thing?

rohitgr7 Apr 4, 2022

Yes @icoz69,
not just on each GPU node, but on each device. To restrict this, you can check trainer.is_global_zero so that it collects on only one device across all nodes.

def test_epoch_end(...):
    if self.trainer.is_global_zero:
        # collect or gather
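
One caveat worth sketching: a collective such as all_gather needs every process to take part, so calling it only inside the is_global_zero branch would leave rank zero waiting for the others. A hedged variant is to gather everywhere and restrict only the storing or saving step. `GatherThenStoreMixin` is a hypothetical name, and the sketch assumes `outputs` is something `self.all_gather` accepts (for example numbers or tensors, not strings).

```python
class GatherThenStoreMixin:
    """Hypothetical mixin: gather on every process, store on one."""

    def test_epoch_end(self, outputs):
        # Collective call: every process has to reach this line,
        # so it stays outside the rank check.
        gathered = self.all_gather(outputs)

        # Only the global-zero process keeps the merged results.
        if self.trainer.is_global_zero:
            self.test_results = gathered
```

With this layout the main script should read `model.test_results` only on the global-zero process, since the other processes never set it.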

Jerzy97 · 2022-01-26T10:28:00Z

Hi @WouterDurnez,
I am in a very similar situation and have a follow-up question. How did you solve the problem of keeping track of the patient ID?
Did you map every ID (string) to a unique integer?

Thanks for your hints

2 replies

WouterDurnez Jan 26, 2022
Author

Hi @Jerzy97. Yes, in my case, the ids were already as simple as f"BraTS{patient integer id}", so I just parsed the integer. But you can map just as easily, sure. That solved it for me.
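
Both options can be sketched in a few lines. The function names below are hypothetical, and the parsing assumes ids end in an integer, as in the `BraTS{n}` format described above.

```python
import re


def parse_patient_number(patient_id):
    """Extract the trailing integer from an id such as 'BraTS42'."""
    match = re.search(r"(\d+)$", patient_id)
    if match is None:
        raise ValueError(f"No trailing integer in id: {patient_id!r}")
    return int(match.group(1))


def build_id_mapping(patient_ids):
    """Assign an integer to each distinct id, sorted so every process agrees."""
    return {pid: index for index, pid in enumerate(sorted(set(patient_ids)))}


def invert_mapping(mapping):
    """Map integers back to the original string ids after gathering."""
    return {index: pid for pid, index in mapping.items()}
```

The explicit mapping only works if every process builds it from the same full set of ids; sorting keeps the assignment identical across processes.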

Jerzy97 Jan 26, 2022

I see - thanks for your response!

mgwillia · 2022-07-08T18:38:33Z

I would think all_gather_object would be optimal given the data types, no?
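
A minimal sketch of that route, assuming the process group has already been initialized (as it is when running under DDP) and that the output dicts are picklable. `gather_outputs_as_objects` and `merge_rank_outputs` are hypothetical helper names; `torch.distributed.all_gather_object` is the PyTorch call being referred to.

```python
def gather_outputs_as_objects(outputs):
    """Collect arbitrary picklable outputs from every process."""
    import torch.distributed as dist  # lazy import; requires PyTorch

    if not (dist.is_available() and dist.is_initialized()):
        return [outputs]  # single-process run: nothing to gather

    per_rank = [None] * dist.get_world_size()
    # Every process must call this; each receives the outputs of all ranks.
    dist.all_gather_object(per_rank, outputs)
    return per_rank


def merge_rank_outputs(per_rank):
    """Flatten per-rank lists into one list, keeping the first dict per id."""
    merged = {}
    for rank_outputs in per_rank:
        for out in rank_outputs:
            merged.setdefault(out["id"], out)
    return list(merged.values())
```

Deduplicating by id is a precaution: a distributed sampler may repeat a few samples so that every process receives the same number of batches. This route keeps string ids as they are, at the cost of pickling the objects.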

0 replies
