How to calculate metric over entire validation set when training with DDP? #3225
Replies: 14 comments 18 replies
-
I found a workaround (see Line 166): now all processes run inference on the entire validation set, which seems inefficient (probably the same speed as single-GPU validation), but they all return the same metrics. In MMDetection there is a class that handles this;
I'll take a look at the source code to see whether something like it could be integrated into Lightning.
-
https://pytorch-lightning.readthedocs.io/en/latest/metrics.html#auroc
-
I run this on my metrics, though you might have to cast your tensor to CUDA (self.device) first.
-
How can I calculate a custom metric over the entire set?
-
Interested in this as well; so far I only know how to calculate on each GPU and then reduce.
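A minimal sketch of that per-GPU-then-reduce pattern, assuming torch.distributed with the gloo backend. It uses a single-process group (world_size=1) purely so it runs standalone; under real DDP the launcher sets up the group and each rank contributes its own counts. The ddp_accuracy helper name is mine, not a Lightning API:

```python
import os
import torch
import torch.distributed as dist

# Single-process group purely for illustration; under real DDP the
# launcher initializes the group and every rank runs this same code.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("gloo", rank=0, world_size=1)

def ddp_accuracy(correct: int, total: int) -> float:
    # Sum the per-rank counts across all processes, then divide once,
    # so every rank ends up with the same global accuracy.
    stats = torch.tensor([correct, total], dtype=torch.float64)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return (stats[0] / stats[1]).item()

acc = ddp_accuracy(correct=30, total=40)
print(acc)  # 0.75
dist.destroy_process_group()
```

Reducing counts (rather than per-rank averages) keeps the result exact even when ranks see different numbers of samples.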
-
I am currently trying something similar to what you attempted, @s-rog: the idea is to pickle the results on each rank and then collect them afterwards on rank 0.
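One way to do that pickle-and-collect step is torch.distributed.all_gather_object (available in torch >= 1.8), which pickles each rank's object under the hood. A sketch, again using a single-process gloo group as a stand-in for a real DDP setup; collect_on_rank_zero is a hypothetical helper name:

```python
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group("gloo", rank=0, world_size=1)

def collect_on_rank_zero(local_results):
    # all_gather_object pickles each rank's object and distributes
    # the full list to every rank; we only keep it on rank 0.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_results)
    if dist.get_rank() == 0:
        return [item for part in gathered for item in part]
    return None

merged = collect_on_rank_zero([0.2, 0.8])
print(merged)  # [0.2, 0.8] with one rank; all ranks' lists concatenated under DDP
dist.destroy_process_group()
```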
-
Please see #3159 for a temporary solution. I have tested it and it works in my code.
-
@psinger looking into the torch.distributed docs, I think we need to:
or
I'm assuming this is only for logging purposes, as backprop would probably cause issues. Also, this is just from reading the docs; I haven't tried it out yet. Edit:
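For logging-only metrics, the relevant collective is typically all_gather, which gives every rank a copy of every other rank's tensor (note it requires the per-rank tensors to have identical shapes). A hedged sketch with a single-process gloo group standing in for DDP; gather_tensor is my own helper name:

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29513")
dist.init_process_group("gloo", rank=0, world_size=1)

def gather_tensor(local: torch.Tensor) -> torch.Tensor:
    # Every rank receives every rank's tensor (including its own),
    # so the concatenated result is identical on all processes.
    buckets = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(buckets, local)
    return torch.cat(buckets)

preds = gather_tensor(torch.tensor([0.25, 0.75]))
print(preds.tolist())  # [0.25, 0.75]
dist.destroy_process_group()
```

As noted above, the gathered tensors carry no autograd history, which is why this is suited to logging rather than backprop.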
-
@awaelchli The pl metric AUROC does not have reduce_group or reduce_op defined; can it still reduce across DDP?
-
@sooheon hmm, I don't think you can define a meaningful reduction operation for that metric. The best approach is to gather all pairs and then compute the ROC once over all the data.
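To illustrate why gathering first matters: AUROC can be computed over the pooled (prediction, target) pairs via the Mann-Whitney formulation, and there is no exact way to merge per-shard AUROC values. A small sketch in plain torch (the auroc helper is mine, not a Lightning API):

```python
import torch

def auroc(preds: torch.Tensor, targets: torch.Tensor) -> float:
    # Mann-Whitney formulation: the probability that a randomly chosen
    # positive is scored above a randomly chosen negative (ties count half).
    pos = preds[targets == 1]
    neg = preds[targets == 0]
    wins = (pos.unsqueeze(1) > neg.unsqueeze(0)).sum()
    ties = (pos.unsqueeze(1) == neg.unsqueeze(0)).sum()
    return ((wins + 0.5 * ties) / (pos.numel() * neg.numel())).item()

# Pretend these were already gathered from all ranks.
preds = torch.tensor([0.9, 0.6, 0.4, 0.1])
targets = torch.tensor([1, 0, 1, 0])
score = auroc(preds, targets)
print(score)  # 0.75
```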
-
Currently if I use AUROC in val_epoch_end, does this happen?
-
Note that we are working on implementing aggregation for metrics (this PR #3321 has started the process) such that each metric gets an
-
Class-based metrics have been revamped!
-
I have the same problem. I want to compute some metrics on the entire validation set while using DDP. Could you confirm that calculating F1, ROC AUC, and PR AUC is supported in the latest PyTorch Lightning version now? And if I want to calculate a custom metric, what should I do? Thanks in advance!
-
I started refactoring my code into Lightning yesterday. When I perform validation, I save all the predictions over the entire validation set and then calculate the validation metrics on all validation data at once. This is especially important for metrics like AUROC.
I am training a model with DDP on 4 GPUs. I have a validation_epoch_end method to calculate a metric over the entire validation set. Here is a script that illustrates the problem I'm encountering:
snippet.zip
However, when using DDP, this method gets called separately in each process, so I end up calculating the metric 4 times, each on 1/4 of the overall validation set. When I look at the values of each of the 4 AUROCs and the value that gets saved to checkpoint_on, the saved value is just 1 of the 4 (I'm assuming the one calculated by the process with rank 0?). I tried using the built-in pytorch_lightning metrics, but those give me a RuntimeError: Tensors must be CUDA and dense. This is on the most current branch (0.9.1.dev).
There may be a simple solution to this, but I spent the last few hours combing through the docs and existing issues without any luck.
Thanks in advance to anyone who can help.