Combine outputs in test epochs when using DDP #11086
I'm training a model across two GPUs on patient data, where each sample has an id. In my test steps, I output dictionaries that contain the id as well as all the metrics. I store these (a list with one dict per id) at the end of the test epoch, so I can later evaluate model performance statistically. I'm experiencing a problem with the test step, however.
I hadn't considered this before (as I'm used to training on a single GPU), but the test_results attribute now only contains half of the outputs (one half per process). So when my main script reaches this section, only half the output is effectively stored:
I have read about the |
Replies: 3 comments 11 replies
all_gather is different from all_reduce. It doesn't do any math operation here.
all_gather isn't working for you?
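To make the difference concrete, here is a small pure-Python simulation. No torch is involved, and `simulated_all_reduce` and `simulated_all_gather` are hypothetical helper names, not library functions. Each element of the input list stands for the value held by one rank.

```python
def simulated_all_reduce(per_rank_values):
    """Every rank receives one combined value (here: the sum)."""
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]


def simulated_all_gather(per_rank_values):
    """Every rank receives a copy of all ranks' values, unchanged."""
    return [list(per_rank_values) for _ in per_rank_values]


per_rank = [3, 5]  # rank 0 holds 3, rank 1 holds 5
reduced = simulated_all_reduce(per_rank)   # one number per rank
gathered = simulated_all_gather(per_rank)  # one full list per rank
```

With a gather, every process ends up holding every rank's value, so nothing is lost; how to combine the values afterwards is left to you.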
Hi @WouterDurnez, thanks for your hints.
I would think all_gather_object would be optimal given the data types, no?
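A minimal sketch of that idea, assuming each rank holds a list of dicts with an `id` key. The key name and both helper names (`merge_by_id`, `gather_test_results`) are my assumptions, not anything from Lightning. It relies on `torch.distributed.all_gather_object`, which pickles arbitrary Python objects, and falls back to the local list when torch is missing or no process group is initialized. The de-duplication step is there because a distributed sampler can pad the dataset with repeated samples so that every rank gets the same count.

```python
try:
    import torch.distributed as dist
except ImportError:  # lets the sketch run where torch is not installed
    dist = None


def merge_by_id(per_rank_results, key="id"):
    """Flatten per-rank lists of dicts, keeping the first dict seen per id."""
    merged, seen = [], set()
    for rank_results in per_rank_results:
        for item in rank_results:
            if item[key] not in seen:
                seen.add(item[key])
                merged.append(item)
    return merged


def gather_test_results(local_results):
    """Return the result dicts from all ranks as one de-duplicated list."""
    if dist is None or not (dist.is_available() and dist.is_initialized()):
        # Single-process run: there is only the local list.
        per_rank = [list(local_results)]
    else:
        # One slot per rank; all_gather_object fills them in on every rank.
        per_rank = [None] * dist.get_world_size()
        dist.all_gather_object(per_rank, list(local_results))
    return merge_by_id(per_rank)
```

This could be called at the end of the test epoch, with only rank 0 writing the merged list to disk. If the metrics are tensors, converting them to plain Python numbers first (for example with `.item()`) is probably a good idea before gathering.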