Support distributed evaluation #176
base: main
Conversation
This LGTM, could you please double check @mitchellnw or @rwightman?
In the PyTorch ImageNet example for distributed ImageNet eval, they have an …
        assert wandb is not None, 'Please install wandb.'
        for name, val in metrics.items():
            wandb.log({f"val/{name}": val, 'epoch': epoch})

    return metrics


-def get_metrics(image_features, text_features, logit_scale):
+def get_metrics(image_features, text_features, logit_scale, args):
Nitpick, but maybe it would be better to pass a more specific argument, e.g. gather_tensor=args.distributed_evaluation?
Thanks, I agree, I think that's better.
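For context, here is a minimal sketch of what the suggested signature could look like, assuming the per-rank features are gathered with torch.distributed.all_gather before the usual retrieval metrics are computed. gather_features is a hypothetical helper and the metric computation below only approximates the existing get_metrics; the actual code in this PR may differ.

    import torch
    import torch.distributed as dist


    def gather_features(features):
        # Hypothetical helper (not from this PR): gather a tensor from every rank
        # and concatenate along the batch dimension.
        gathered = [torch.zeros_like(features) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, features)
        return torch.cat(gathered, dim=0)


    def get_metrics(image_features, text_features, logit_scale, gather_tensor=False):
        # gather_tensor would be passed as gather_tensor=args.distributed_evaluation,
        # so that every rank computes retrieval metrics over the full evaluation set.
        if gather_tensor:
            image_features = gather_features(image_features)
            text_features = gather_features(text_features)

        metrics = {}
        logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu()
        logits_per_text = logits_per_image.t()

        ground_truth = torch.arange(len(text_features)).view(-1, 1)
        for name, logit in (("image_to_text", logits_per_image), ("text_to_image", logits_per_text)):
            ranking = torch.argsort(logit, descending=True)
            preds = torch.where(ranking == ground_truth)[1]
            for k in (1, 5, 10):
                metrics[f"{name}_R@{k}"] = (preds < k).float().mean().item()
        return metrics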
@@ -359,7 +361,8 @@ def get_wds_dataset(args, preprocess_img, is_train, epoch=0, floor=False):
        num_samples = num_batches * global_batch_size
        dataset = dataset.with_epoch(num_worker_batches)  # each worker is iterating over this
    else:
        # last batches are partial, eval is done on single (master) node
        if args.distributed_evaluation:
            num_samples = num_samples // args.world_size
Maybe I'm misunderstanding, but does it skip the last few samples due to the last partial batch?
In evaluation, num_samples is only used for logging:

open_clip/src/training/train.py, line 172 in 03839c5:
    samples_per_val = dataloader.num_samples

open_clip/src/training/train.py, line 217 in 03839c5:
    f"Eval Epoch: {epoch} [{num_samples} / {samples_per_val}]\t"

It does not affect the dataloader, which depends only on the wds pipeline. But it would be good to get it correct anyway; I have to check exactly how many examples each worker receives.
Thanks for the link. I'm not sure why they do drop_last=True on the val_loader (not used here), probably to avoid having one GPU worker with far fewer examples than the others? So rather, they seem to do drop_last and then compute validation performance on the last few examples in all GPU workers.
@mehdidc I think this is actually necessary, or else you can get different val perf when different numbers of GPUs are used, e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223
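A toy illustration of the concern (made-up numbers, not project code): when the eval set does not divide evenly across workers, either some tail samples are dropped or some are duplicated to pad the shards, so the set of evaluated samples, and hence the metrics, can change with the number of GPUs.

    # Toy illustration: splitting a hypothetical 1005-sample eval set across workers.
    num_samples = 1005

    for world_size in (1, 4, 8):
        per_rank = num_samples // world_size            # samples per rank if the tail is dropped
        dropped = num_samples - per_rank * world_size   # tail samples no rank evaluates
        print(f"world_size={world_size}: {per_rank} samples/rank, {dropped} dropped")

    # -> 1 GPU evaluates all 1005 samples; with 4 GPUs, 1 sample is dropped; with 8 GPUs, 5 are dropped.

That is presumably what the linked DeiT comment warns about, and why the tail either needs to be evaluated on all workers or otherwise accounted for.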
I see, thanks @mitchellnw! OK, so I need to fix this. I really thought that …
Is this argument "--distributed_evaluation" not available in the current version?
@dmlpt Not yet, I still need to fix the val dataloader like @mitchellnw mentioned and rebase on master.
Currently, evaluation is done on rank zero. This PR provides support for distributed evaluation (using an optional --distributed-evaluation argument) to make evaluation faster; it supports both zero-shot and retrieval.
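As a rough sketch of the pattern distributed evaluation typically follows (under assumptions, not necessarily this PR's exact code; distributed_zero_shot_eval and classifier are illustrative names): each rank evaluates only its shard of the validation set, the per-rank counts are summed with all_reduce, and rank zero logs the result.

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F


    @torch.no_grad()
    def distributed_zero_shot_eval(model, classifier, dataloader, device, rank):
        # Each rank sees only its shard of the validation set.
        correct = torch.zeros(1, device=device)
        total = torch.zeros(1, device=device)
        for images, targets in dataloader:
            images, targets = images.to(device), targets.to(device)
            image_features = F.normalize(model.encode_image(images), dim=-1)
            logits = image_features @ classifier  # classifier: zero-shot class embeddings, shape (dim, n_classes)
            correct += (logits.argmax(dim=-1) == targets).sum()
            total += images.size(0)

        # Sum the per-rank counts so every rank ends up with the global accuracy.
        dist.all_reduce(correct)
        dist.all_reduce(total)
        acc = (correct / total).item()
        if rank == 0:
            print(f"zero-shot top-1 accuracy: {acc:.4f}")
        return acc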