Support distributed evaluation #176
base: main
Conversation
This LGTM, could you please double check @mitchellnw or @rwightman?
In the PyTorch ImageNet example for distributed ImageNet eval, they have an …
        assert wandb is not None, 'Please install wandb.'
        for name, val in metrics.items():
            wandb.log({f"val/{name}": val, 'epoch': epoch})

    return metrics


-def get_metrics(image_features, text_features, logit_scale):
+def get_metrics(image_features, text_features, logit_scale, args):
Nitpick, but maybe it would be better to pass a more specific argument, e.g. gather_tensor=args.distributed_evaluation?
Thanks, I agree, I think that's better.
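For context, here is a minimal sketch of what the suggested signature could look like, assuming the per-rank features are gathered with torch.distributed.all_gather before the usual retrieval metrics are computed. gather_features is a hypothetical helper and the metric computation below only approximates the existing get_metrics; the actual code in this PR may differ.

    import torch
    import torch.distributed as dist


    def gather_features(features):
        # Hypothetical helper (not from this PR): gather a tensor from every rank
        # and concatenate along the batch dimension.
        gathered = [torch.zeros_like(features) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, features)
        return torch.cat(gathered, dim=0)


    def get_metrics(image_features, text_features, logit_scale, gather_tensor=False):
        # gather_tensor would be passed as gather_tensor=args.distributed_evaluation,
        # so that every rank computes retrieval metrics over the full evaluation set.
        if gather_tensor:
            image_features = gather_features(image_features)
            text_features = gather_features(text_features)

        metrics = {}
        logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu()
        logits_per_text = logits_per_image.t()

        ground_truth = torch.arange(len(text_features)).view(-1, 1)
        for name, logit in (("image_to_text", logits_per_image), ("text_to_image", logits_per_text)):
            ranking = torch.argsort(logit, descending=True)
            preds = torch.where(ranking == ground_truth)[1]
            for k in (1, 5, 10):
                metrics[f"{name}_R@{k}"] = (preds < k).float().mean().item()
        return metrics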
@@ -359,7 +361,8 @@ def get_wds_dataset(args, preprocess_img, is_train, epoch=0, floor=False):
        num_samples = num_batches * global_batch_size
        dataset = dataset.with_epoch(num_worker_batches)  # each worker is iterating over this
    else:
        # last batches are partial, eval is done on single (master) node
        if args.distributed_evaluation:
            num_samples = num_samples // args.world_size
Maybe I'm misunderstanding, but does it skip the last few samples due to the last partial batch?
In evaluation, num_samples is only used for logging:

open_clip/src/training/train.py, line 172 in 03839c5:
    samples_per_val = dataloader.num_samples

open_clip/src/training/train.py, line 217 in 03839c5:
    f"Eval Epoch: {epoch} [{num_samples} / {samples_per_val}]\t"

It does not affect the dataloader, which depends only on the wds pipeline. But it would be good to get it correct anyway; I have to check exactly how many examples each worker receives.
Thanks for the link. I'm not sure why they do drop_last=True on the val_loader (not used here), probably to avoid having one GPU worker with far fewer examples than the others? So rather, they seem to do drop_last and then compute validation performance on the last few examples in all GPU workers.
@mehdidc I think this is actually necessary, or else you can get different val perf when different numbers of GPUs are used, e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223
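A toy illustration of the concern (made-up numbers, not project code): when the eval set does not divide evenly across workers, either some tail samples are dropped or some are duplicated to pad the shards, so the set of evaluated samples, and hence the metrics, can change with the number of GPUs.

    # Toy illustration: splitting a hypothetical 1005-sample eval set across workers.
    num_samples = 1005

    for world_size in (1, 4, 8):
        per_rank = num_samples // world_size            # samples per rank if the tail is dropped
        dropped = num_samples - per_rank * world_size   # tail samples no rank evaluates
        print(f"world_size={world_size}: {per_rank} samples/rank, {dropped} dropped")

    # -> 1 GPU evaluates all 1005 samples; with 4 GPUs, 1 sample is dropped; with 8 GPUs, 5 are dropped.

That is presumably what the linked DeiT comment warns about, and why the tail either needs to be evaluated on all workers or otherwise accounted for.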
I see, thanks @mitchellnw! OK, so I need to fix this. I really thought that …
Is this argument "--distributed_evaluation" not available in the current version?
@dmlpt Not yet, I still need to fix the val dataloader like @mitchellnw mentioned and rebase on master.
Currently, evaluation is done on rank zero. This PR provides support for distributed evaluation (using an optional --distributed-evaluation argument) to make evaluation faster; it supports both zero-shot and retrieval.
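As a rough sketch of the pattern distributed evaluation typically follows (under assumptions, not necessarily this PR's exact code; distributed_zero_shot_eval and classifier are illustrative names): each rank evaluates only its shard of the validation set, the per-rank counts are summed with all_reduce, and rank zero logs the result.

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F


    @torch.no_grad()
    def distributed_zero_shot_eval(model, classifier, dataloader, device, rank):
        # Each rank sees only its shard of the validation set.
        correct = torch.zeros(1, device=device)
        total = torch.zeros(1, device=device)
        for images, targets in dataloader:
            images, targets = images.to(device), targets.to(device)
            image_features = F.normalize(model.encode_image(images), dim=-1)
            logits = image_features @ classifier  # classifier: zero-shot class embeddings, shape (dim, n_classes)
            correct += (logits.argmax(dim=-1) == targets).sum()
            total += images.size(0)

        # Sum the per-rank counts so every rank ends up with the global accuracy.
        dist.all_reduce(correct)
        dist.all_reduce(total)
        acc = (correct / total).item()
        if rank == 0:
            print(f"zero-shot top-1 accuracy: {acc:.4f}")
        return acc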