Add per node metrics #136

Open
rom1504 opened this issue Jul 29, 2022 · 5 comments

Comments

@rom1504
Collaborator

rom1504 commented Jul 29, 2022

Per-node metrics would make it possible to debug broken nodes.

It's not obvious how to implement them, though.

@rom1504
Collaborator Author

rom1504 commented Jul 29, 2022

I think the forward time per node might be the most useful metric.

@rom1504
Collaborator Author

rom1504 commented Jul 30, 2022

For now, I'm going to use --dataset-resampled, decrease the number of samples per epoch, and increase the number of epochs.

@rom1504
Collaborator Author

rom1504 commented Jul 30, 2022

I implemented that and was able to find a faulty GPU by printing the host name, device, and forward time on every node.
This is a necessary feature.

A single slow GPU can slow down the whole training run.
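
For illustration, here is a minimal sketch of that kind of per-node print. It is not the code used in the repository: `forward_timer`, `format_timing`, and the `sync` parameter are hypothetical names. On a GPU, the forward pass is queued asynchronously, so one would likely pass something like `torch.cuda.synchronize` as `sync` to get a meaningful wall-clock time.

```python
import socket
import time
from contextlib import contextmanager


@contextmanager
def forward_timer(device, results, sync=None):
    """Time the enclosed block and append (host, device, seconds) to `results`.

    `sync` is an optional callable (assumption: e.g. torch.cuda.synchronize)
    that blocks until queued device work has finished.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        elapsed = time.perf_counter() - start
        results.append((socket.gethostname(), str(device), elapsed))


def format_timing(host, device, seconds):
    """Build the log line printed by every rank, not only the master."""
    return f"host={host} device={device} forward_time={seconds:.4f}s"


if __name__ == "__main__":
    timings = []
    with forward_timer("cuda:0", timings):
        sum(range(100_000))  # stand-in for the model's forward pass
    print(format_timing(*timings[-1]))
```

Only the forward pass sits inside the timed block, so the measurement is taken before any gradient synchronization can make every rank wait on the slowest one.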

@rwightman
Collaborator

@rom1504 there was some thought for this in my refactoring, but I never fully realized it. is_master has a local arg, set via args.log_local (or something similar), that makes it return true for local rank 0 instead of global rank 0. I was originally just using it for logging at startup (debugging JUWELS startup issues).

It could be extended to support the train loop log if checks like

if is_master(args) and (i % 100 == 0 or batch_count == num_batches_per_epoch):

are changed to use is_master(args, local=args.log_local).

But one question: won't the slow node cause the step to be slow on all nodes, since the DDP backprop sync is blocking? You'd need another timing point before backprop, no?
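
The logging change suggested above could be sketched roughly as follows. This is an assumption-laden illustration, not the project's actual code: the body of `is_master`, the `should_log` helper, and the `rank` / `local_rank` attribute names are hypothetical, and `log_local` is only the flag name recalled above ("or something similar").

```python
from types import SimpleNamespace


def is_master(args, local=False):
    """Hypothetical stand-in: rank 0 on this node if `local`, else global rank 0."""
    return args.local_rank == 0 if local else args.rank == 0


def should_log(args, i, batch_count, num_batches_per_epoch, log_every=100):
    """Train-loop logging condition with the local flag threaded through."""
    return is_master(args, local=args.log_local) and (
        i % log_every == 0 or batch_count == num_batches_per_epoch
    )


if __name__ == "__main__":
    # First process on the second node of a hypothetical 8-GPU-per-node job.
    args = SimpleNamespace(rank=8, local_rank=0, log_local=True)
    print(should_log(args, i=100, batch_count=101, num_batches_per_epoch=500))
```

With `log_local` enabled, one process per node emits the train-loop log instead of a single process for the whole job, which is what makes per-node comparison possible.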

@rom1504
Collaborator Author

rom1504 commented Aug 5, 2022

Yes, I timed only the forward pass for this reason.
