Add per node metrics #136
I think having forward time per node might be the most useful.
I'm going to do that.
I implemented that and was able to find a faulty GPU by printing host name + device + forward time. One GPU can slow down the whole training.
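A minimal sketch of that kind of per-rank forward timing, assuming a PyTorch DDP training loop; the model, images, and texts names here are placeholders, not the actual open_clip training-loop variables:

```python
import socket
import time

import torch


def timed_forward(model, images, texts, device):
    """Run the forward pass and print host + device + forward time for this rank."""
    # Synchronize so the timer measures only this rank's forward pass,
    # not CUDA work queued by earlier operations.
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    start = time.perf_counter()

    output = model(images, texts)

    if device.type == "cuda":
        torch.cuda.synchronize(device)
    forward_time = time.perf_counter() - start

    # One line per rank: a consistently slow GPU stands out immediately.
    print(f"host={socket.gethostname()} device={device} forward_time={forward_time:.3f}s")
    return output
```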
@rom1504 there was some thought given to this in my refactoring, but I never fully realized it. is_master has a local arg, set via args.log_local (or something similar), that returns true for local rank 0 instead of the global rank 0. I was originally just using it for logging at startup (debugging JUWELS startup issues), but it could be extended to support the train-loop log:
open_clip/src/training/train.py Line 105 in 15bb1f7
is_master(args, local=args.log_local)
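For reference, a minimal sketch of what a local-aware is_master helper might look like, assuming args carries the global rank and the per-node local rank; this is an illustration, not the exact open_clip implementation:

```python
def is_master(args, local=False):
    # local=True  -> rank 0 on this node (one logger per node)
    # local=False -> rank 0 across the whole job (one logger overall)
    return args.local_rank == 0 if local else args.rank == 0


# Gating the train-loop log so each node's local rank 0 reports its own metrics:
# if is_master(args, local=args.log_local):
#     logging.info(f"forward_time={forward_time:.3f}s")
```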
But a question: won't the slow node cause the step to be slow for all nodes, since the DDP backprop sync is blocking? You'd need another timing taken before backprop, no?
Yes, I timed only the forward pass for this reason.
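To make the distinction concrete, a hedged sketch (assuming a standard PyTorch DDP step on CUDA; loss_fn and the batch layout are placeholders) comparing forward time, which differs per rank when one GPU is slow, with full step time, which the blocking gradient all-reduce equalizes across ranks:

```python
import time

import torch


def timed_step(model, images, texts, loss_fn, optimizer, device):
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()

    output = model(images, texts)          # forward only: a genuinely per-rank signal
    torch.cuda.synchronize(device)
    forward_time = time.perf_counter() - t0

    loss = loss_fn(output)
    loss.backward()                        # DDP's gradient all-reduce waits on every rank
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize(device)
    step_time = time.perf_counter() - t0   # roughly equal on all ranks

    return forward_time, step_time
```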
Per-node metrics would make it possible to debug broken nodes; it's just not obvious how to do it.