Add per node metrics #136
I think having forward time per node might be the most useful.
I'm going to do that.
I implemented that and was able to find a faulty GPU by printing host name + device + forward time. One GPU can slow down the whole training.
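A minimal sketch of that kind of per-rank forward timing, assuming a PyTorch DDP training loop; the model, images, and texts names here are placeholders, not the actual open_clip training-loop variables:

```python
import socket
import time

import torch


def timed_forward(model, images, texts, device):
    """Run the forward pass and print host + device + forward time for this rank."""
    # Synchronize so the timer measures only this rank's forward pass,
    # not CUDA work queued by earlier operations.
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    start = time.perf_counter()

    output = model(images, texts)

    if device.type == "cuda":
        torch.cuda.synchronize(device)
    forward_time = time.perf_counter() - start

    # One line per rank: a consistently slow GPU stands out immediately.
    print(f"host={socket.gethostname()} device={device} forward_time={forward_time:.3f}s")
    return output
```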
@rom1504 there was some thought given to this in my refactoring, but I never fully realized it. is_master has a local arg, set via args.log_local (or something similar), that returns true for local rank 0 instead of the global rank 0. I was originally just using it for logging at startup (debugging JUWELS startup issues), but it could be extended to support the train-loop log:
open_clip/src/training/train.py Line 105 in 15bb1f7
is_master(args, local=args.log_local)
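For reference, a minimal sketch of what a local-aware is_master helper might look like, assuming args carries the global rank and the per-node local rank; this is an illustration, not the exact open_clip implementation:

```python
def is_master(args, local=False):
    # local=True  -> rank 0 on this node (one logger per node)
    # local=False -> rank 0 across the whole job (one logger overall)
    return args.local_rank == 0 if local else args.rank == 0


# Gating the train-loop log so each node's local rank 0 reports its own metrics:
# if is_master(args, local=args.log_local):
#     logging.info(f"forward_time={forward_time:.3f}s")
```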
But a question: won't the slow node cause the step to be slow for all nodes, since the DDP backprop sync is blocking? You'd need another timing taken before backprop, no?
Yes, I timed only the forward pass for this reason.
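To make the distinction concrete, a hedged sketch (assuming a standard PyTorch DDP step on CUDA; loss_fn and the batch layout are placeholders) comparing forward time, which differs per rank when one GPU is slow, with full step time, which the blocking gradient all-reduce equalizes across ranks:

```python
import time

import torch


def timed_step(model, images, texts, loss_fn, optimizer, device):
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()

    output = model(images, texts)          # forward only: a genuinely per-rank signal
    torch.cuda.synchronize(device)
    forward_time = time.perf_counter() - t0

    loss = loss_fn(output)
    loss.backward()                        # DDP's gradient all-reduce waits on every rank
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize(device)
    step_time = time.perf_counter() - t0   # roughly equal on all ranks

    return forward_time, step_time
```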
Per-node metrics would make it possible to debug broken nodes; it's just not obvious how to do it.