Commit ea53bcf
fix(counters): train phase was not evaluated (argonne-lcf#328)
* fix(counters): train phase was not evaluated
PR argonne-lcf#302 moved the loop-breaking condition from the end of the
loop to its start.
As a result, self.stats.end_block is never fired for the current block,
because the iteration that would trigger it never starts.
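A minimal sketch of this failure mode, under stated assumptions: `Stats`, `train_buggy` and `train_fixed` are hypothetical stand-ins written for illustration, not the DLIO code, and the fixed variant shows only one possible shape of a fix (closing the block after the loop), which may differ from the actual change in this commit.

```python
class Stats:
    """Hypothetical stand-in for a stats recorder; not the DLIO class."""

    def __init__(self):
        self.events = []

    def start_block(self, block):
        self.events.append(f"start_block {block}")

    def end_block(self, block, steps):
        self.events.append(f"end_block {block} ({steps} steps)")


def train_buggy(batches, max_steps, stats):
    """Break check at the start of the loop body.

    If the loader yields exactly max_steps batches, the extra iteration
    that would hit the check never starts, so end_block is never called.
    """
    stats.start_block(1)
    step = 1
    for _ in batches:
        if step > max_steps:
            stats.end_block(1, step - 1)
            break
        step += 1  # stands in for the real per-step work
    return step - 1


def train_fixed(batches, max_steps, stats):
    """Same loop, but the block is closed after the loop on every exit path."""
    stats.start_block(1)
    step = 1
    for _ in batches:
        if step > max_steps:
            break
        step += 1
    stats.end_block(1, step - 1)
    return step - 1


buggy, fixed = Stats(), Stats()
train_buggy(range(8), 8, buggy)
train_fixed(range(8), 8, fixed)
print(buggy.events)
print(fixed.events)
```

With eight batches and `max_steps=8`, the buggy variant records only the block start, which corresponds to the missing "Ending block" lines in the first log below.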
Trying the regular PyTorch loader from the local fs:
```
[OUTPUT] 2026-02-27T06:58:50.214359 Running DLIO [Training & Evaluation] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-02-27T06:58:50.229669 Max steps per epoch: 128 = 1 * 1024 / 4 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-02-27T06:58:50.229764 Steps per eval: 32 = 1 * 64 / 1 / 2 (samples per file * num files / batch size eval / comm size)
[OUTPUT] 2026-02-27T06:58:50.278417 Starting epoch 1: 128 steps expected
[OUTPUT] 2026-02-27T06:58:50.278614 Starting block 1
[OUTPUT] 2026-02-27T06:59:03.743752 Ending epoch 1 - 128 steps completed in 13.47 s
[OUTPUT] 2026-02-27T06:59:03.747196 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:07.122980 Ending eval - 32 steps completed in 3.38 s
[OUTPUT] 2026-02-27T06:59:07.124598 Epoch 1 [Eval] Accelerator Utilization [AU] (%): 99.4141
[OUTPUT] 2026-02-27T06:59:07.124644 Epoch 1 [Eval] Throughput (samples/second): 18.9592
[OUTPUT] 2026-02-27T06:59:07.130596 Starting epoch 2: 128 steps expected
[OUTPUT] 2026-02-27T06:59:07.130832 Starting block 1
[OUTPUT] 2026-02-27T06:59:20.047588 Ending epoch 2 - 128 steps completed in 12.92 s
[OUTPUT] 2026-02-27T06:59:20.048553 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:23.276666 Ending eval - 32 steps completed in 3.23 s
[OUTPUT] 2026-02-27T06:59:23.277556 Epoch 2 [Eval] Accelerator Utilization [AU] (%): 99.4022
[OUTPUT] 2026-02-27T06:59:23.277595 Epoch 2 [Eval] Throughput (samples/second): 19.8261
[OUTPUT] 2026-02-27T06:59:23.280422 Starting epoch 3: 128 steps expected
[OUTPUT] 2026-02-27T06:59:23.280591 Starting block 1
[OUTPUT] 2026-02-27T06:59:36.196122 Ending epoch 3 - 128 steps completed in 12.92 s
[OUTPUT] 2026-02-27T06:59:36.197005 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:39.425806 Ending eval - 32 steps completed in 3.23 s
[OUTPUT] 2026-02-27T06:59:39.426645 Epoch 3 [Eval] Accelerator Utilization [AU] (%): 99.4032
[OUTPUT] 2026-02-27T06:59:39.426682 Epoch 3 [Eval] Throughput (samples/second): 19.8219
[OUTPUT] 2026-02-27T06:59:39.469524 Saved outputs in /lus/flare/projects/DAOS_Testing/PAP166/hydra_log/default/2026-02-27-06-58-50
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 2
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] Eval Accelerator Utilization [AU] (%): 49.7048 (0.0028)
[METRIC] Eval Throughput (samples/second): 9.765259 (0.206374)
[METRIC] Eval Throughput (MB/second): 0.038146 (0.000806)
[METRIC] eval_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-02-27T06:59:39.484237 outputs saved in RANKID_output.json
```
Notice that the logs only show the start of each block and never its
end.
After the fix:
```
[OUTPUT] 2026-02-28T12:30:28.000590 Running DLIO [Training & Evaluation] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[WARNING] Number of files for training in /dataset/train (4000) is more than requested (64). A subset of files will be used
[WARNING] Number of files for training in /dataset/train (4000) is more than requested (64). A subset of files will be used
[OUTPUT] 2026-02-28T12:30:28.102857 Max steps per epoch: 8 = 1 * 64 / 4 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-02-28T12:30:28.102992 Steps per eval: 4 = 1 * 8 / 1 / 2 (samples per file * num files / batch size eval / comm size)
[OUTPUT] 2026-02-28T12:30:30.572480 Starting epoch 1: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.573084 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.734535 Ending block 1 - 8 steps completed in 0.16 s
[OUTPUT] 2026-02-28T12:30:30.740906 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.1428
[OUTPUT] 2026-02-28T12:30:30.740994 Epoch 1 - Block 1 [Training] Throughput (samples/second): 1753.1357
[OUTPUT] 2026-02-28T12:30:30.741060 Epoch 1 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.741497 Ending epoch 1 - 8 steps completed in 0.17 s
[OUTPUT] 2026-02-28T12:30:30.742789 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.889307 Ending eval - 4 steps completed in 0.15 s
[OUTPUT] 2026-02-28T12:30:30.891985 Epoch 1 [Eval] Accelerator Utilization [AU] (%): 0.0720
[OUTPUT] 2026-02-28T12:30:30.892054 Epoch 1 [Eval] Throughput (samples/second): 54.6620
[OUTPUT] 2026-02-28T12:30:30.900919 Starting epoch 2: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.901249 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.914273 Ending block 1 - 8 steps completed in 0.01 s
[OUTPUT] 2026-02-28T12:30:30.915472 Epoch 2 - Block 1 [Training] Accelerator Utilization [AU] (%): 1.9055
[OUTPUT] 2026-02-28T12:30:30.915541 Epoch 2 - Block 1 [Training] Throughput (samples/second): 7765.7316
[OUTPUT] 2026-02-28T12:30:30.915595 Epoch 2 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.915931 Ending epoch 2 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.917061 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.958733 Ending eval - 4 steps completed in 0.04 s
[OUTPUT] 2026-02-28T12:30:30.959729 Epoch 2 [Eval] Accelerator Utilization [AU] (%): 0.0381
[OUTPUT] 2026-02-28T12:30:30.959768 Epoch 2 [Eval] Throughput (samples/second): 192.2493
[OUTPUT] 2026-02-28T12:30:30.960091 Starting epoch 3: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.960275 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.976061 Ending block 1 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.977423 Epoch 3 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.6369
[OUTPUT] 2026-02-28T12:30:30.977483 Epoch 3 - Block 1 [Training] Throughput (samples/second): 6020.3520
[OUTPUT] 2026-02-28T12:30:30.977534 Epoch 3 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.977792 Ending epoch 3 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.978884 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.983803 Ending eval - 4 steps completed in 0.00 s
[OUTPUT] 2026-02-28T12:30:30.984927 Epoch 3 [Eval] Accelerator Utilization [AU] (%): 1.3682
[OUTPUT] 2026-02-28T12:30:30.984986 Epoch 3 [Eval] Throughput (samples/second): 1641.1245
[OUTPUT] 2026-02-28T12:30:30.986010 Saved outputs in /home/denis/dev/enakta/dlio_benchmark/hydra_log/default/2026-02-28-12-30-25
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 2
[METRIC] Training Accelerator Utilization [AU] (%): 0.5939 (0.4129)
[METRIC] Training Throughput (samples/second): 4948.3957 (2466.6534)
[METRIC] Training I/O Throughput (MB/second): 19.3297 (9.6354)
[METRIC] train_au_meet_expectation: fail
[METRIC] Eval Accelerator Utilization [AU] (%): 0.4704 (0.5038)
[METRIC] Eval Throughput (samples/second): 444.414075 (396.070635)
[METRIC] Eval Throughput (MB/second): 1.735992 (1.547151)
[METRIC] eval_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-02-28T12:30:30.987839 outputs saved in RANKID_output.json
```
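The step counts in both logs follow the formula printed in the "Max steps per epoch" and "Steps per eval" lines. A small sketch of that arithmetic, assuming integer division (the logged values all divide evenly, so the logs do not show how remainders are handled); `steps_per_epoch` is a hypothetical helper name:

```python
def steps_per_epoch(samples_per_file, num_files, batch_size, comm_size):
    """samples per file * num files / batch size / comm size, as logged.

    Integer division is an assumption; the logs only show exact cases.
    """
    return samples_per_file * num_files // batch_size // comm_size


# Values taken from the two logs above.
print(steps_per_epoch(1, 1024, 4, 2))  # train, first run
print(steps_per_epoch(1, 64, 1, 2))    # eval, first run
print(steps_per_epoch(1, 64, 4, 2))    # train, run after the fix
print(steps_per_epoch(1, 8, 1, 2))     # eval, run after the fix
```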
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
* fix: remove unreachable branch
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
---------
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Co-authored-by: Denis Barakhtanov <denis.barahtanov@gmail.com>
1 parent 57148a1 commit ea53bcf
1 file changed: +3 -3 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 331 | 331 | |
| 332 | 332 | |
| 333 | 333 | |
| 334 | | - |
| 335 | | - |
| 336 | 334 | |
| 337 | 335 | |
| 338 | 336 | |
| | | |
| 361 | 359 | |
| 362 | 360 | |
| 363 | 361 | |
| | 362 | + |
| | 363 | + |
| | 364 | + |
| 364 | 365 | |
| 365 | 366 | |
| 366 | | - |
| 367 | 367 | |
| 368 | 368 | |
| 369 | 369 | |