Commit ea53bcf

Authored by enakta0xE0F and a co-author
fix(counters): train phase was not evaluated (argonne-lcf#328)
* fix(counters): train phase was not evaluated

PR argonne-lcf#302 moved the loop-breaking condition from the end of the loop to its start. As a result, `self.stats.end_block` never fires for the current block, because the iteration never starts. Running the regular PyTorch loader from a local filesystem:

```
[OUTPUT] 2026-02-27T06:58:50.214359 Running DLIO [Training & Evaluation] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-02-27T06:58:50.229669 Max steps per epoch: 128 = 1 * 1024 / 4 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-02-27T06:58:50.229764 Steps per eval: 32 = 1 * 64 / 1 / 2 (samples per file * num files / batch size eval / comm size)
[OUTPUT] 2026-02-27T06:58:50.278417 Starting epoch 1: 128 steps expected
[OUTPUT] 2026-02-27T06:58:50.278614 Starting block 1
[OUTPUT] 2026-02-27T06:59:03.743752 Ending epoch 1 - 128 steps completed in 13.47 s
[OUTPUT] 2026-02-27T06:59:03.747196 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:07.122980 Ending eval - 32 steps completed in 3.38 s
[OUTPUT] 2026-02-27T06:59:07.124598 Epoch 1 [Eval] Accelerator Utilization [AU] (%): 99.4141
[OUTPUT] 2026-02-27T06:59:07.124644 Epoch 1 [Eval] Throughput (samples/second): 18.9592
[OUTPUT] 2026-02-27T06:59:07.130596 Starting epoch 2: 128 steps expected
[OUTPUT] 2026-02-27T06:59:07.130832 Starting block 1
[OUTPUT] 2026-02-27T06:59:20.047588 Ending epoch 2 - 128 steps completed in 12.92 s
[OUTPUT] 2026-02-27T06:59:20.048553 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:23.276666 Ending eval - 32 steps completed in 3.23 s
[OUTPUT] 2026-02-27T06:59:23.277556 Epoch 2 [Eval] Accelerator Utilization [AU] (%): 99.4022
[OUTPUT] 2026-02-27T06:59:23.277595 Epoch 2 [Eval] Throughput (samples/second): 19.8261
[OUTPUT] 2026-02-27T06:59:23.280422 Starting epoch 3: 128 steps expected
[OUTPUT] 2026-02-27T06:59:23.280591 Starting block 1
[OUTPUT] 2026-02-27T06:59:36.196122 Ending epoch 3 - 128 steps completed in 12.92 s
[OUTPUT] 2026-02-27T06:59:36.197005 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:39.425806 Ending eval - 32 steps completed in 3.23 s
[OUTPUT] 2026-02-27T06:59:39.426645 Epoch 3 [Eval] Accelerator Utilization [AU] (%): 99.4032
[OUTPUT] 2026-02-27T06:59:39.426682 Epoch 3 [Eval] Throughput (samples/second): 19.8219
[OUTPUT] 2026-02-27T06:59:39.469524 Saved outputs in /lus/flare/projects/DAOS_Testing/PAP166/hydra_log/default/2026-02-27-06-58-50
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 2
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] Eval Accelerator Utilization [AU] (%): 49.7048 (0.0028)
[METRIC] Eval Throughput (samples/second): 9.765259 (0.206374)
[METRIC] Eval Throughput (MB/second): 0.038146 (0.000806)
[METRIC] eval_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-02-27T06:59:39.484237 outputs saved in RANKID_output.json
```

Notice that the logs only show the start of each block, never its end. After the fix:

```
[OUTPUT] 2026-02-28T12:30:28.000590 Running DLIO [Training & Evaluation] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[WARNING] Number of files for training in /dataset/train (4000) is more than requested (64). A subset of files will be used
[WARNING] Number of files for training in /dataset/train (4000) is more than requested (64). A subset of files will be used
[OUTPUT] 2026-02-28T12:30:28.102857 Max steps per epoch: 8 = 1 * 64 / 4 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-02-28T12:30:28.102992 Steps per eval: 4 = 1 * 8 / 1 / 2 (samples per file * num files / batch size eval / comm size)
[OUTPUT] 2026-02-28T12:30:30.572480 Starting epoch 1: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.573084 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.734535 Ending block 1 - 8 steps completed in 0.16 s
[OUTPUT] 2026-02-28T12:30:30.740906 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.1428
[OUTPUT] 2026-02-28T12:30:30.740994 Epoch 1 - Block 1 [Training] Throughput (samples/second): 1753.1357
[OUTPUT] 2026-02-28T12:30:30.741060 Epoch 1 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.741497 Ending epoch 1 - 8 steps completed in 0.17 s
[OUTPUT] 2026-02-28T12:30:30.742789 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.889307 Ending eval - 4 steps completed in 0.15 s
[OUTPUT] 2026-02-28T12:30:30.891985 Epoch 1 [Eval] Accelerator Utilization [AU] (%): 0.0720
[OUTPUT] 2026-02-28T12:30:30.892054 Epoch 1 [Eval] Throughput (samples/second): 54.6620
[OUTPUT] 2026-02-28T12:30:30.900919 Starting epoch 2: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.901249 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.914273 Ending block 1 - 8 steps completed in 0.01 s
[OUTPUT] 2026-02-28T12:30:30.915472 Epoch 2 - Block 1 [Training] Accelerator Utilization [AU] (%): 1.9055
[OUTPUT] 2026-02-28T12:30:30.915541 Epoch 2 - Block 1 [Training] Throughput (samples/second): 7765.7316
[OUTPUT] 2026-02-28T12:30:30.915595 Epoch 2 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.915931 Ending epoch 2 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.917061 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.958733 Ending eval - 4 steps completed in 0.04 s
[OUTPUT] 2026-02-28T12:30:30.959729 Epoch 2 [Eval] Accelerator Utilization [AU] (%): 0.0381
[OUTPUT] 2026-02-28T12:30:30.959768 Epoch 2 [Eval] Throughput (samples/second): 192.2493
[OUTPUT] 2026-02-28T12:30:30.960091 Starting epoch 3: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.960275 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.976061 Ending block 1 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.977423 Epoch 3 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.6369
[OUTPUT] 2026-02-28T12:30:30.977483 Epoch 3 - Block 1 [Training] Throughput (samples/second): 6020.3520
[OUTPUT] 2026-02-28T12:30:30.977534 Epoch 3 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.977792 Ending epoch 3 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.978884 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.983803 Ending eval - 4 steps completed in 0.00 s
[OUTPUT] 2026-02-28T12:30:30.984927 Epoch 3 [Eval] Accelerator Utilization [AU] (%): 1.3682
[OUTPUT] 2026-02-28T12:30:30.984986 Epoch 3 [Eval] Throughput (samples/second): 1641.1245
[OUTPUT] 2026-02-28T12:30:30.986010 Saved outputs in /home/denis/dev/enakta/dlio_benchmark/hydra_log/default/2026-02-28-12-30-25
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 2
[METRIC] Training Accelerator Utilization [AU] (%): 0.5939 (0.4129)
[METRIC] Training Throughput (samples/second): 4948.3957 (2466.6534)
[METRIC] Training I/O Throughput (MB/second): 19.3297 (9.6354)
[METRIC] train_au_meet_expectation: fail
[METRIC] Eval Accelerator Utilization [AU] (%): 0.4704 (0.5038)
[METRIC] Eval Throughput (samples/second): 444.414075 (396.070635)
[METRIC] Eval Throughput (MB/second): 1.735992 (1.547151)
[METRIC] eval_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-02-28T12:30:30.987839 outputs saved in RANKID_output.json
```

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>

* fix: remove unreachable branch

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>

---------

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Co-authored-by: Denis Barakhtanov <denis.barahtanov@gmail.com>
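The failure mode described in the commit message can be reproduced in isolation. The sketch below uses hypothetical names (`Stats`, `train_buggy`, `train_fixed`) — it is not dlio_benchmark's actual `StatsCounter` code — but it shows the same pattern: once the break condition sits at the top of the loop body, any bookkeeping that used to run just before a bottom-of-loop `break` never executes, so the block is never closed and the training metrics average to zero.

```python
# Hypothetical minimal model of the block-accounting bug (not dlio_benchmark code).

class Stats:
    """Tracks whether a training block is open and records closed blocks."""

    def __init__(self):
        self.block_open = False
        self.ended_blocks = []

    def start_block(self, block):
        self.block_open = True

    def end_block(self, block, steps):
        # Guard: ending an already-ended block is a no-op, so an
        # unconditional call after the loop is always safe.
        if not self.block_open:
            return
        self.block_open = False
        self.ended_blocks.append((block, steps))


def train_buggy(stats, max_steps):
    step = 0
    stats.start_block(1)
    while True:
        step += 1
        if step > max_steps:  # break condition moved to the top of the loop...
            break             # ...so the end_block call below it never runs
    # Block is left open: no "Ending block" log, training metrics stay at 0.


def train_fixed(stats, max_steps):
    step = 0
    stats.start_block(1)
    while True:
        step += 1
        if step > max_steps:
            break
    # The fix: always close the current block after the loop. The guard
    # inside end_block makes this safe even if the block already ended.
    stats.end_block(1, step - 1)


buggy, fixed = Stats(), Stats()
train_buggy(buggy, 8)
train_fixed(fixed, 8)
print(buggy.ended_blocks)  # [] — the block was never ended
print(fixed.ended_blocks)  # [(1, 8)]
```

With the block never ended, a stats collector that averages per-block metrics has nothing to average — which matches the pre-fix output above, where Training AU and throughput are reported as 0.0000.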
1 parent 57148a1 commit ea53bcf

File tree

1 file changed: +3 −3 lines


dlio_benchmark/main.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -331,8 +331,6 @@ def _train(self, epoch):
             if overall_step > max_steps or ((self.total_training_steps > 0) and (overall_step > self.total_training_steps)):
                 if self.args.my_rank == 0:
                     self.logger.info(f"{utcnow()} Maximum number of steps reached")
-                if (block_step != 1 and self.do_checkpoint) or (not self.do_checkpoint):
-                    self.stats.end_block(epoch, block, block_step - 1)
                 break
             self.stats.batch_loaded(epoch, overall_step, block)
             computation_time = self.args.computation_time
@@ -361,9 +359,11 @@ def _train(self, epoch):
                 self.stats.start_block(epoch, block)
                 self.stats.start_loading()
 
+        # Always closes the current block. It is safe to call end_block for already ended block, as there's a guard inside.
+        self.stats.end_block(epoch, block, block_step - 1)
+
         self.comm.barrier()
         if self.do_checkpoint and (self.steps_between_checkpoints < 0) and (epoch == self.next_checkpoint_epoch):
-            self.stats.end_block(epoch, block, block_step-1)
             self.stats.start_save_ckpt(epoch, block, overall_step-1)
             self.checkpointing_mechanism.save_checkpoint(epoch, overall_step)
             self.stats.end_save_ckpt(epoch, block)
```
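The added comment in the diff relies on `end_block` being idempotent: calling it twice must not double-count a block. A minimal sketch of such a guard (hypothetical `BlockStats` class, assumed behavior — not the actual dlio_benchmark implementation) uses a state flag so repeated calls after the first are no-ops:

```python
# Sketch of an idempotent end_block guard (hypothetical names/behavior).

class BlockStats:
    def __init__(self):
        self._in_block = False      # True only between start_block and end_block
        self.end_calls_applied = 0  # how many end_block calls actually took effect

    def start_block(self, epoch, block):
        self._in_block = True

    def end_block(self, epoch, block, steps):
        if not self._in_block:  # guard: block already ended, do nothing
            return
        self._in_block = False
        self.end_calls_applied += 1


s = BlockStats()
s.start_block(1, 1)
s.end_block(1, 1, 8)  # closes the block and records it
s.end_block(1, 1, 8)  # safe no-op thanks to the guard
print(s.end_calls_applied)  # 1
```

This is why the fix can call `end_block` unconditionally after the loop and also delete the duplicate call in the checkpoint branch: whichever call runs first closes the block, and any later call is absorbed by the guard.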
