Commit 8ef59dc
Sync: Reset to upstream/main (#4)
* refactor: convert direct imports to lazy imports in profiler_factory (argonne-lcf#325)
- Move profiler imports inside the get_profiler() method
- Benefits:
  - Avoids loading TFProfiler (which imports tensorflow) unless needed
  - Reduces import overhead for users not using the TENSORBOARD profiler
  - The default profiler (IOSTAT) no longer triggers a tensorflow import
- No breaking changes: same API, same behavior
* feat: add native AIStore storage backend (argonne-lcf#321)
Add a native AIStore storage handler that uses the official AIStore
Python SDK for direct access, bypassing the S3 compatibility layer
for better performance and simpler configuration.
Changes:
- Add AIStoreStorage class with full CRUD operations, range reads,
and prefix-based object listing
- Add StorageType.AISTORE enum and wire it through StorageFactory,
GeneratorFactory, and ReaderFactory (reuses S3 generators/readers)
- Add AIStore endpoint configuration support in ConfigArguments
- Add 'aistore' optional dependency in setup.py
- Add mock-based test suite with full AIStore SDK simulation
- Add CI workflow for AIStore tests
- Add storage configuration section to documentation
Supported formats: NPY, NPZ, JPEG
Supported frameworks: PyTorch, TensorFlow
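The surface of such a storage handler (full CRUD, range reads, prefix-based listing) can be sketched with an in-memory dict standing in for the AIStore SDK client; the real AIStoreStorage calls the SDK instead, and the method names here are illustrative, modeled on the storage interface this commit describes.

```python
class InMemoryStorageSketch:
    """Hypothetical sketch of the CRUD + range-read + prefix-listing
    surface described above; a dict replaces the AIStore SDK client."""

    def __init__(self):
        self._objects = {}  # object name -> bytes

    def put_data(self, name: str, data: bytes) -> None:
        self._objects[name] = data

    def get_data(self, name: str, offset: int = 0, length: int = None) -> bytes:
        # Range read: return `length` bytes starting at `offset`.
        blob = self._objects[name]
        end = len(blob) if length is None else offset + length
        return blob[offset:end]

    def delete_data(self, name: str) -> None:
        del self._objects[name]

    def list_objects(self, prefix: str = ""):
        # Prefix-based object listing, as S3-style stores expose it.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = InMemoryStorageSketch()
store.put_data("train/a.npy", b"abcdef")
store.put_data("valid/b.npy", b"xyz")
```

This is also roughly the shape a mock-based test suite (as added in this PR) would exercise, since it needs no live AIStore cluster.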
Signed-off-by: Abhishek Gaikwad <gaikwadabhishek1997@gmail.com>
* fix(counters): train phase was not evaluated (argonne-lcf#328)
* fix(counters): train phase was not evaluated
PR argonne-lcf#302 moved the loop-breaking condition from the end of the
loop to its start.
As a result, self.stats.end_block is never fired for the current block,
because the final iteration never starts.
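The failure mode can be sketched as follows; the loop and stats object are simplified and hypothetical, not the actual DLIO training loop:

```python
class Stats:
    """Minimal stand-in for the benchmark's stats tracker."""
    def __init__(self):
        self.events = []
    def start_block(self):
        self.events.append("start_block")
    def end_block(self):
        self.events.append("end_block")


def run_block_broken(stats, steps):
    # Break condition checked at the *start* of the loop: the loop
    # exits before the end-of-block code below can ever run.
    stats.start_block()
    step = 1
    while True:
        if step > steps:  # condition moved to the top of the loop
            break
        # ... one training step ...
        step += 1
        # the end_block() call that used to live at the bottom of the
        # loop body is now unreachable for the final iteration
    return stats


def run_block_fixed(stats, steps):
    # Break condition checked at the *end* of the loop body, after
    # the block-ending stats have been recorded.
    stats.start_block()
    step = 1
    while True:
        # ... one training step ...
        if step >= steps:
            stats.end_block()  # fires on the last real step
            break
        step += 1
    return stats
```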
Running a regular PyTorch loader from a local filesystem:
```
[OUTPUT] 2026-02-27T06:58:50.214359 Running DLIO [Training & Evaluation] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-02-27T06:58:50.229669 Max steps per epoch: 128 = 1 * 1024 / 4 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-02-27T06:58:50.229764 Steps per eval: 32 = 1 * 64 / 1 / 2 (samples per file * num files / batch size eval / comm size)
[OUTPUT] 2026-02-27T06:58:50.278417 Starting epoch 1: 128 steps expected
[OUTPUT] 2026-02-27T06:58:50.278614 Starting block 1
[OUTPUT] 2026-02-27T06:59:03.743752 Ending epoch 1 - 128 steps completed in 13.47 s
[OUTPUT] 2026-02-27T06:59:03.747196 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:07.122980 Ending eval - 32 steps completed in 3.38 s
[OUTPUT] 2026-02-27T06:59:07.124598 Epoch 1 [Eval] Accelerator Utilization [AU] (%): 99.4141
[OUTPUT] 2026-02-27T06:59:07.124644 Epoch 1 [Eval] Throughput (samples/second): 18.9592
[OUTPUT] 2026-02-27T06:59:07.130596 Starting epoch 2: 128 steps expected
[OUTPUT] 2026-02-27T06:59:07.130832 Starting block 1
[OUTPUT] 2026-02-27T06:59:20.047588 Ending epoch 2 - 128 steps completed in 12.92 s
[OUTPUT] 2026-02-27T06:59:20.048553 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:23.276666 Ending eval - 32 steps completed in 3.23 s
[OUTPUT] 2026-02-27T06:59:23.277556 Epoch 2 [Eval] Accelerator Utilization [AU] (%): 99.4022
[OUTPUT] 2026-02-27T06:59:23.277595 Epoch 2 [Eval] Throughput (samples/second): 19.8261
[OUTPUT] 2026-02-27T06:59:23.280422 Starting epoch 3: 128 steps expected
[OUTPUT] 2026-02-27T06:59:23.280591 Starting block 1
[OUTPUT] 2026-02-27T06:59:36.196122 Ending epoch 3 - 128 steps completed in 12.92 s
[OUTPUT] 2026-02-27T06:59:36.197005 Starting eval - 32 steps expected
[OUTPUT] 2026-02-27T06:59:39.425806 Ending eval - 32 steps completed in 3.23 s
[OUTPUT] 2026-02-27T06:59:39.426645 Epoch 3 [Eval] Accelerator Utilization [AU] (%): 99.4032
[OUTPUT] 2026-02-27T06:59:39.426682 Epoch 3 [Eval] Throughput (samples/second): 19.8219
[OUTPUT] 2026-02-27T06:59:39.469524 Saved outputs in /lus/flare/projects/DAOS_Testing/PAP166/hydra_log/default/2026-02-27-06-58-50
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 2
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] Eval Accelerator Utilization [AU] (%): 49.7048 (0.0028)
[METRIC] Eval Throughput (samples/second): 9.765259 (0.206374)
[METRIC] Eval Throughput (MB/second): 0.038146 (0.000806)
[METRIC] eval_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-02-27T06:59:39.484237 outputs saved in RANKID_output.json
```
Notice that the logs only show the start of the block, never its end.
After the fix:
```
[OUTPUT] 2026-02-28T12:30:28.000590 Running DLIO [Training & Evaluation] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[WARNING] Number of files for training in /dataset/train (4000) is more than requested (64). A subset of files will be used
[WARNING] Number of files for training in /dataset/train (4000) is more than requested (64). A subset of files will be used
[OUTPUT] 2026-02-28T12:30:28.102857 Max steps per epoch: 8 = 1 * 64 / 4 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-02-28T12:30:28.102992 Steps per eval: 4 = 1 * 8 / 1 / 2 (samples per file * num files / batch size eval / comm size)
[OUTPUT] 2026-02-28T12:30:30.572480 Starting epoch 1: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.573084 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.734535 Ending block 1 - 8 steps completed in 0.16 s
[OUTPUT] 2026-02-28T12:30:30.740906 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.1428
[OUTPUT] 2026-02-28T12:30:30.740994 Epoch 1 - Block 1 [Training] Throughput (samples/second): 1753.1357
[OUTPUT] 2026-02-28T12:30:30.741060 Epoch 1 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.741497 Ending epoch 1 - 8 steps completed in 0.17 s
[OUTPUT] 2026-02-28T12:30:30.742789 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.889307 Ending eval - 4 steps completed in 0.15 s
[OUTPUT] 2026-02-28T12:30:30.891985 Epoch 1 [Eval] Accelerator Utilization [AU] (%): 0.0720
[OUTPUT] 2026-02-28T12:30:30.892054 Epoch 1 [Eval] Throughput (samples/second): 54.6620
[OUTPUT] 2026-02-28T12:30:30.900919 Starting epoch 2: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.901249 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.914273 Ending block 1 - 8 steps completed in 0.01 s
[OUTPUT] 2026-02-28T12:30:30.915472 Epoch 2 - Block 1 [Training] Accelerator Utilization [AU] (%): 1.9055
[OUTPUT] 2026-02-28T12:30:30.915541 Epoch 2 - Block 1 [Training] Throughput (samples/second): 7765.7316
[OUTPUT] 2026-02-28T12:30:30.915595 Epoch 2 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.915931 Ending epoch 2 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.917061 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.958733 Ending eval - 4 steps completed in 0.04 s
[OUTPUT] 2026-02-28T12:30:30.959729 Epoch 2 [Eval] Accelerator Utilization [AU] (%): 0.0381
[OUTPUT] 2026-02-28T12:30:30.959768 Epoch 2 [Eval] Throughput (samples/second): 192.2493
[OUTPUT] 2026-02-28T12:30:30.960091 Starting epoch 3: 8 steps expected
[OUTPUT] 2026-02-28T12:30:30.960275 Starting block 1
[OUTPUT] 2026-02-28T12:30:30.976061 Ending block 1 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.977423 Epoch 3 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.6369
[OUTPUT] 2026-02-28T12:30:30.977483 Epoch 3 - Block 1 [Training] Throughput (samples/second): 6020.3520
[OUTPUT] 2026-02-28T12:30:30.977534 Epoch 3 - Block 1 [Training] Computation time per step (second): 0.0000+/-0.0000 (set value: {})
[OUTPUT] 2026-02-28T12:30:30.977792 Ending epoch 3 - 8 steps completed in 0.02 s
[OUTPUT] 2026-02-28T12:30:30.978884 Starting eval - 4 steps expected
[OUTPUT] 2026-02-28T12:30:30.983803 Ending eval - 4 steps completed in 0.00 s
[OUTPUT] 2026-02-28T12:30:30.984927 Epoch 3 [Eval] Accelerator Utilization [AU] (%): 1.3682
[OUTPUT] 2026-02-28T12:30:30.984986 Epoch 3 [Eval] Throughput (samples/second): 1641.1245
[OUTPUT] 2026-02-28T12:30:30.986010 Saved outputs in /home/denis/dev/enakta/dlio_benchmark/hydra_log/default/2026-02-28-12-30-25
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 2
[METRIC] Training Accelerator Utilization [AU] (%): 0.5939 (0.4129)
[METRIC] Training Throughput (samples/second): 4948.3957 (2466.6534)
[METRIC] Training I/O Throughput (MB/second): 19.3297 (9.6354)
[METRIC] train_au_meet_expectation: fail
[METRIC] Eval Accelerator Utilization [AU] (%): 0.4704 (0.5038)
[METRIC] Eval Throughput (samples/second): 444.414075 (396.070635)
[METRIC] Eval Throughput (MB/second): 1.735992 (1.547151)
[METRIC] eval_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-02-28T12:30:30.987839 outputs saved in RANKID_output.json
```
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
* fix: remove unreachable branch
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
---------
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Co-authored-by: Denis Barakhtanov <denis.barahtanov@gmail.com>
* refactor(generators): unify generators to work with any storage backend (argonne-lcf#329)
Every new storage backend required copy-pasting each generator into an
_XXX sibling file: npz_generator_s3.py, npy_generator_s3.py, and so on.
The only difference was whether the output was written locally on disk,
directly via numpy/PIL, or via the storage interface.
This makes the pattern unsustainable: two duplicated formats today, more
with each new backend, incurring a significant maintenance burden.
Since all generators already hold a storage instance and use it to
generate file names, we can leverage it.
A single set of generators can now check whether the storage is locally
available via `islocalfs` and apply any local optimisations. If the storage
is not local, the generator serializes the sample into an io.BytesIO buffer,
calls buf.getvalue(), and delegates the write to self.storage.put_data().
All storage backends receive plain bytes, as designed by the storage interface,
removing the type inspection, seek(), and getvalue() calls scattered across backends.
- FileStorage.put_data was never called, opened files in text mode, and made
a double get_uri call (once from the generator, once inside put_data itself).
It is now the default write path for LOCAL_FS, used by almost every
workload config. get_data was aligned to binary mode ("rb") for consistency.
- AIStoreStorage.put_data: remove isinstance dispatch, accept bytes directly.
- S3TorchStorage.put_data: remove data.getvalue() — just write data.
- generator_factory: removed S3/AIStore branching for NPZ, NPY, JPEG.
- the factory referenced jpeg_generator_s3.JPEGGeneratorS3, which never
existed; JPEG + S3/AIStore would have crashed at import time.
After this patch, adding a new storage backend requires no changes in any
generator. Adding a new data format automatically works with all backends.
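The unified write path can be sketched as below. `islocalfs` and `put_data` are named in this commit; the NPZ serialization and the dict-backed storage stub are illustrative stand-ins, not the actual DLIO classes.

```python
import io

import numpy as np


class DictStorage:
    """Hypothetical remote backend: per the unified interface,
    it receives plain bytes and does no type inspection."""
    def __init__(self):
        self.objects = {}

    def islocalfs(self) -> bool:
        return False

    def put_data(self, name: str, data: bytes) -> None:
        assert isinstance(data, bytes)  # backends get bytes, nothing else
        self.objects[name] = data


def generate_npz_sample(storage, name, sample):
    """One generator body that works with any storage backend."""
    if storage.islocalfs():
        np.savez(name, x=sample)  # local fast path: write straight to disk
    else:
        buf = io.BytesIO()
        np.savez(buf, x=sample)   # serialize the sample in memory
        storage.put_data(name, buf.getvalue())  # hand plain bytes over


storage = DictStorage()
generate_npz_sample(storage, "train/sample_0.npz", np.arange(4))
```

With this split, a new backend only implements `put_data(name, bytes)`, and every format generator picks it up for free, which is the maintenance win the commit describes.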
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Co-authored-by: Denis Barakhtanov <denis.barahtanov@gmail.com>
---------
Signed-off-by: Abhishek Gaikwad <gaikwadabhishek1997@gmail.com>
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Co-authored-by: Izzet Yildirim <yildirim2@llnl.gov>
Co-authored-by: Abhishek Gaikwad <gaikwadabhishek1997@gmail.com>
Co-authored-by: enakta <140368024+enakta@users.noreply.github.com>
Co-authored-by: Denis Barakhtanov <denis.barahtanov@gmail.com>
File tree: 25 files changed, +954 −155 lines
- .github/workflows
- dlio_benchmark
- common
- data_generator
- profiler
- reader
- storage
- utils
- docs/source
- tests