Commit c44127e

Authored by Pierre Delaunay, pierre.delaunay, and john2
New Inference benchmarks (#382)
* Improve issue reporting (#368)
* Add job push button (#370)
* Add job push button
* Add status helper to avoid rsync when cache is good
* Tweak milabench realtime event tracking
* Add unmerged code
* Tweak to enable perfectly forwarding milabench events to external sources
* Make sure multinode jobs run in the right directory
* Sync with upstream
* Use milabench utilities to create the milabench container_run CLI
* Patch tqdm to avoid flooding logs with meaningless progress updates
* Try to ensure the logs are flushed when an issue happens, to avoid slurm eating some lines
* Add timed log flush
* Tweak timed flush
* Add new inspection routes
* Update script to use faster setup
* New shared_prepare
* Tweak milabench global patch installation
* Add SQL as a valid metric pusher
* Add SSH debug logging for tunnels
* Handle database reconnection gracefully
* Refactor configuration resolution (#376)
* Client/server bench concept
* New vLLM inference benchmark
* New vLLM and Whisper bench
* New inference benchmark for Flux and Whisper
* Flux inference bench
* New text generation benchmark
* Normalize slurm configuration names
* Print a warning if we have no profile set
* fno_bench initial
* Move server code to the dashboard repository
* additional fix:1
* Arg parsing
* Fix some issues with milabench new not replacing some placeholder values
* Tweaks to huggingface environment folders
* Toggle the inference benchmarks on
* Add some TimedIterator tests
* Add some checks to the timed iterator
* Add the option to fetch the first batch in TimedIterator to reduce measurement variation
* Add milabench to the NGC container
* Prepare tweaks
* Refactor to use generic huggingface model and dataset download
* Tweak the prepare script to use no split by default
* Add a new 'all' config
* Force HF_HUB_CACHE to unify behaviour between benchmarks
* Pin dependencies for new benchmarks
* Avoid huggingface for Whisper inference
* Tweak batch sizes
* SPARK tweaks
* vllm sweep concept
* Add a new IPMI monitor that starts and ends with milabench run
* Add new kJ energy-spent estimate
* Put real time attached to cuda event
* Update pin to torch==2.8
* Implement global throughput sampling
* Update benchrun to match the new pytorchrun API
* Add new GPU poll override
* New timeline script to display batch id
* Add energy stats guard against division by zero
* Tweak error report to not break exception trace lines
* More robust stacktrace extraction
* Fix backward compatibility problem with pytorch 2.8 & 2.9
* Update JAX libraries
* Tweaks to support the latest versions of dependencies
* Make sure IP is set for the IPMI monitor
* Tweaks for full run
* -
* Restore time estimate of the rate time
* Update gpu_poll to be float
* Gather tweaks
* New milabench event processor
* Report tweaking
* New reporting functions
* Consolidate configs inside SystemConfig
* Tweak the unified config setting
* Full milabench resume implementation
* Add new dense llm sweep
* Ignore unmerged extension
* Update code to use the new system structure
* Fix for docker not parsing the version file correctly
* Tweak IPMI monitor to be a no-op when not set

---------

Co-authored-by: pierre.delaunay <delaunap@rtx5.server.mila.quebec>
Co-authored-by: john2 <john2@jrlogin08.jureca>
Co-authored-by: john2 <john2@jrlogin09.jureca>
Co-authored-by: Pierre Delaunay <pierre.delaunay@mila.quebec>
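Several of the items above concern TimedIterator: extra checks, tests, and an option to fetch the first batch eagerly to reduce measurement variation. As a hedged sketch of that idea only — the class name comes from the commit message, but the body below is a hypothetical illustration, not milabench's actual implementation — an iterator wrapper that records per-item times and can pull the first batch outside the timed loop could look like:

```python
import time


class TimedIterator:
    """Wrap an iterable and record the elapsed time between items.

    Hypothetical sketch of the idea described in the commit message;
    not the actual milabench implementation.
    """

    def __init__(self, iterable, fetch_first=False):
        self.iterable = iterable
        self.fetch_first = fetch_first
        self.times = []  # seconds elapsed before each timed item

    def __iter__(self):
        it = iter(self.iterable)
        if self.fetch_first:
            # Pull the first batch before starting the clock, so one-time
            # warmup costs (lazy loading, allocation) do not skew the
            # per-batch measurements.
            try:
                yield next(it)
            except StopIteration:
                return
        start = time.perf_counter()
        for item in it:
            now = time.perf_counter()
            self.times.append(now - start)
            start = now
            yield item
```

With `fetch_first=True` the first batch is yielded untimed, so an N-item iterable produces N-1 samples instead of N — trading one data point for lower variance.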
1 parent f528806 commit c44127e

203 files changed: 10773 additions & 6741 deletions

File tree


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -63,3 +63,4 @@ benchmarks/*/src/
 
 *.new.yml
 *.png
+fjobs_*.json

0 commit comments
