Commit c44127e

Authored by Pierre Delaunay, pierre.delaunay, and john2
New Inference benchmarks (#382)
* Improve issue reporting (#368)
* Add job push button (#370)
* Add job push button
* Add status helper to avoid rsync when cache is good
* Tweak milabench realtime event tracking
* Add unmerged code
* Tweak to enable perfectly forwarding milabench events to external sources
* Make sure multinode jobs run in the right directory
* Sync with upstream
* Use milabench utilities to create the milabench container_run CLI
* Patch tqdm to avoid flooding logs with meaningless progress updates
* Try to ensure the logs are flushed when an issue happens, to avoid slurm eating some lines
* Add timed log flush
* Tweak timed flush
* Add new inspection routes
* Update script to use faster setup
* New shared_prepare
* Tweak milabench global patch installation
* Add SQL as a valid metric pusher
* Add SSH debug logging for tunnels
* Handle database reconnection gracefully
* Refactor configuration resolution (#376)
* Client/server bench concept
* New vLLM inference benchmark
* New vLLM and Whisper bench
* New inference benchmark for Flux and Whisper
* Flux inference bench
* New text generation benchmark
* Normalize slurm configuration names
* Print a warning if we have no profile set
* fno_bench initial
* Move server code to the dashboard repository
* additional fix:1
* Arg parsing
* Fix some issues with milabench new not replacing some placeholder values
* Tweaks to huggingface environment folders
* Toggle the inference benchmarks on
* Add some TimedIterator tests
* Add some checks to the timed iterator
* Add the option to fetch the first batch in TimedIterator to reduce measurement variation
* Add milabench to the NGC container
* Prepare tweaks
* Refactor to use generic huggingface model and dataset download
* Tweak the prepare script to use no split by default
* Add a new 'all' config
* Force HF_HUB_CACHE to unify behaviour between benchmarks
* Pin dependencies for new benchmarks
* Avoid huggingface for Whisper inference
* Tweak batch sizes
* SPARK tweaks
* vllm sweep concept
* Add a new IPMI monitor that starts and ends with milabench run
* Add new kJ energy-spent estimate
* Put real time attached to cuda event
* Update pin to torch==2.8
* Implement global throughput sampling
* Update benchrun to match the new pytorchrun API
* Add new GPU poll override
* New timeline script to display batch id
* Add energy stats guard against division by zero
* Tweak error report to not break exception trace lines
* More robust stacktrace extraction
* Fix backward compatibility problem with pytorch 2.8 & 2.9
* Update JAX libraries
* Tweaks to support the latest versions of dependencies
* Make sure IP is set for the IPMI monitor
* Tweaks for full run
* -
* Restore time estimate of the rate time
* Update gpu_poll to be float
* Gather tweaks
* New milabench event processor
* Report tweaking
* New reporting functions
* Consolidate configs inside SystemConfig
* Tweak the unified config setting
* Full milabench resume implementation
* Add new dense llm sweep
* Ignore unmerged extension
* Update code to use the new system structure
* Fix for docker not parsing the version file correctly
* Tweak IPMI monitor to be a no-op when not set

---------

Co-authored-by: pierre.delaunay <delaunap@rtx5.server.mila.quebec>
Co-authored-by: john2 <john2@jrlogin08.jureca>
Co-authored-by: john2 <john2@jrlogin09.jureca>
Co-authored-by: Pierre Delaunay <pierre.delaunay@mila.quebec>
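Several of the items above concern TimedIterator: extra checks, tests, and an option to fetch the first batch eagerly to reduce measurement variation. As a hedged sketch of that idea only — the class name comes from the commit message, but the body below is a hypothetical illustration, not milabench's actual implementation — an iterator wrapper that records per-item times and can pull the first batch outside the timed loop could look like:

```python
import time


class TimedIterator:
    """Wrap an iterable and record the elapsed time between items.

    Hypothetical sketch of the idea described in the commit message;
    not the actual milabench implementation.
    """

    def __init__(self, iterable, fetch_first=False):
        self.iterable = iterable
        self.fetch_first = fetch_first
        self.times = []  # seconds elapsed before each timed item

    def __iter__(self):
        it = iter(self.iterable)
        if self.fetch_first:
            # Pull the first batch before starting the clock, so one-time
            # warmup costs (lazy loading, allocation) do not skew the
            # per-batch measurements.
            try:
                yield next(it)
            except StopIteration:
                return
        start = time.perf_counter()
        for item in it:
            now = time.perf_counter()
            self.times.append(now - start)
            start = now
            yield item
```

With `fetch_first=True` the first batch is yielded untimed, so an N-item iterable produces N-1 samples instead of N — trading one data point for lower variance.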
1 parent f528806 commit c44127e

203 files changed: 10773 additions & 6741 deletions

File tree


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -63,3 +63,4 @@ benchmarks/*/src/
 
 *.new.yml
 *.png
+fjobs_*.json

0 commit comments
