Commit c44127e
New Inference benchmarks (#382)
* Improve issue reporting (#368)
* Add job push button (#370)
* Add job push button
* Add status helper to avoid rsync when cache is good
* Tweak milabench realtime event tracking
* Add unmerged code
* Tweak to enable perfectly forwarding milabench events to external sources
* Make sure multinode jobs run on the right directory
* Sync with upstream
* Use milabench utilities to create the milabench container_run cli
* Patch tqdm to avoid flooding logs with meaningless progress updates
* Try to ensure the logs are flushed when an issue happens to avoid slurm eating some lines
* Add timed log flush
* Tweak timed flush
* Add new inspection routes
* Update script to use faster setup
* New shared_prepare
* Tweak milabench global patch installation
* Add SQL as a valid metric pusher
* Add SSH debug logging for tunnels
* Handle database reconnection gracefully
* refactor configuration resolution (#376)
* Client Server bench concept
* new vLLM inference benchmark
* new vllm and whisper bench
* New inference benchmark for flux and whisper
* Flux inference bench
* New Text generation benchmark
* Normalize slurm configuration names
* Print a warning if we have no profile set
* fno_bench initial
* Move server code to the dashboard repository
* additional fix:1
* arg parsing
* Fix some issues with milabench new not replacing some placeholder values
* Tweaks to huggingface environment folders
* Toggle the inference benchmarks on
* Added some TimedIterator tests
* Add some checks to the timed iterator
* Add the option to fetch the first batch in TimedIterator to reduce measurement variation
* Adding milabench to ngc container
* Prepare tweaks
* Refactor to use generic huggingface download model and dataset
* Tweak the prepare script to use no split by default
* Add a new 'all' config
* Force HF_HUB_CACHE to unify behaviour between benchmarks
* Pin dependencies for new benchmarks
* Avoid huggingface for Whisper inference
* Tweak batch sizes
* SPARK Tweaks
* vllm sweep concept
* Add a new IPMI monitor that starts and ends with milabench run
* Add new kJ energy spent estimate
* Put real time attached to cuda event
* Updated pin to torch==2.8
* Implement global throughput sampling
* update benchrun to match new pytorchrun API
* Add new GPU Poll override
* New timeline script to display batch id
* Add Energy stats guard on division by zero
* Tweak error report to not break exception trace lines
* More robust stacktrace extraction
* Fix backward compatibility problem with pytorch 2.8 & 2.9
* Update JAX libraries
* Tweaks to support latest version of dependencies
* Make sure IP is set for the IPMI monitor
* Tweaks for full run
* -
* Restore time estimate of the rate time
* update gpu_poll to be float
* Gather tweaks
* New milabench event processor
* report tweaking
* New reporting functions
* Consolidate configs inside SystemConfig
* Tweak the unified config setting
* Full milabench resume implementation
* add new dense llm sweep
* Ignore unmerged extension
* Update code to use the new system structure
* Fix for docker not parsing the version file correctly
* Tweak IPMI monitor to be a no-op when not set
---------
Co-authored-by: pierre.delaunay <delaunap@rtx5.server.mila.quebec>
Co-authored-by: john2 <john2@jrlogin08.jureca>
Co-authored-by: john2 <john2@jrlogin09.jureca>
Co-authored-by: Pierre Delaunay <pierre.delaunay@mila.quebec>
1 parent f528806 commit c44127e
203 files changed
Lines changed: 10773 additions & 6741 deletions
File tree
- .pin
- benchmarks
- _templates
- simple
- stdout
- voir
- accelerate_opt
- brax
- cleanrl_jax
- diffusion
- dinov2
- dlrm
- flops
- fno_benchmark
- geo_gnn
- huggingface
- inference
- lightning
- llama
- llava
- llm
- purejaxrl
- recursiongfn
- retired
- rwkv
- stable_baselines3
- stargan
- super-slomo
- torchatari
- rlhf
- timm
- torchvision_ddp
- torchvision
- vjepa
- vllm
- benchmate
- benchmate
- tests
- config
- clusters
- scaling
- constraints
- docker
- scripts
- milabench
- cli
- commands
- config
- loggers
- metrics
- report
- scripts
- status
- validation
- web
- template
- scripts
- pipeline
- slurm
- tests/test_validation