Adding Fast Model Actuation (FMA) to WVA's benchmarking framework #988
aavarghese wants to merge 3 commits into llm-d:main
Conversation
force-pushed from 36556c4 to 32fbdda
@aavarghese what is the status of FMA now? How is it actuated? Should we think about an additional actuator that explicitly uses FMA?
```yaml
INSTALL_GRAFANA: "true"
run: make deploy-e2e-infra

- name: Install ko for FMA image builds
```
One requirement is to be able to trigger the benchmark from a laptop. I would probably move this logic to the install.sh script, which is called by "make deploy-e2e-infra".
Moved the ko install, RBAC, gpu-map, and other FMA setup to infra_fma.sh.
From a laptop, for WVA+FMA:

```sh
DEPLOY_FMA=true FMA_REPO_PATH=../llm-d-fast-model-actuation make deploy-e2e-infra
```
@ev-shindin thanks for your look at this PR! FMA is currently preparing for production readiness - the team is going through some final improvements to release what we call a Milestone 3 version to be used by WVA and others...
Signed-off-by: aavarghese <avarghese@us.ibm.com>
force-pushed from 531d107 to d3654f1
Yes, I would like to understand how it will work. For regular scaling (not scale from zero), WVA issues a Prometheus metric and we have HPA/KEDA configured to scale up/down based on this metric. Do we need some additional mechanism to use FMA? Or is it fully transparent for us, i.e. do you have a mechanism that will wake up the pod when HPA/KEDA increases the number of replicas?
FMA is fully transparent to HPA/KEDA. The flow is:
On scale-down: HPA/KEDA reduces replicas → the requesting Pod is deleted → FMA puts the vLLM instance to sleep → the instance stays dormant in the launcher Pod, ready for fast reuse. No additional mechanism is needed. HPA/KEDA scales the requesting Pods as usual, and FMA reacts to the Pod lifecycle events. The only requirement is that the requesting Pods carry the FMA annotation (dual-pods.llm-d.ai/inference-server-config).
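For illustration only, a requesting-Pod template carrying that annotation might look roughly like the sketch below. Everything except the annotation key dual-pods.llm-d.ai/inference-server-config (which comes from this thread) is a hypothetical placeholder:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-requesting            # hypothetical name
spec:
  replicas: 1                      # HPA/KEDA adjusts this as usual
  selector:
    matchLabels:
      app: vllm-requesting
  template:
    metadata:
      labels:
        app: vllm-requesting
      annotations:
        # The annotation FMA watches for; the value here is a placeholder
        dual-pods.llm-d.ai/inference-server-config: "my-server-config"
    spec:
      containers:
        - name: inference
          image: example.com/vllm:latest   # placeholder image
```

When HPA/KEDA deletes one of these Pods, FMA sees the lifecycle event and sleeps the backing vLLM instance instead of tearing it down.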
Adding to what @diegocastanibm said, you can peek here for some quick unofficial FMA numbers.
Summary
Integrate Fast Model Actuation (FMA) deployment and benchmarking into WVA's benchmark framework. FMA provides fast model actuation through pre-provisioned launcher pods that host vLLM instances with sleep/wake functionality. This PR enables measuring FMA actuation time (cold start, warm wake-up, sleeping instance hit rate) alongside WVA benchmarks but in isolation until we get WVA to read metrics from FMA-managed vLLM instances and compute wva_desired_replicas from FMA launcher pods.
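As background for the regular (non-FMA) scaling path discussed in this thread - WVA exposes a Prometheus metric and HPA/KEDA scales on it - a KEDA ScaledObject for that path could look roughly like this sketch. The metric name wva_desired_replicas comes from this PR; the server address, query labels, replica bounds, and threshold are hypothetical placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-wva-scaler                # hypothetical name
spec:
  scaleTargetRef:
    name: vllm-requesting              # hypothetical Deployment of requesting Pods
  minReplicaCount: 0
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090              # placeholder
        query: wva_desired_replicas{deployment="vllm-requesting"}     # placeholder labels
        threshold: "1"
```

Under this setup FMA needs no changes to the ScaledObject itself; it only reacts to the Pod churn that KEDA's scaling decisions produce.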
For more info: https://github.com/llm-d-incubation/llm-d-fast-model-actuation
Test plan
Generated with https://claude.com/claude-code