
Adding Fast Model Actuation (FMA) to WVA's benchmarking framework #988

Draft

aavarghese wants to merge 3 commits into llm-d:main from aavarghese:fmaintegration

Conversation


@aavarghese aavarghese commented Apr 10, 2026

Summary

Integrate Fast Model Actuation (FMA) deployment and benchmarking into WVA's benchmark framework. FMA provides fast model actuation through pre-provisioned launcher pods that host vLLM instances with sleep/wake functionality. This PR enables measuring FMA actuation time (cold start, warm wake-up, sleeping-instance hit rate) alongside WVA benchmarks, but in isolation, until we get WVA to read metrics from FMA-managed vLLM instances and compute wva_desired_replicas from FMA launcher pods.

For more info: https://github.com/llm-d-incubation/llm-d-fast-model-actuation

Test plan

  • make test-benchmark-fma runs FMA actuation benchmark on Kind with emulated GPUs
  • hack/benchmark/run/run_ci_benchmark.sh -f runs on OpenShift with real GPUs
  • /benchmark kind CI workflow deploys FMA and runs both WVA and FMA benchmarks
  • FMA results appear in PR comment table

Generated with https://claude.com/claude-code

@aavarghese aavarghese force-pushed the fmaintegration branch 2 times, most recently from 36556c4 to 32fbdda on April 13, 2026 at 14:12
@ev-shindin (Collaborator)

@aavarghese what is the status of FMA now? How is it actuated? Should we think about an additional actuator that explicitly uses FMA?

Comment thread on .github/workflows/ci-benchmark.yaml (Outdated)
INSTALL_GRAFANA: "true"
run: make deploy-e2e-infra

- name: Install ko for FMA image builds
Collaborator
One requirement is to be able to trigger the benchmark from a laptop. I would probably move this logic to the install.sh script, which is called by "make deploy-e2e-infra".

Author

Moved the ko install, RBAC, gpu-map, and other FMA setup steps to infra_fma.sh.
From a laptop, for WVA+FMA:

DEPLOY_FMA=true FMA_REPO_PATH=../llm-d-fast-model-actuation make deploy-e2e-infra 

@aavarghese (Author)

@aavarghese what is the status of FMA now? How is it actuated? Should we think about an additional actuator that explicitly uses FMA?

@ev-shindin thanks for taking a look at this PR! FMA is currently preparing for production readiness: the team is going through some final improvements to release what we call a Milestone 3 version, to be used by WVA and others...
With FMA, the "actuation" happens when scaling up, i.e. creating a new server-requester pod triggers FMA's controller to bind a launcher pod and either wake up an existing vLLM instance (preferably) or create a new vLLM instance in an existing launcher pod. Launcher pods are pre-provisioned on every node in the cluster, which saves us pod-creation time, etc.
Re: should we have an additional actuator... I'm not sure; do you mean within WVA? Another mechanism for scaling the Deployment when metrics indicate it's needed? I'll have to check with @dumb0002, @diegocastanibm, etc...

Signed-off-by: aavarghese <avarghese@us.ibm.com>
@ev-shindin (Collaborator)

@aavarghese Have another mechanism of scaling the Deployment when metrics indicate it's needed?

Yes, I would like to understand how it will work. For regular scaling (not scale from zero), WVA issues a Prometheus metric, and then we have HPA/KEDA configured to scale up/down based on this metric. Do we need some additional mechanism to use FMA? Or is it fully transparent for us, i.e. do you have a mechanism that will wake up the pod when HPA/KEDA increases the number of replicas?

@diegocastanibm

@aavarghese Have another mechanism of scaling the Deployment when metrics indicate it's needed?

Yes, I would like to understand how it will work. For regular scaling (not scale from zero), WVA issues a Prometheus metric, and then we have HPA/KEDA configured to scale up/down based on this metric. Do we need some additional mechanism to use FMA? Or is it fully transparent for us, i.e. do you have a mechanism that will wake up the pod when HPA/KEDA increases the number of replicas?

FMA is fully transparent for HPA/KEDA. The flow is:

  1. WVA emits wva_desired_replicas metric → HPA/KEDA scales the server-requesting Pod ReplicaSet (increases replicas)
  2. New requesting Pod appears → FMA controller detects it automatically
  3. FMA finds a sleeping vLLM instance (or creates one) and wakes it up via /wake_up

On scale-down: HPA/KEDA reduces replicas → the requesting Pod is deleted → FMA puts the vLLM instance to sleep via its sleep endpoint → the instance stays dormant in the launcher Pod, ready for fast reuse.

No additional mechanism is needed. HPA/KEDA scales the requesting Pods as usual, and FMA reacts to Pod lifecycle events. The only requirement is that the requesting Pods carry the FMA annotation (dual-pods.llm-d.ai/inference-server-config).
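To make that requirement concrete, here is a minimal sketch of what a server-requesting Pod could look like. Only the annotation key comes from this thread; the Pod name, image, and annotation value are hypothetical placeholders, not actual FMA conventions:

```yaml
# Sketch only: a server-requesting Pod that FMA's controller would bind to a
# pre-provisioned launcher Pod. Only the annotation key is from this thread;
# the name, image, and annotation value are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-requester-example                # hypothetical name
  annotations:
    dual-pods.llm-d.ai/inference-server-config: "example-config"   # illustrative value
spec:
  containers:
    - name: requester
      image: registry.example.com/requester:latest   # hypothetical image
```

In the flow above, HPA/KEDA would scale the Deployment/ReplicaSet that templates Pods like this, and FMA would react to each Pod's creation or deletion.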

@aavarghese (Author)

aavarghese commented Apr 15, 2026

@ev-shindin Do we need some additional mechanism to use FMA? Or is it fully transparent for us, i.e. do you have a mechanism that will wake up the pod when HPA/KEDA increases the number of replicas?

Adding to what @diegocastanibm said, you can peek here for some quick, unofficial FMA numbers.
As part of the integration work, the FMA team is looking into how WVA can read metrics from FMA's launchers...

Signed-off-by: aavarghese <avarghese@us.ibm.com>
Signed-off-by: aavarghese <avarghese@us.ibm.com>


4 participants