
Adding Fast Model Actuation (FMA) to WVA's benchmarking framework #988

Draft

aavarghese wants to merge 3 commits into llm-d:main from aavarghese:fmaintegration

Conversation


@aavarghese aavarghese commented Apr 10, 2026

Summary

Integrate Fast Model Actuation (FMA) deployment and benchmarking into WVA's benchmark framework. FMA provides fast model actuation through pre-provisioned launcher pods that host vLLM instances with sleep/wake functionality. This PR enables measuring FMA actuation time (cold start, warm wake-up, sleeping-instance hit rate) alongside WVA benchmarks, but in isolation, until we get WVA to read metrics from FMA-managed vLLM instances and compute wva_desired_replicas from FMA launcher pods.

For more info: https://github.com/llm-d-incubation/llm-d-fast-model-actuation

Test plan

  • make test-benchmark-fma runs FMA actuation benchmark on Kind with emulated GPUs
  • hack/benchmark/run/run_ci_benchmark.sh -f runs on OpenShift with real GPUs
  • /benchmark kind CI workflow deploys FMA and runs both WVA and FMA benchmarks
  • FMA results appear in PR comment table

Generated with https://claude.com/claude-code

@aavarghese aavarghese force-pushed the fmaintegration branch 2 times, most recently from 36556c4 to 32fbdda on April 13, 2026 at 14:12
@ev-shindin (Collaborator)

@aavarghese what is the status of FMA now? How is it actuated? Should we think about an additional actuator that explicitly uses FMA?

Comment thread on .github/workflows/ci-benchmark.yaml (Outdated)
INSTALL_GRAFANA: "true"
run: make deploy-e2e-infra

- name: Install ko for FMA image builds
Collaborator
One requirement is to be able to trigger the benchmark from a laptop. I would probably move this logic to the install.sh script, which is called by "make deploy-e2e-infra".

Author

Moved the ko install, RBAC, gpu-map, and other FMA setup steps to infra_fma.sh.
From a laptop, for WVA+FMA:

DEPLOY_FMA=true FMA_REPO_PATH=../llm-d-fast-model-actuation make deploy-e2e-infra 

@aavarghese (Author)

@aavarghese what is the status of FMA now? How is it actuated? Should we think about an additional actuator that explicitly uses FMA?

@ev-shindin thanks for taking a look at this PR! FMA is currently preparing for production readiness: the team is going through some final improvements to release what we call a Milestone 3 version, to be used by WVA and others...
With FMA, the "actuation" happens when scaling up, i.e. creating a new server-requester pod triggers FMA's controller to bind a launcher pod and either wake up an existing vLLM instance (preferably) or create a new vLLM instance in an existing launcher pod. Launcher pods are pre-provisioned on every node in the cluster, which saves us pod-creation time, etc.
Re: should we have an additional actuator... I'm not sure; do you mean within WVA? Another mechanism for scaling the Deployment when metrics indicate it's needed? I'll have to check with @dumb0002, @diegocastanibm, etc...

Signed-off-by: aavarghese <avarghese@us.ibm.com>
@ev-shindin (Collaborator)

@aavarghese Have another mechanism of scaling the Deployment when metrics indicate it's needed?

Yes, I would like to understand how it will work. For regular scaling (not scale from zero), WVA issues a Prometheus metric, and then we have HPA/KEDA configured to scale up/down based on this metric. Do we need some additional mechanism to use FMA? Or is it fully transparent for us, i.e. do you have a mechanism that will wake up the pod when HPA/KEDA increases the number of replicas?

@diegocastanibm

@aavarghese Have another mechanism of scaling the Deployment when metrics indicate it's needed?

Yes, I would like to understand how it will work. For regular scaling (not scale from zero), WVA issues a Prometheus metric, and then we have HPA/KEDA configured to scale up/down based on this metric. Do we need some additional mechanism to use FMA? Or is it fully transparent for us, i.e. do you have a mechanism that will wake up the pod when HPA/KEDA increases the number of replicas?

FMA is fully transparent for HPA/KEDA. The flow is:

  1. WVA emits wva_desired_replicas metric → HPA/KEDA scales the server-requesting Pod ReplicaSet (increases replicas)
  2. New requesting Pod appears → FMA controller detects it automatically
  3. FMA finds a sleeping vLLM instance (or creates one) and wakes it up via /wake_up

On scale-down: HPA/KEDA reduces replicas → the requesting Pod is deleted → FMA puts the vLLM instance to sleep via its sleep endpoint → the instance stays dormant in the launcher Pod, ready for fast reuse.

No additional mechanism is needed. HPA/KEDA scales the requesting Pods as usual, and FMA reacts to Pod lifecycle events. The only requirement is that the requesting Pods carry the FMA annotation (dual-pods.llm-d.ai/inference-server-config).
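To make that requirement concrete, here is a minimal sketch of what a server-requesting Pod could look like. Only the annotation key comes from this thread; the Pod name, image, and annotation value are hypothetical placeholders, not actual FMA conventions:

```yaml
# Sketch only: a server-requesting Pod that FMA's controller would bind to a
# pre-provisioned launcher Pod. Only the annotation key is from this thread;
# the name, image, and annotation value are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-requester-example                # hypothetical name
  annotations:
    dual-pods.llm-d.ai/inference-server-config: "example-config"   # illustrative value
spec:
  containers:
    - name: requester
      image: registry.example.com/requester:latest   # hypothetical image
```

In the flow above, HPA/KEDA would scale the Deployment/ReplicaSet that templates Pods like this, and FMA would react to each Pod's creation or deletion.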

@aavarghese (Author)

aavarghese commented Apr 15, 2026

@ev-shindin Do we need some additional mechanism to use FMA? Or is it fully transparent for us, i.e. do you have a mechanism that will wake up the pod when HPA/KEDA increases the number of replicas?

Adding to what @diegocastanibm said, you can peek here for some quick, unofficial FMA numbers.
As part of the integration work, the FMA team is looking into how WVA can read metrics from FMA's launchers...

Signed-off-by: aavarghese <avarghese@us.ibm.com>
Signed-off-by: aavarghese <avarghese@us.ibm.com>


4 participants