[Docs] Add optimizer-based autoscaling doc and examples (#692)
* add optimizer-based autoscaling doc and examples

* update doc structure and simplify doc and yaml

* Fix an RST syntax issue

---------

Co-authored-by: Jiaxin Shan <[email protected]>
nwangfw and Jeffwan authored Feb 18, 2025
1 parent a2c89c9 commit 00d102c
Showing 9 changed files with 169 additions and 18 deletions.
@@ -18,7 +18,7 @@ spec:
path: /metrics/default/deepseek-llm-7b-chat
protocolType: http
targetMetric: vllm:deployment_replicas
targetValue: "1"
targetValue: "100"
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
(2 files could not be displayed in the diff view.)
14 changes: 14 additions & 0 deletions docs/source/features/autoscaling/autoscaling.rst
@@ -0,0 +1,14 @@
===========
Autoscaling
===========

Overview of AIBrix Autoscaler
-----------------------------

Autoscaling is crucial for deploying Large Language Model (LLM) services on Kubernetes (K8s), as timely scaling up handles peaks in request traffic, and scaling down conserves resources when demand wanes.

.. toctree::
   :maxdepth: 1

   metric-based-autoscaling
   optimizer-based-autoscaling
@@ -1,15 +1,11 @@
.. _autoscaling:
.. _metric-based-autoscaling:

===========
Autoscaling
===========
===========================
Metric-based Autoscaling
===========================

Overview of AIBrix Autoscaler
-----------------------------

Autoscaling is crucial for deploying Large Language Model (LLM) services on Kubernetes (K8s), as timely scaling up handles peaks in request traffic, and scaling down conserves resources when demand wanes.

AIBrix Autoscaler includes various autoscaling components, allowing users to conveniently select the appropriate scaler. These options include the Knative-based Kubernetes Pod Autoscaler (KPA), the native Kubernetes Horizontal Pod Autoscaler (HPA), and AIBrix’s custom Advanced Pod Autoscaler (APA) tailored for LLM-serving.
AIBrix Autoscaler includes various metric-based autoscaling components, allowing users to conveniently select the appropriate scaler. These options include the Knative-based Kubernetes Pod Autoscaler (KPA), the native Kubernetes Horizontal Pod Autoscaler (HPA), and AIBrix’s custom Advanced Pod Autoscaler (APA) tailored for LLM-serving.

In the following sections, we will demonstrate how users can create various types of autoscalers within AIBrix.

@@ -49,20 +45,20 @@ All the sample files can be found in the following directory.
Example HPA yaml config
^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../samples/autoscaling/hpa.yaml
.. literalinclude:: ../../../../samples/autoscaling/hpa.yaml
:language: yaml

Example KPA yaml config
^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../samples/autoscaling/kpa.yaml
.. literalinclude:: ../../../../samples/autoscaling/kpa.yaml
:language: yaml


Example APA yaml config
^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../samples/autoscaling/apa.yaml
.. literalinclude:: ../../../../samples/autoscaling/apa.yaml
:language: yaml


@@ -81,7 +77,7 @@ check its logs in this way.
Expected log output. You can see the current metric is gpu_cache_usage_perc. You can check each pod's current metric value.

.. image:: ../assets/images/autoscaler/aibrix-controller-manager-output.png
.. image:: ../../assets/images/autoscaler/aibrix-controller-manager-output.png
:alt: AiBrix controller manager output
:width: 100%
:align: center
@@ -98,7 +94,7 @@ To describe the PodAutoscaler custom resource, you can run
Example output is shown below; you can explore the scaling conditions and events for more details.

.. image:: ../assets/images/autoscaler/podautoscaler-describe.png
.. image:: ../../assets/images/autoscaler/podautoscaler-describe.png
:alt: PodAutoscaler describe
:width: 100%
:align: center
@@ -132,7 +128,7 @@ In AiBrix, user can easily deploy different autoscaler by simply applying K8s ya
- There is no single autoscaler that outperforms the others on all metrics (latency, cost), and the results may depend on the workload. The infrastructure should make it easy to configure whichever autoscaling mechanism users want, since different users have different preferences. For example, one might prefer cost over performance, or vice versa.


.. image:: ../assets/images/autoscaler/autoscaling_result.png
.. image:: ../../assets/images/autoscaler/autoscaling_result.png
:alt: result
:width: 70%
:align: center
117 changes: 117 additions & 0 deletions docs/source/features/autoscaling/optimizer-based-autoscaling.rst
@@ -0,0 +1,117 @@
.. _optimizer-based-autoscaling:

==========================
Optimizer-based Autoscaler
==========================

Overview
--------

The optimizer-based autoscaler is a proactive autoscaling solution that uses offline GPU profiles, rather than GPU usage metrics, to calculate the number of GPUs the deployment needs. It consists of (1) LLM Request Monitoring and (2) the GPU Optimizer. The following figure shows the overall architecture. The LLM Request Monitoring component tracks past inference requests and their request patterns; the GPU Optimizer then calculates the optimal number of GPUs from those patterns and sends the recommendation to the K8s KPA.


.. figure:: ../../assets/images/autoscaler/optimizer-based-podautoscaler.png
   :alt: optimizer-based-podautoscaler
   :width: 100%
   :align: center
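
Before moving on, it can help to confirm that the GPU Optimizer component is running. This is only a hedged sanity check, assuming a default installation in the ``aibrix-system`` namespace; the resource names follow the examples used later in this document.

.. code-block:: bash

   # Sanity check: is the GPU Optimizer running and reachable?
   # (Assumes the default aibrix-system namespace.)
   kubectl -n aibrix-system get pods | grep gpu-optimizer
   kubectl -n aibrix-system get svc aibrix-gpu-optimizer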


How It Works
------------------------------------------

Step 1: Offline GPU-Model Benchmark per Input-Output Pattern
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Benchmark the model. For each GPU type, run ``aibrix_benchmark``. See `benchmark.sh <https://github.com/aibrix/aibrix/tree/main/python/aibrix/aibrix/gpu_optimizer/optimizer/profiling/benchmark.sh>`_ for more options.

.. code-block:: bash

   kubectl port-forward [pod_name] 8010:8000 1>/dev/null 2>&1 &
   # Wait for the port-forward to take effect.
   aibrix_benchmark -m deepseek-llm-7b-chat -o [path_to_benchmark_output]

Step 2: Decide SLO and Generate the GPU-Model Profile
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run ``aibrix_gen_profile -h`` for help.

.. code-block:: bash

   kubectl -n aibrix-system port-forward svc/aibrix-redis-master 6379:6379 1>/dev/null 2>&1 &
   # Wait for the port-forward to take effect.
   aibrix_gen_profile deepseek-llm-7b-chat-v100 --cost [cost1] [SLO-metric] [SLO-value] -o "redis://localhost:6379/?model=deepseek-llm-7b-chat"

Now the GPU Optimizer is ready to work. You should observe the number of workload pods change in response to the requests sent to the gateway. Once the GPU Optimizer finishes its scaling optimization, its output is passed to the PodAutoscaler as a metricSource, via a designated HTTP endpoint, for the final scaling decision. An example PodAutoscaler spec is shown later in this document.
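
To see this in action, you can drive some traffic through the gateway and watch the workload pods. The snippet below is only an illustrative sketch: the bracketed names are placeholders for your own gateway service and deployment, and the request payload assumes an OpenAI-compatible chat completions endpoint.

.. code-block:: bash

   # Illustrative only: port-forward the gateway (placeholder service/namespace)
   # and send a burst of requests so the GPU Optimizer has traffic to observe.
   kubectl -n [gateway_namespace] port-forward svc/[gateway_service] 8888:80 1>/dev/null 2>&1 &
   for i in $(seq 1 20); do
     curl -s http://localhost:8888/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "deepseek-llm-7b-chat", "messages": [{"role": "user", "content": "hello"}]}' > /dev/null
   done
   # Watch the workload deployment scale in response.
   kubectl get deployment [deployment_name] -w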



The optimizer-based autoscaler decides the number of GPUs based on offline GPU capacity profiling. It proactively calculates the overall capacity needed to serve requests under the SLO and ensures that GPU capacity is fully used but not overloaded. The GPU Optimizer's output is exposed as custom metrics, which can be checked as follows.

.. code-block:: bash

   kubectl port-forward svc/aibrix-gpu-optimizer 8080:8080
   curl http://localhost:8080/metrics/default/deepseek-llm-7b-chat-v100
   # HELP vllm:deployment_replicas Number of suggested replicas.
   # TYPE vllm:deployment_replicas gauge
   vllm:deployment_replicas{model_name="deepseek-llm-7b-chat"} 1

How to deploy the autoscaling object
-------------------------------------

Deploying is simply a matter of applying the PodAutoscaler yaml file. The GPU Optimizer exposes custom metrics that PodAutoscalers can use to make scaling decisions, as explained above. One important thing to note is that the deployment name and the name in ``scaleTargetRef`` of the PodAutoscaler must be the same; that is how the AIBrix PodAutoscaler refers to the right deployment.

All the sample files can be found in the following directory.

.. code-block:: bash

   https://github.com/aibrix/aibrix/tree/main/samples/autoscaling

Example Optimizer-based KPA yaml config
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../../samples/autoscaling/optimizer-kpa.yaml
   :language: yaml
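
As a hedged usage sketch, the sample above can be applied and inspected as follows; the file path and resource names come from the sample, so adjust them for your own deployment.

.. code-block:: bash

   # Apply the sample PodAutoscaler and confirm that the deployment referenced
   # by scaleTargetRef exists under exactly the same name.
   kubectl apply -f samples/autoscaling/optimizer-kpa.yaml
   kubectl get deployment deepseek-r1-distill-llama-8b
   # The resource name may be "podautoscalers" depending on how the CRD is registered.
   kubectl describe podautoscaler deepseek-r1-distill-llama-8b-optimizer-scaling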




Check autoscaling logs
----------------------

GPU optimizer Logs
^^^^^^^^^^^^^^^^^^^

The GPU Optimizer is an individual component that collects metrics from each pod. You can check its logs with ``kubectl logs <aibrix-gpu-optimizer-podname> -n aibrix-system -f``.

.. code-block:: bash

   {"time": "2025-02-12 06:23:52,086", "level": "INFO", "logger": "aibrix.gpu_optimizer.load_monitor", "message": "deepseek-llm-7b-chat optimization took 6.660938262939453 ms, cost $51.3324, coverage: 72.62180974477958%: [deepseek-llm-7b-chat-v100: 2($51.3324)]"}

In the log above, the GPU Optimizer reports the suggested number of GPUs, which is 2 in this example.
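
To cross-check the suggestion against the live state, you can compare the optimizer's metric with the current replica count. This is only a sketch: it assumes the metrics port-forward from the earlier example is still active, and the bracketed names are placeholders.

.. code-block:: bash

   # Suggested replicas, as reported by the GPU Optimizer metrics endpoint.
   curl -s http://localhost:8080/metrics/default/[model_profile_name] | grep deployment_replicas
   # Actual replicas of the target deployment.
   kubectl get deployment [deployment_name] -o jsonpath='{.status.replicas}'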


Preliminary experiments with different autoscalers
--------------------------------------------------

Here we present preliminary experiment results showing how different autoscaling mechanisms and autoscaler configurations affect performance (latency) and cost (compute cost).

- Set up

  - Model: Deepseek 7B chatbot model
  - GPU type: V100
  - Maximum number of GPUs: 8
  - HPA, KPA, and APA use ``gpu_cache_usage_perc`` as the scaling metric, with a target value of 70.
  - Optimizer-based KPA SLO: E2E P99 latency of 100s

- Workload

  - The overall RPS trend starts low and rises relatively quickly until T=500 to evaluate how each autoscaler and configuration reacts to a rapid load increase. It then drops quickly back to a low RPS to evaluate scale-down behavior, and finally rises again slowly (a rough traffic-replay sketch follows this list).
  - Average RPS trend: 0.5 RPS -> 2 RPS -> 4 RPS -> 5 RPS -> 1 RPS -> 3 RPS
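
The following is a rough, illustrative way to replay such a stepped RPS trend with plain curl; the real experiments used a dedicated workload generator, and the gateway endpoint, payload, and 300-second step length here are assumptions.

.. code-block:: bash

   # Illustrative traffic replay: hold each RPS level for 5 minutes.
   for rps in 0.5 2 4 5 1 3; do
     end=$((SECONDS + 300))
     while [ "$SECONDS" -lt "$end" ]; do
       curl -s http://localhost:8888/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{"model": "deepseek-llm-7b-chat", "messages": [{"role": "user", "content": "hi"}]}' > /dev/null &
       sleep "$(awk "BEGIN{print 1/$rps}")"
     done
   done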


Experiment Results
^^^^^^^^^^^^^^^^^^

- gpu_cache_usage_perc: 70

.. image:: ../../assets/images/autoscaler/optimizer-based-autoscaling-70-results.png
   :alt: result
   :width: 720px
   :align: center
2 changes: 1 addition & 1 deletion docs/source/features/lora-dynamic-loading.rst
@@ -199,7 +199,7 @@ This is used to build the abstraction between controller manager and inference e
and remove the ``

.. code-block:: yaml
:emphasis-lines: 7
:emphasize-lines: 7
spec:
containers:
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -40,7 +40,7 @@ Documentation
features/gateway-plugins.rst
features/multi-node-inference.rst
features/heterogeneous-gpu.rst
features/autoscaling.rst
features/autoscaling/autoscaling.rst
features/runtime.rst
features/distributed-kv-cache.rst

24 changes: 24 additions & 0 deletions samples/autoscaling/optimizer-kpa.yaml
@@ -0,0 +1,24 @@
apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: deepseek-r1-distill-llama-8b-optimizer-scaling
  namespace: default
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
    kpa.autoscaling.aibrix.ai/scale-down-delay: 0s
spec:
  scalingStrategy: KPA
  minReplicas: 1
  maxReplicas: 8
  metricsSources:
    - endpoint: aibrix-gpu-optimizer.aibrix-system.svc.cluster.local:8080
      metricSourceType: domain
      path: /metrics/default/deepseek-r1-distill-llama-8b
      protocolType: http
      targetMetric: vllm:deployment_replicas
      targetValue: "100"
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-r1-distill-llama-8b
