[Docs] Add optimizer-based autoscaling doc and examples (#692)
* add optimizer-based autoscaling doc and examples

* update doc structure and simplify doc and yaml

* Fix an RST syntax issue

---------

Co-authored-by: Jiaxin Shan <[email protected]>
nwangfw and Jeffwan authored Feb 18, 2025
1 parent a2c89c9 commit 00d102c
Showing 9 changed files with 169 additions and 18 deletions.
@@ -18,7 +18,7 @@ spec:
path: /metrics/default/deepseek-llm-7b-chat
protocolType: http
targetMetric: vllm:deployment_replicas
targetValue: "1"
targetValue: "100"
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
(2 files could not be displayed in the diff view.)
14 changes: 14 additions & 0 deletions docs/source/features/autoscaling/autoscaling.rst
@@ -0,0 +1,14 @@
===========
Autoscaling
===========

Overview of AIBrix Autoscaler
-----------------------------

Autoscaling is crucial for deploying Large Language Model (LLM) services on Kubernetes (K8s), as timely scaling up handles peaks in request traffic, and scaling down conserves resources when demand wanes.

.. toctree::
   :maxdepth: 1

   metric-based-autoscaling
   optimizer-based-autoscaling
@@ -1,15 +1,11 @@
.. _autoscaling:
.. _metric-based-autoscaling:

===========
Autoscaling
===========
===========================
Metric-based Autoscaling
===========================

Overview of AIBrix Autoscaler
-----------------------------

Autoscaling is crucial for deploying Large Language Model (LLM) services on Kubernetes (K8s), as timely scaling up handles peaks in request traffic, and scaling down conserves resources when demand wanes.

AIBrix Autoscaler includes various autoscaling components, allowing users to conveniently select the appropriate scaler. These options include the Knative-based Kubernetes Pod Autoscaler (KPA), the native Kubernetes Horizontal Pod Autoscaler (HPA), and AIBrix’s custom Advanced Pod Autoscaler (APA) tailored for LLM-serving.
AIBrix Autoscaler includes various metric-based autoscaling components, allowing users to conveniently select the appropriate scaler. These options include the Knative-based Kubernetes Pod Autoscaler (KPA), the native Kubernetes Horizontal Pod Autoscaler (HPA), and AIBrix’s custom Advanced Pod Autoscaler (APA) tailored for LLM-serving.

In the following sections, we will demonstrate how users can create various types of autoscalers within AIBrix.

@@ -49,20 +45,20 @@ All the sample files can be found in the following directory.
Example HPA yaml config
^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../samples/autoscaling/hpa.yaml
.. literalinclude:: ../../../../samples/autoscaling/hpa.yaml
:language: yaml

Example KPA yaml config
^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../samples/autoscaling/kpa.yaml
.. literalinclude:: ../../../../samples/autoscaling/kpa.yaml
:language: yaml


Example APA yaml config
^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../samples/autoscaling/apa.yaml
.. literalinclude:: ../../../../samples/autoscaling/apa.yaml
:language: yaml


@@ -81,7 +77,7 @@ check its logs in this way.
Expected log output. You can see the current metric is gpu_cache_usage_perc. You can check each pod's current metric value.

.. image:: ../assets/images/autoscaler/aibrix-controller-manager-output.png
.. image:: ../../assets/images/autoscaler/aibrix-controller-manager-output.png
:alt: AiBrix controller manager output
:width: 100%
:align: center
@@ -98,7 +94,7 @@ To describe the PodAutoscaler custom resource, you can run
Example output is shown below; you can explore the scaling conditions and events for more details.

.. image:: ../assets/images/autoscaler/podautoscaler-describe.png
.. image:: ../../assets/images/autoscaler/podautoscaler-describe.png
:alt: PodAutoscaler describe
:width: 100%
:align: center
@@ -132,7 +128,7 @@ In AiBrix, user can easily deploy different autoscaler by simply applying K8s ya
- There is no single autoscaler that outperforms the others on all metrics (latency, cost), and the results may depend on the workload. The infrastructure should make it easy to configure whichever autoscaling mechanism users want, since different users have different preferences. For example, one might prefer cost over performance, or vice versa.


.. image:: ../assets/images/autoscaler/autoscaling_result.png
.. image:: ../../assets/images/autoscaler/autoscaling_result.png
:alt: result
:width: 70%
:align: center
117 changes: 117 additions & 0 deletions docs/source/features/autoscaling/optimizer-based-autoscaling.rst
@@ -0,0 +1,117 @@
.. _optimizer-based-autoscaling:

==========================
Optimizer-based Autoscaler
==========================

Overview
--------

The optimizer-based autoscaler is a proactive autoscaling solution that uses offline GPU profiles, rather than GPU usage metrics, to calculate the number of GPUs the deployment needs. It consists of (1) LLM Request Monitoring and (2) the GPU Optimizer. The following figure shows the overall architecture. The LLM Request Monitoring component tracks past inference requests and their request patterns; the GPU Optimizer then calculates the optimal number of GPUs from those patterns and sends the recommendation to the K8s KPA.


.. figure:: ../../assets/images/autoscaler/optimizer-based-podautoscaler.png
   :alt: optimizer-based-podautoscaler
   :width: 100%
   :align: center
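
Before moving on, it can help to confirm that the GPU Optimizer component is running. This is only a hedged sanity check, assuming a default installation in the ``aibrix-system`` namespace; the resource names follow the examples used later in this document.

.. code-block:: bash

   # Sanity check: is the GPU Optimizer running and reachable?
   # (Assumes the default aibrix-system namespace.)
   kubectl -n aibrix-system get pods | grep gpu-optimizer
   kubectl -n aibrix-system get svc aibrix-gpu-optimizer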


How It Works
------------------------------------------

Step 1: Offline GPU-Model Benchmark per Input-Output Pattern
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Benchmark the model. For each GPU type, run ``aibrix_benchmark``. See `benchmark.sh <https://github.com/aibrix/aibrix/tree/main/python/aibrix/aibrix/gpu_optimizer/optimizer/profiling/benchmark.sh>`_ for more options.

.. code-block:: bash

   kubectl port-forward [pod_name] 8010:8000 1>/dev/null 2>&1 &
   # Wait for the port-forward to take effect.
   aibrix_benchmark -m deepseek-llm-7b-chat -o [path_to_benchmark_output]

Step 2: Decide SLO and Generate the GPU-Model Profile
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run ``aibrix_gen_profile -h`` for help.

.. code-block:: bash

   kubectl -n aibrix-system port-forward svc/aibrix-redis-master 6379:6379 1>/dev/null 2>&1 &
   # Wait for the port-forward to take effect.
   aibrix_gen_profile deepseek-llm-7b-chat-v100 --cost [cost1] [SLO-metric] [SLO-value] -o "redis://localhost:6379/?model=deepseek-llm-7b-chat"

Now the GPU Optimizer is ready to work. You should observe the number of workload pods change in response to the requests sent to the gateway. Once the GPU Optimizer finishes its scaling optimization, its output is passed to the PodAutoscaler as a metricSource, via a designated HTTP endpoint, for the final scaling decision. An example PodAutoscaler spec is shown later in this document.
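
To see this in action, you can drive some traffic through the gateway and watch the workload pods. The snippet below is only an illustrative sketch: the bracketed names are placeholders for your own gateway service and deployment, and the request payload assumes an OpenAI-compatible chat completions endpoint.

.. code-block:: bash

   # Illustrative only: port-forward the gateway (placeholder service/namespace)
   # and send a burst of requests so the GPU Optimizer has traffic to observe.
   kubectl -n [gateway_namespace] port-forward svc/[gateway_service] 8888:80 1>/dev/null 2>&1 &
   for i in $(seq 1 20); do
     curl -s http://localhost:8888/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "deepseek-llm-7b-chat", "messages": [{"role": "user", "content": "hello"}]}' > /dev/null
   done
   # Watch the workload deployment scale in response.
   kubectl get deployment [deployment_name] -w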



The optimizer-based autoscaler decides the number of GPUs based on offline GPU capacity profiling. It proactively calculates the overall capacity needed to serve requests under the SLO and ensures that GPU capacity is fully used but not overloaded. The GPU Optimizer's output is exposed as custom metrics, which can be checked as follows.

.. code-block:: bash

   kubectl port-forward svc/aibrix-gpu-optimizer 8080:8080
   curl http://localhost:8080/metrics/default/deepseek-llm-7b-chat-v100
   # HELP vllm:deployment_replicas Number of suggested replicas.
   # TYPE vllm:deployment_replicas gauge
   vllm:deployment_replicas{model_name="deepseek-llm-7b-chat"} 1

How to deploy the autoscaling object
-------------------------------------

Deploying is simply a matter of applying the PodAutoscaler yaml file. The GPU Optimizer exposes custom metrics that PodAutoscalers can use to make scaling decisions, as explained above. One important thing to note is that the deployment name and the name in ``scaleTargetRef`` of the PodAutoscaler must be the same; that is how the AIBrix PodAutoscaler refers to the right deployment.

All the sample files can be found in the following directory.

.. code-block:: bash

   https://github.com/aibrix/aibrix/tree/main/samples/autoscaling

Example Optimizer-based KPA yaml config
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../../samples/autoscaling/optimizer-kpa.yaml
   :language: yaml
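
As a hedged usage sketch, the sample above can be applied and inspected as follows; the file path and resource names come from the sample, so adjust them for your own deployment.

.. code-block:: bash

   # Apply the sample PodAutoscaler and confirm that the deployment referenced
   # by scaleTargetRef exists under exactly the same name.
   kubectl apply -f samples/autoscaling/optimizer-kpa.yaml
   kubectl get deployment deepseek-r1-distill-llama-8b
   # The resource name may be "podautoscalers" depending on how the CRD is registered.
   kubectl describe podautoscaler deepseek-r1-distill-llama-8b-optimizer-scaling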




Check autoscaling logs
----------------------

GPU optimizer Logs
^^^^^^^^^^^^^^^^^^^

The GPU Optimizer is an individual component that collects metrics from each pod. You can check its logs with ``kubectl logs <aibrix-gpu-optimizer-podname> -n aibrix-system -f``.

.. code-block:: bash

   {"time": "2025-02-12 06:23:52,086", "level": "INFO", "logger": "aibrix.gpu_optimizer.load_monitor", "message": "deepseek-llm-7b-chat optimization took 6.660938262939453 ms, cost $51.3324, coverage: 72.62180974477958%: [deepseek-llm-7b-chat-v100: 2($51.3324)]"}

In the log above, the GPU Optimizer reports the suggested number of GPUs, which is 2 in this example.
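
To cross-check the suggestion against the live state, you can compare the optimizer's metric with the current replica count. This is only a sketch: it assumes the metrics port-forward from the earlier example is still active, and the bracketed names are placeholders.

.. code-block:: bash

   # Suggested replicas, as reported by the GPU Optimizer metrics endpoint.
   curl -s http://localhost:8080/metrics/default/[model_profile_name] | grep deployment_replicas
   # Actual replicas of the target deployment.
   kubectl get deployment [deployment_name] -o jsonpath='{.status.replicas}'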


Preliminary experiments with different autoscalers
--------------------------------------------------

Here we present preliminary experiment results showing how different autoscaling mechanisms and autoscaler configurations affect performance (latency) and cost (compute cost).

- Set up

  - Model: Deepseek 7B chatbot model
  - GPU type: V100
  - Maximum number of GPUs: 8
  - HPA, KPA, and APA use ``gpu_cache_usage_perc`` as the scaling metric, with a target value of 70.
  - Optimizer-based KPA SLO: E2E P99 latency of 100s

- Workload

  - The overall RPS trend starts low and rises relatively quickly until T=500 to evaluate how each autoscaler and configuration reacts to a rapid load increase. It then drops quickly back to a low RPS to evaluate scale-down behavior, and finally rises again slowly (a rough traffic-replay sketch follows this list).
  - Average RPS trend: 0.5 RPS -> 2 RPS -> 4 RPS -> 5 RPS -> 1 RPS -> 3 RPS
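
The following is a rough, illustrative way to replay such a stepped RPS trend with plain curl; the real experiments used a dedicated workload generator, and the gateway endpoint, payload, and 300-second step length here are assumptions.

.. code-block:: bash

   # Illustrative traffic replay: hold each RPS level for 5 minutes.
   for rps in 0.5 2 4 5 1 3; do
     end=$((SECONDS + 300))
     while [ "$SECONDS" -lt "$end" ]; do
       curl -s http://localhost:8888/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{"model": "deepseek-llm-7b-chat", "messages": [{"role": "user", "content": "hi"}]}' > /dev/null &
       sleep "$(awk "BEGIN{print 1/$rps}")"
     done
   done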


Experiment Results
^^^^^^^^^^^^^^^^^^

- gpu_cache_usage_perc: 70

.. image:: ../../assets/images/autoscaler/optimizer-based-autoscaling-70-results.png
   :alt: result
   :width: 720px
   :align: center
2 changes: 1 addition & 1 deletion docs/source/features/lora-dynamic-loading.rst
@@ -199,7 +199,7 @@ This is used to build the abstraction between controller manager and inference e
and remove the ``

.. code-block:: yaml
:emphasis-lines: 7
:emphasize-lines: 7
spec:
containers:
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -40,7 +40,7 @@ Documentation
features/gateway-plugins.rst
features/multi-node-inference.rst
features/heterogeneous-gpu.rst
features/autoscaling.rst
features/autoscaling/autoscaling.rst
features/runtime.rst
features/distributed-kv-cache.rst

24 changes: 24 additions & 0 deletions samples/autoscaling/optimizer-kpa.yaml
@@ -0,0 +1,24 @@
apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: deepseek-r1-distill-llama-8b-optimizer-scaling
  namespace: default
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
    kpa.autoscaling.aibrix.ai/scale-down-delay: 0s
spec:
  scalingStrategy: KPA
  minReplicas: 1
  maxReplicas: 8
  metricsSources:
    - endpoint: aibrix-gpu-optimizer.aibrix-system.svc.cluster.local:8080
      metricSourceType: domain
      path: /metrics/default/deepseek-r1-distill-llama-8b
      protocolType: http
      targetMetric: vllm:deployment_replicas
      targetValue: "100"
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-r1-distill-llama-8b
