29 changes: 17 additions & 12 deletions docs/.nav.yml
@@ -2,27 +2,32 @@ nav:
- Home:
- vLLM Spyre Plugin: README.md
- Getting Started:
- Installation: getting_started/installation.md
- Installation: getting_started/installation.md
- Deploying:
- Docker: deploying/docker.md
- Kubernetes: deploying/k8s.md
- Docker: deploying/docker.md
- Kubernetes: deploying/k8s.md
- Red Hat OpenShift AI: deploying/rhoai.md
- Examples:
- Offline Inference: examples/offline_inference
- Other: examples/other
- User Guide:
- Configuration: user_guide/configuration.md
- Environment Variables: user_guide/env_vars.md
- Supported Features: user_guide/supported_features.md
- Supported Models: user_guide/supported_models.md
- Configuration: user_guide/configuration.md
- Environment Variables: user_guide/env_vars.md
- Supported Features: user_guide/supported_features.md
- Supported Models: user_guide/supported_models.md
- Developer Guide:
- Contributing: contributing/README.md

- Getting Started:
- Installation: getting_started/installation.md
- Installation: getting_started/installation.md
- Deploying:
- Docker: deploying/docker.md
- Kubernetes: deploying/k8s.md
- Red Hat OpenShift AI: deploying/rhoai.md
- User Guide:
- Configuration: user_guide/configuration.md
- Environment Variables: user_guide/env_vars.md
- Supported Features: user_guide/supported_features.md
- Supported Models: user_guide/supported_models.md
- Configuration: user_guide/configuration.md
- Environment Variables: user_guide/env_vars.md
- Supported Features: user_guide/supported_features.md
- Supported Models: user_guide/supported_models.md
- Developer Guide:
- Contributing: contributing/README.md
14 changes: 11 additions & 3 deletions docs/deploying/k8s.md
@@ -61,6 +61,8 @@ The vLLM Documentation on [Deploying with Kubernetes](https://docs.vllm.ai/en/la
labels:
app: granite-8b-instruct
spec:
# Defaults to 600 seconds and must be set higher if your startupProbe needs to wait longer than that
progressDeadlineSeconds: 1200
replicas: 1
selector:
matchLabels:
@@ -70,6 +72,8 @@ The vLLM Documentation on [Deploying with Kubernetes](https://docs.vllm.ai/en/la
labels:
app: granite-8b-instruct
spec:
# Required for scheduling Spyre cards
schedulerName: aiu-scheduler
volumes:
- name: hf-cache-volume
persistentVolumeClaim:
@@ -127,15 +131,19 @@ The vLLM Documentation on [Deploying with Kubernetes](https://docs.vllm.ai/en/la
httpGet:
path: /health
port: 8000
# Long startup delays are necessary for graph compilation
initialDelaySeconds: 1200
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 600
periodSeconds: 5
startupProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 10
# Long startup delays are necessary for graph compilation
failureThreshold: 120
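# Startup window = periodSeconds (10) x failureThreshold (120) = 1200 seconds, matching progressDeadlineSeconds above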
---
apiVersion: v1
kind: Service
104 changes: 104 additions & 0 deletions docs/deploying/rhoai.md
@@ -0,0 +1,104 @@
# Using Red Hat OpenShift AI

[Red Hat OpenShift AI](https://www.redhat.com/en/products/ai/openshift-ai) (RHOAI) is a cloud-native AI platform that bundles many popular model-management projects, including [KServe](https://kserve.github.io/website/latest/).

This example shows how to use KServe with RHOAI to deploy a model on OpenShift, using a modelcar image to load the model without requiring any connection to the Hugging Face Hub.

## Deploying with KServe

!!! note "Prerequisites"
* A running Kubernetes cluster with RHOAI installed
* Image pull credentials for `registry.redhat.io/rhelai1`
* Spyre accelerators available in the cluster

<!-- TODO: Link to public docs for cluster setup -->
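If the cluster does not already have pull credentials for `registry.redhat.io`, one way to provide them is a docker-registry pull secret. This is only a sketch: the secret name `oci-registry` matches the `imagePullSecrets` entry used in the InferenceService below, and the username and token are placeholders for your own credentials.

```bash
oc create secret docker-registry oci-registry \
  --docker-server=registry.redhat.io \
  --docker-username=<registry-user> \
  --docker-password=<registry-token>
```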

1. Create a ServingRuntime to serve your models.

```yaml
oc apply -f - <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: vllm-spyre-runtime
annotations:
openshift.io/display-name: vLLM IBM Spyre ServingRuntime for KServe
opendatahub.io/recommended-accelerators: '["ibm.com/aiu_pf"]'
labels:
opendatahub.io/dashboard: "true"
spec:
multiModel: false
supportedModelFormats:
- autoSelect: true
name: vLLM
containers:
- name: kserve-container
image: quay.io/ibm-aiu/vllm-spyre:latest.amd64
args:
- /mnt/models
- --served-model-name={{.Name}}
env:
- name: HF_HOME
value: /tmp/hf_home
# Static batching configurations can also be set on each InferenceService (see the override sketch after this block)
- name: VLLM_SPYRE_WARMUP_BATCH_SIZES
value: '4'
- name: VLLM_SPYRE_WARMUP_PROMPT_LENS
value: '1024'
- name: VLLM_SPYRE_WARMUP_NEW_TOKENS
value: '256'
ports:
- containerPort: 8000
protocol: TCP
EOF
```
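The `VLLM_SPYRE_WARMUP_*` values above apply to every model served by this runtime. As the comment notes, they can also be overridden per InferenceService; a minimal sketch, assuming your KServe version accepts container-style `env` entries on the predictor model spec (the values here are illustrative, not recommendations):

```yaml
spec:
  predictor:
    model:
      env:
        - name: VLLM_SPYRE_WARMUP_BATCH_SIZES
          value: '8'
        - name: VLLM_SPYRE_WARMUP_PROMPT_LENS
          value: '2048'
```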

2. Create an InferenceService for each model you want to deploy. This example demonstrates how to deploy the [Granite](https://www.ibm.com/granite) model `ibm-granite/granite-3.1-8b-instruct`.

```yaml
oc apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
annotations:
openshift.io/display-name: granite-3-1-8b-instruct
serving.kserve.io/deploymentMode: RawDeployment
name: granite-3-1-8b-instruct
labels:
opendatahub.io/dashboard: 'true'
spec:
predictor:
imagePullSecrets:
- name: oci-registry
maxReplicas: 1
minReplicas: 1
model:
modelFormat:
name: vLLM
name: ''
resources:
limits:
ibm.com/aiu_pf: '1'
requests:
ibm.com/aiu_pf: '1'
runtime: vllm-spyre-runtime
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
volumeMounts:
- mountPath: /dev/shm
name: shm
schedulerName: aiu-scheduler
tolerations:
- effect: NoSchedule
key: ibm.com/aiu_pf
operator: Exists
spec:
volumes:
# This volume may need to be larger for bigger models or when running tensor-parallel inference across more cards
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
EOF
```

3. To test your InferenceService, refer to the [KServe documentation on model inference with vLLM](https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/text_generation/#perform-model-inference_1).
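As a quick smoke test (a sketch under a few assumptions, not the documented RHOAI access path), you can also port-forward the predictor and call vLLM's OpenAI-compatible API directly. The Deployment name `granite-3-1-8b-instruct-predictor` and the served model name are assumptions based on the example above; adjust them to your cluster, and use `/v1/models` to confirm the model name.

```bash
# Forward the predictor locally (the "-predictor" Deployment name is typical for
# KServe RawDeployment mode, but may differ in your cluster)
oc port-forward deployment/granite-3-1-8b-instruct-predictor 8000:8000 &

# List the model names exposed by vLLM's OpenAI-compatible API
curl -s http://localhost:8000/v1/models

# Send a small completion request; the model name must match --served-model-name
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "granite-3-1-8b-instruct", "prompt": "Hello, Spyre!", "max_tokens": 32}'
```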