Advanced Configuration Guide

This document describes configuration options available when deploying Intel® AI for Enterprise RAG.

Table of Contents

  1. Configuration File
  2. Pipeline Configuration
    1. What are Pipelines?
    2. Default Pipeline Configuration
    3. Switching to Gaudi (HPU) Pipeline
    4. Using external inference endpoint
    5. Resource Configuration Files
  3. Multi-Node Support and Storage Requirements
    1. Checking Your Default Storage Class
    2. Storage Options
    3. Persistent Volume Claims (PVC) Configuration
  4. Changing AI Models
    1. LLM (Large Language Model)
    2. Embedding Model
    3. Reranking Model
    4. Model Compatibility Notes
  5. Performance Tuning Tips
  6. Accuracy Tuning Tips
  7. Balloons - Topology-Aware Resource Scheduling
  8. Horizontal Pod Autoscaling (HPA)
  9. Additional Configuration Options
    1. Additional Pipelines
    2. Vector Database Selection
    3. Vector Store Database Storage Settings
    4. EDP Storage Types
    5. Reverse Proxy for External S3 Storage (NetApp ONTAP)
    6. Additional Settings for Running Telemetry
    7. Security Settings
    8. Trust Domain Extensions (TDX)
    9. Registry Configuration
    10. Local Image Building

Configuration File

Multiple options can be changed in the configuration file when deploying Intel® AI for Enterprise RAG.

For a complete reference of all available configuration options, see the sample configuration file: inventory/sample/config.yaml
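A common practice is to copy the sample file and edit the copy, keeping inventory/sample pristine. A minimal sketch (the repository layout is simulated inline so the commands are self-contained; the my-cluster directory name is hypothetical):

```shell
# Sketch: copy the sample config before editing it. In a real checkout, run
# only the cp command from the repository root; the mkdir/printf lines below
# just simulate the shipped repo layout.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p inventory/sample inventory/my-cluster
printf 'pipelines: []\n' > inventory/sample/config.yaml   # stand-in for the shipped sample
cp inventory/sample/config.yaml inventory/my-cluster/config.yaml
ls inventory/my-cluster/
```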

Pipeline Configuration

Pipelines are the core component of Intel® AI for Enterprise RAG, defining the AI workflow and resource allocation. This is the most important configuration section to understand.

What are Pipelines?

Pipelines define the complete AI workflow including:

  • Microservices: Individual components (LLM, embedding, retrieval, reranking, etc.)
  • Resource definitions: CPU, memory, and storage requirements for each service
  • Model configurations: Which AI models to use for each service
  • Service connections: How components communicate with each other

Default Pipeline Configuration

Default: CPU-optimized ChatQA pipeline

ChatQA Pipeline:

pipelines:
  - namespace: chatqa                                      # Default: chatqa
    samplePath: chatqa/reference-cpu.yaml                 # Default: CPU reference
    resourcesPath: chatqa/resources-reference-cpu.yaml    # Default: CPU resources
    modelConfigPath: chatqa/resources-model-cpu.yaml      # Default: CPU models
    type: chatqa                                          # Default: chatqa

Document Summarization (Docsum) Pipeline:

The sample configuration file for Docsum is available at inventory/sample/config_docsum.yaml.

# Configuration for Docsum
pipelines:
  - namespace: docsum                                      # Namespace: docsum
    samplePath: docsum/reference-cpu.yaml                 # CPU reference for Docsum
    resourcesPath: docsum/resources-reference-cpu.yaml    # CPU resources for Docsum
    modelConfigPath: chatqa/resources-model-cpu.yaml      # CPU models (shared with ChatQA)
    type: docsum                                          # Pipeline type: docsum

Switching to Gaudi (HPU) Pipeline

For Intel Gaudi AI accelerators:

ChatQA Pipeline:

gaudi_operator: true              # Default: false
habana_driver_version: "1.22.1-6"

pipelines:
  - namespace: chatqa
    samplePath: chatqa/reference-hpu.yaml
    resourcesPath: chatqa/resources-reference-hpu.yaml
    modelConfigPath: chatqa/resources-model-hpu.yaml
    type: chatqa

Docsum Pipeline:

gaudi_operator: true              # Default: false
habana_driver_version: "1.22.1-6"

pipelines:
  - namespace: docsum
    samplePath: docsum/reference-hpu.yaml
    resourcesPath: docsum/resources-reference-hpu.yaml
    modelConfigPath: chatqa/resources-model-hpu.yaml
    type: docsum

Using external inference endpoint

An external inference endpoint with an OpenAI-compatible API can also be used:

pipelines:
  - namespace: chatqa
    samplePath: chatqa/reference-external-endpoint.yaml
    resourcesPath: chatqa/resources-reference-external-endpoint.yaml
    modelConfigPath: chatqa/resources-model-cpu.yaml
    type: chatqa

This requires additional configuration in the llm step of reference-external-endpoint.yaml, e.g.:

      - name: Llm
        data: $response
        dependency: Hard
        internalService:
          serviceName: llm-svc
          config:
            endpoint: /v1/chat/completions
            LLM_MODEL_SERVER: vllm
            LLM_MODEL_SERVER_ENDPOINT: example.com
            LLM_MODEL_NAME: model-name

This supports two types of authentication:

  • OAuth
  • API key

Refer to the llm-usvc-readme for configuration.

Resource Configuration Files

Each pipeline uses three configuration files:

  1. Sample Path (reference-cpu.yaml): Defines pipeline structure and service connections
  2. Resources Path (resources-reference-cpu.yaml): Sets CPU, memory, and storage limits
  3. Model Config Path (resources-model-cpu.yaml): Specifies model loading parameters

Note

To reduce vLLM resource usage, you can reduce the number of CPUs in your inventory configuration. Keep in mind that vLLM needs to be within a single NUMA node for optimal performance.
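To size the vLLM CPU allocation, you can first check how many CPUs each NUMA node provides. A minimal sketch (lscpu output is simulated inline and a simple contiguous CPU range is assumed; on a real host use `lscpu | grep NUMA`):

```shell
# Sketch: determine the CPU count of a NUMA node so the vLLM CPU request in
# the inventory can be kept within a single node. The lscpu output below is
# simulated with a heredoc.
lscpu_out=$(cat <<'EOF'
NUMA node(s):        2
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
EOF
)
# On a real host: lscpu_out=$(lscpu | grep NUMA)
node0_range=$(printf '%s\n' "$lscpu_out" | awk '/node0/ {print $NF}')
start=${node0_range%-*}
end=${node0_range#*-}
node0_cpus=$((end - start + 1))
echo "CPUs in NUMA node0: ${node0_cpus}"
```

Keep the per-instance vLLM CPU request at or below this count so each instance fits within one NUMA node.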

Multi-Node Support and Storage Requirements

Note

Storage configuration is part of infrastructure setup and should be configured before application deployment.

Default: install_csi: "local-path-provisioner" (single-node only)

For multi-node Kubernetes clusters, you need storage supporting ReadWriteMany (RWX) access mode.

You can use any CSI driver that supports storageClass with RWX. If you don't have such a CSI driver on your K8s cluster, you can install it by following the Infrastructure Components Guide, which provides options for NFS and NetApp Trident storage drivers.

Checking Your Default Storage Class

Critical: Intel® AI for Enterprise RAG only works if your chosen storage class is set as the default. Verify this before deployment:

# Check current default storage class
kubectl get storageclass

# Look for one marked with (default)
NAME                 PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path           rancher.io/local-path          Delete          WaitForFirstConsumer   false                  5d
nfs-csi (default)    nfs.csi.k8s.io                 Delete          Immediate              false                  2d

# If your desired storage class is not default, set it:
kubectl patch storageclass <your-storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Remove default from other storage classes if needed:
kubectl patch storageclass <other-storage-class> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
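The default-class check can also be scripted. A sketch of the parsing logic, with the `kubectl get storageclass` output simulated inline so it runs without a live cluster:

```shell
# Sketch: extract the name of the default storage class from
# `kubectl get storageclass` output (simulated here with a heredoc).
sc_list=$(cat <<'EOF'
NAME                 PROVISIONER                    RECLAIMPOLICY
local-path           rancher.io/local-path          Delete
nfs-csi (default)    nfs.csi.k8s.io                 Delete
EOF
)
# On a real cluster: sc_list=$(kubectl get storageclass)
default_sc=$(printf '%s\n' "$sc_list" | awk '/\(default\)/ {print $1}')
echo "Default storage class: ${default_sc:-NONE}"
```

If the result is NONE, set a default with the kubectl patch command shown above before deploying.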

Storage Options

NFS Storage (Recommended for Multi-Node)

install_csi: "nfs"                # Default: "local-path-provisioner"
nfs_node_name: "master-1"
nfs_host_path: "/opt/nfs-data"

NetApp Trident (Enterprise Storage)

install_csi: "netapp-trident"
# Configure ONTAP backend settings
ontap_management_lif: "10.0.0.100"
ontap_svm: "your-svm-name"
# ... additional NetApp configuration

For detailed NetApp Trident configuration instructions, see: NetApp Trident Integration Guide

Local Storage (Single-Node Only)

install_csi: "local-path-provisioner"  # Default

Persistent Volume Claims (PVC) Configuration

Default: ReadWriteOnce

If you are working on a multi-node cluster, change accessMode to ReadWriteMany.

# Optional: Customize PVC settings
gmc:
  pvc:
    accessMode: "ReadWriteMany"        # Default: uses storage class default
    models:
      llm_model:
        name: "llm-pvc"
        storage: "50Gi"
        accessMode: "ReadWriteMany"
      embedding_model:
        name: "embedding-pvc" 
        storage: "20Gi"

Important: PVCs will only work correctly if your storage class supports the required access mode and is set as default.

Changing AI Models

Intel® AI for Enterprise RAG provides the ability to change LLM, embedding, and reranking models to suit your specific requirements and performance needs.

LLM (Large Language Model)

Default: casperhansen/llama-3-8b-instruct-awq

For CPU deployments, the list of supported models is available in: deployment/pipelines/chatqa/resources-model-cpu.yaml

For Gaudi deployments, the list of supported models is available in: deployment/pipelines/chatqa/resources-model-hpu.yaml

# Default for CPU deployments
llm_model: "casperhansen/llama-3-8b-instruct-awq"

# Default for Gaudi deployments  
llm_model_gaudi: "mistralai/Mixtral-8x7B-Instruct-v0.1"

Important:

  • Ensure your Hugging Face token has access to gated models like Llama
  • When changing LLM models, you may need to update the prompt template, as each model may have its own suggested prompt format described on its Hugging Face model page. Note that the default prompt template is written in English, which may affect multilingual models.

Embedding Model

Default: BAAI/bge-base-en-v1.5

# Default
embedding_model_name: "BAAI/bge-base-en-v1.5"

Important:

  • Different embedding models have different vector dimensions
  • Check the vector dimensions length in the embedding model description on Hugging Face
  • Update the vector_dims setting in the vector_databases section of your inventory configuration to match the embedding model's output dimensions. The value of vector_dims must match hidden_size in the model's config.json, e.g. https://huggingface.co/BAAI/bge-base-en-v1.5/blob/main/config.json#L11

Example configuration update:

# In your inventory config file
vector_databases:
  enabled: true
  namespace: vdb
  vector_store: redis-cluster
  vector_datatype: FLOAT32
  vector_dims: 768  # Update this to match your embedding model's dimensions
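The consistency rule above can be checked with a quick script. This is a hypothetical sketch: both files are simulated inline (in practice, config.json comes from the model's Hugging Face page and the inventory file is your own config.yaml):

```shell
# Hypothetical consistency check: vector_dims in the inventory must equal
# hidden_size in the embedding model's config.json. Both files are simulated
# here so the check is self-contained.
workdir=$(mktemp -d)
cat > "$workdir/model_config.json" <<'EOF'
{ "hidden_size": 768, "model_type": "bert" }
EOF
cat > "$workdir/inventory.yaml" <<'EOF'
vector_databases:
  vector_dims: 768
EOF
model_dims=$(sed -n 's/.*"hidden_size": *\([0-9]*\).*/\1/p' "$workdir/model_config.json")
cfg_dims=$(sed -n 's/.*vector_dims: *\([0-9]*\).*/\1/p' "$workdir/inventory.yaml")
if [ "$model_dims" = "$cfg_dims" ]; then
  echo "vector_dims OK (${cfg_dims})"
else
  echo "MISMATCH: model=${model_dims} config=${cfg_dims}"
fi
```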

Reranking Model

Default: BAAI/bge-reranker-base

# Default
reranking_model_name: "BAAI/bge-reranker-base"

Model Compatibility Notes

When changing models, consider:

  • Resource requirements: Larger models need more CPU/memory/storage
  • Vector dimensions: Embedding models must match your vector database configuration
  • Prompt templates: LLM models may require specific prompt formats
  • Licensing: Some models require acceptance of license agreements on Hugging Face
  • Performance: Balance between accuracy and inference speed

Performance Tuning Tips

For comprehensive performance optimization guidance, see: Performance Tuning Guide

Accuracy Tuning Tips

For advanced techniques to improve retrieval and answer accuracy, see: Accuracy Tuning Guide

Balloons - Topology-Aware Resource Scheduling

Default: Enabled (balloons.enabled: true)

Balloons enables CPU pinning and NUMA-aware scheduling for optimal performance:

# Default configuration
balloons:
  enabled: true
  namespace: kube-system # alternatively, set custom namespace for balloons
  wait_timeout: 300 # timeout in seconds to wait for nri-plugin to be in ready state
  throughput_mode: true # set to true to optimize for horizontal scaling
  memory_overcommit_buffer: 0.1 # buffer (% of total memory) for pods using more memory than initially requested
  # vllm_custom_name: "kserve-container" # Optional: Custom container name for external vLLM

Benefits:

  • CPU pinning for consistent performance
  • NUMA-aware scheduling
  • Reduced context switching
  • Better cache locality

External vLLM Support:

The vllm_custom_name option allows you to pin CPU cores to external vLLM instances running within the same Kubernetes cluster. This is particularly useful when integrating with third-party AI platforms that deploy their own vLLM containers.

For example, Nutanix AI uses the container name kserve-container. To find the correct container name for your external vLLM deployment:

# Describe the pod running vLLM
kubectl describe pod <vllm-pod-name> -n <namespace>

# Look for the container name under spec.containers[].name

When configured, the NRI balloons policy will manage CPU resources for external vLLM instances specified by vllm_custom_name.
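Listing container names can also be done in one step with JSONPath. A sketch, with the pod spec simulated inline so the parsing is reproducible (the container names shown are placeholders):

```shell
# Sketch: list the container names in a vLLM pod to find the value for
# vllm_custom_name. The pod spec is simulated with a heredoc; on a real
# cluster use the kubectl command in the comment below instead.
pod_json=$(cat <<'EOF'
{"spec":{"containers":[{"name":"kserve-container"},{"name":"queue-proxy"}]}}
EOF
)
# Real-cluster equivalent:
#   kubectl get pod <vllm-pod-name> -n <namespace> \
#     -o jsonpath='{.spec.containers[*].name}'
names=$(printf '%s' "$pod_json" | grep -o '"name":"[^"]*"' | sed 's/"name":"\(.*\)"/\1/')
echo "$names"
```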

Important Deployment Considerations:

External vLLM instances typically require more CPU resources than other containers managed by the balloons policy. To ensure optimal performance and proper NUMA node isolation:

  1. Preview available resources: Before deploying external vLLM, run the topology preview to see available CPU resources on each node:
ansible-playbook -u $USER -K playbooks/application.yaml \
  --tags topology-preview \
  -e @inventory/sample/config.yaml

Example output showing available resources per node:

Node: localhost
  Inference Groups: 2
  Adjusted VLLM Size: 16
  Calculation Method: throughput_mode_adjustment
  Gaudi: False
  AMX Supported: True
  Inference memory request: 14.3%
  VLLM Replicas: 2
  VLLM CPU Size: 16
  Embedding Replicas: 2
  Embedding CPU Size: 4
  Reranking Replicas: 2
  Reranking CPU Size: 4

Use this information to determine the optimal number of vLLM replicas and their CPU allocation. The maximum CPU pool available for vLLM is VLLM Replicas multiplied by VLLM CPU Size; in this example, 2 × 16 = 32 vCPU.

  2. Deploy external vLLM first: Deploy your external vLLM instances with the proper number of replicas before deploying Intel® AI for Enterprise RAG
  3. Configure replicas appropriately: Ensure vLLM replicas are distributed to allow each instance to fit within a single NUMA node
  4. Deploy Intel® AI for Enterprise RAG second: After vLLM is running, deploy the RAG solution so the balloons policy can allocate remaining resources efficiently
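The CPU-pool calculation from the topology-preview output above can be sketched as:

```shell
# Sketch: compute the maximum CPU pool available for external vLLM from the
# topology-preview values (numbers taken from the example output above).
vllm_replicas=2
vllm_cpu_size=16
max_vllm_pool=$((vllm_replicas * vllm_cpu_size))
echo "Maximum vLLM CPU pool: ${max_vllm_pool} vCPU"
```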

Note: During the Intel® AI for Enterprise RAG deployment, the external vLLM containers will be automatically restarted to apply the balloons policy and ensure proper CPU pinning. This is a necessary step to enable NUMA-aware resource allocation for the external vLLM instances.

This deployment order ensures that the resource-intensive vLLM containers are properly isolated on NUMA nodes, and the balloons policy can then optimally allocate the remaining CPU resources to Intel® AI for Enterprise RAG components.

For detailed information, refer to: Balloons Policy Overview

Horizontal Pod Autoscaling (HPA)

Default: Enabled (hpaEnabled: true)

HPA automatically scales pods based on CPU/memory utilization:

# Default: enabled
hpaEnabled: true

What it does:

  • Monitors resource usage across pods
  • Automatically scales replicas up during high load
  • Scales down when load decreases
  • Ensures optimal performance during varying workloads

When to disable:

  • Fixed workload scenarios
  • When you prefer manual scaling control
  • Resource-constrained environments where scaling isn't beneficial

Additional Configuration Options

Additional Pipelines

Document Summarization (Docsum) Pipeline

The Document Summarization pipeline provides capabilities to generate summaries of documents. The pipeline processes documents through a sequence of microservices including TextExtractor, TextCompression, TextSplitter, and generates summaries using LLM services.

Configuration File: deployment/inventory/sample/config_docsum.yaml

Pipeline Definition: deployment/pipelines/docsum/

To test the Docsum pipeline after deployment:

./scripts/test_docsum.sh

For more details about the Docsum pipeline architecture and available configurations, refer to the Docsum Pipeline README.

Language Translation Pipeline

Note

Preview Status – not integrated into UI. This is a preview pipeline and is currently in active development. While core functionality is in place, it is not yet integrated into the RAG UI, and development and validation efforts are still ongoing.

This pipeline provides language translation capabilities using advanced Language Models from the ALMA family, where:

  • ALMA-7B-R model - recommended for CPU-based execution
  • ALMA-13B-R model - recommended for Gaudi-based (Habana) acceleration

To test the translation pipeline, first deploy it by following the instructions in Deployment Options → Installation, using a configuration file based on deployment/inventory/sample/config_language_translation.yaml.

Once deployed, run the provided shell script:

./scripts/test_translation.sh

AudioQnA Solution

AudioQnA is a solution that enables voice-based question answering, combining automatic speech recognition (ASR) and text-to-speech (TTS) capabilities with the ChatQnA pipeline. This allows users to ask questions using voice input and receive audio responses.

Enabling AudioQnA:

To enable AudioQnA functionality for testing with ChatQnA solutions, you need to set audio.enabled to true in your configuration file:

audio:
  enabled: true                                  # Default: false
  namespace: audio                               # Default: audio
  asr_model: "openai/whisper-small"             # Automatic Speech Recognition model
  tts_model: "microsoft/speecht5_tts"           # Text-to-Speech model

Configuration Options:

  • enabled: Set to true to deploy AudioQnA components alongside ChatQnA
  • namespace: Kubernetes namespace where audio services will be deployed
  • asr_model: Model used for converting speech to text (Automatic Speech Recognition). Check out the microservice page for more information.
  • tts_model: Model used for converting text to speech (Text-to-Speech). Check out the microservice page for more information.

Requirements:

  • AudioQnA works with ChatQnA pipelines (type: chatqa)
  • Audio services are deployed in a separate namespace but integrate with your ChatQnA solution
  • Both ASR and TTS models are downloaded and deployed when audio is enabled

Note

AudioQnA adds additional resource requirements to your deployment. Ensure your cluster has sufficient resources for both the ChatQnA pipeline and the audio processing services.

Vector Database Selection

Intel® AI for Enterprise RAG supports multiple vector database backends for storing and retrieving embeddings. You can select the appropriate database based on your deployment scale and requirements.

Available Options:

  1. redis-cluster - Multi-node Redis cluster with distributed vector search

    • Best for: Production deployments with large-scale data (1M+ vectors)
    • Features: High availability, horizontal scalability, distributed hash slots
  2. mssql - Microsoft SQL Server 2025 Express Edition with vector support

    • Best for: Organizations already using the Microsoft SQL Server ecosystem
    • Features: SQL-based vector operations, familiar SQL interface
    • Important: Requires accepting the Microsoft SQL Server EULA during deployment
    • Limitations: Role-based access control (RBAC) for vector databases is not supported with mssql, so be sure to set edp.rbac.enabled to false in config.yaml

Configuration:

Modify the vector_store parameter in your inventory configuration file (config.yaml):

vector_databases:
  enabled: true
  namespace: vdb
  vector_store: redis-cluster  # Options: redis, redis-cluster, mssql
  vector_datatype: FLOAT32
  vector_dims: 768  # Must match your embedding model dimensions

Microsoft SQL Server EULA Acceptance:

If you select mssql as your vector store, the deployment will pause and prompt you to accept the Microsoft SQL Server terms:

[vector_databases : Ask the operator to accept the EULA]
Do you accept the Microsoft SQL Server 2025 Express Edition EULA? [Y/N]
Type Y to accept, N to decline. Press ENTER to confirm.

Press Y and then ENTER to accept and continue. The deployment will not proceed without EULA acceptance.

Additional Resources:

For detailed information about each vector database implementation, storage configuration, monitoring, and operational guidance, refer to the documentation for your chosen vector database.

Vector Store Database Storage Settings

Note

The default settings are suitable for smaller deployments only (by default, approximately 5GB of data).

You can expand the storage configuration for both the Vector Store and SeaweedFS deployments by modifying their respective configurations:

If using EDP, update the deployment/edp/values.yaml file to increase the storage size under the persistence section. For example, set size: 100Gi to allocate 100GB of storage.

Similarly, for the selected Vector Store, you can increase the persistent storage size. This configuration is available in deployment/components/vector_databases/values.yaml. For example, set persistence.size: 100Gi to allocate 100GB of storage for Vector Store database data.

Note

The Vector Store should be allocated more storage than the file storage, because it holds both the extracted text and the vector embeddings for that data.

EDP Storage Types

By default, the EDP storage type is set to SeaweedFS, which deploys SeaweedFS in-cluster. For additional options, refer to the EDP documentation.

Reverse Proxy for External S3 Storage (NetApp ONTAP)

When using an external S3-compatible storage backend (e.g. NetApp ONTAP with the netapp-trident CSI driver), web browsers cannot reach the ONTAP data_lif directly because it is an internal network interface. The reverse_proxy_storage option instructs the ingress component to create a reverse proxy that routes browser file-upload traffic through the cluster ingress to the storage backend.

# Default: false
reverse_proxy_storage: true   # Enable ingress reverse proxy for external S3 storage

When reverse_proxy_storage: true, the ingress exposes the storage endpoint at s3.<FQDN> (e.g. s3.erag.com). You must also configure the EDP component to use this ingress hostname as its external URL, because the ONTAP data_lif is not reachable from outside the cluster:

edp:
  enabled: true
  rbac:
    enabled: false    # Must be disabled when using ONTAP
  storageType: s3compatible
  s3compatible:
    internalUrl: "https://<ontap_data_lif>"   # Used by in-cluster pods
    externalUrl: "https://s3.<your-fqdn>"     # Used by web browsers via ingress

Note

The s3.<FQDN> hostname must be registered in your DNS and added to your inventory as an ingress host. The deployment will print a warning during pre-install if reverse_proxy_storage is enabled but ontap_data_lif is not set.

For full NetApp ONTAP setup instructions see the NetApp Trident CSI Integration guide.

Additional Settings for Running Telemetry

Intel® AI for Enterprise RAG includes the installation of a telemetry stack by default, which requires setting the number of iwatch open descriptors on each cluster host. For more information, follow the instructions in Number of iwatch open descriptors.

Security Settings

Pod Security Standards (PSS) and certificate configuration for secure deployments:

# Defaults
enforcePSS: true                    # Default: true (Pod Security Standards enabled)
certs:
  autoGenerated: true               # Default: true (self-signed certificates)
  pathToCert: ""                    # Default: empty (auto-generated)
  pathToKey: ""                     # Default: empty (auto-generated)

Pod Security Standards (PSS):

  • When enforcePSS: true, namespaces are automatically labeled as "restricted" or "privileged" based on their security requirements
  • Restricted namespaces enforce stricter security policies (no privileged containers, restricted capabilities)
  • Privileged namespaces allow containers with elevated permissions when necessary
  • This helps maintain security compliance across the Kubernetes cluster

Certificate Configuration:

  • autoGenerated: true: Uses self-signed certificates for HTTPS endpoints
  • pathToCert and pathToKey: Specify custom SSL certificate and private key paths for production deployments
  • Custom certificates are recommended for production environments to avoid browser security warnings

Trust Domain Extensions (TDX)

Intel® Trust Domain Extensions (Intel® TDX) provides hardware-based trusted execution environments for confidential computing:

# Default: disabled (experimental feature)
tdx:
  enabled: false                    # Default: false
  td_type: "one-td"                # Default: "one-td"
  attestation:
    enabled: false                  # Default: false

Configuration Options:

  • enabled: Enables Intel TDX protection for microservices
  • td_type: Deployment type - "one-td" (single Trust Domain) or "coco" (Confidential Containers)
  • attestation.enabled: Enables TDX-based remote attestation for verification

Requirements:

  • 4th Gen Intel® Xeon® Scalable processors or later
  • Ubuntu 24.04 with TDX enabled
  • Compatible Kubernetes version (1.31+)

Only enable TDX if you have compatible Intel hardware and understand the experimental nature of this feature. For detailed TDX deployment instructions, see: TDX Deployment Guide

Registry Configuration

# Defaults
registry: "docker.io/opea"          # Default: public OPEA registry
tag: "1.5.0"                        # Default: current release tag
local_registry: false               # Default: false (use public registry)

Local Image Building

Intel® AI for Enterprise RAG provides the ability to build Docker images locally instead of using pre-built images from public registries. This is particularly useful for:

  • Custom modifications to microservices or components
  • Security requirements that mandate locally built and verified images
  • Development and testing of custom pipeline modifications

Single Node Clusters

For single node clusters, you can use the --setup-registry option in the update_images.sh script described in the Building Images Guide.

local_registry: false             # Use script-based registry setup

Multi-Node Clusters

For multi-node clusters, if you want to build your own images, you need to build a local registry accessible from the cluster. This can be done by setting:

local_registry: true                    # Enable local registry pod
insecure_registry: "<node-name>:32000"  # Registry endpoint accessible from cluster

Where <node-name> is the Kubernetes node name where you want to deploy the registry pod. You can check available node names with:

kubectl get nodes

This option creates a Kubernetes pod that runs a container registry and configures Docker and containerd so that images can be pushed to and pulled from it.
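Once the registry is reachable, locally built images are retagged with the registry prefix and pushed. A hypothetical sketch (the node name and image name are placeholders; the docker commands are echoed rather than executed so the snippet is self-contained):

```shell
# Hypothetical example: build the target image reference for the in-cluster
# registry and show the docker commands that would push it. node_name and
# image are assumptions, not values from this repository.
node_name="k8s-node-1"                 # assumption: node running the registry pod
registry="${node_name}:32000"
image="opea/llm-textgen:latest"        # assumption: a locally built image
target="${registry}/${image}"
echo "docker tag ${image} ${target}"
echo "docker push ${target}"
```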

Note

Installation of the local registry pod is performed by running the infrastructure.yaml playbook. For detailed instructions, see the Infrastructure Components Guide.

For detailed instructions on building images locally, including prerequisites, build processes, and troubleshooting, refer to the Building Images Guide.