This document describes configuration options available when deploying Intel® AI for Enterprise RAG.
- Configuration File
- Pipeline Configuration
- Multi-Node Support and Storage Requirements
- Changing AI Models
- Performance Tuning Tips
- Accuracy Tuning Tips
- Balloons - Topology-Aware Resource Scheduling
- Horizontal Pod Autoscaling (HPA)
- Additional Configuration Options
Multiple options can be changed in the configuration file when deploying Intel® AI for Enterprise RAG.
For a complete reference of all available configuration options, see the sample configuration file: inventory/sample/config.yaml
Pipelines are the core component of Intel® AI for Enterprise RAG, defining the AI workflow and resource allocation. This is the most important configuration section to understand.
Pipelines define the complete AI workflow including:
- Microservices: Individual components (LLM, embedding, retrieval, reranking, etc.)
- Resource definitions: CPU, memory, and storage requirements for each service
- Model configurations: Which AI models to use for each service
- Service connections: How components communicate with each other
Default: CPU-optimized ChatQA pipeline
ChatQA Pipeline:

```yaml
pipelines:
  - namespace: chatqa                                  # Default: chatqa
    samplePath: chatqa/reference-cpu.yaml              # Default: CPU reference
    resourcesPath: chatqa/resources-reference-cpu.yaml # Default: CPU resources
    modelConfigPath: chatqa/resources-model-cpu.yaml   # Default: CPU models
    type: chatqa                                       # Default: chatqa
```

Document Summarization (Docsum) Pipeline:
The sample configuration file for Docsum is available at inventory/sample/config_docsum.yaml.
```yaml
# Configuration for Docsum
pipelines:
  - namespace: docsum                                  # Namespace: docsum
    samplePath: docsum/reference-cpu.yaml              # CPU reference for Docsum
    resourcesPath: docsum/resources-reference-cpu.yaml # CPU resources for Docsum
    modelConfigPath: chatqa/resources-model-cpu.yaml   # CPU models (shared with ChatQA)
    type: docsum                                       # Pipeline type: docsum
```

For Intel Gaudi AI accelerators:
ChatQA Pipeline:
```yaml
gaudi_operator: true               # Default: false
habana_driver_version: "1.22.1-6"
pipelines:
  - namespace: chatqa
    samplePath: chatqa/reference-hpu.yaml
    resourcesPath: chatqa/resources-reference-hpu.yaml
    modelConfigPath: chatqa/resources-model-hpu.yaml
    type: chatqa
```

Docsum Pipeline:
```yaml
gaudi_operator: true               # Default: false
habana_driver_version: "1.22.1-6"
pipelines:
  - namespace: docsum
    samplePath: docsum/reference-hpu.yaml
    resourcesPath: docsum/resources-reference-hpu.yaml
    modelConfigPath: chatqa/resources-model-hpu.yaml
    type: docsum
```

An external inference endpoint with an OpenAI-compatible API can also be used:
```yaml
pipelines:
  - namespace: chatqa
    samplePath: chatqa/reference-external-endpoint.yaml
    resourcesPath: chatqa/resources-reference-external-endpoint.yaml
    modelConfigPath: chatqa/resources-model-cpu.yaml
    type: chatqa
```

This requires additional configuration in the llm step of reference-external-endpoint.yaml, e.g.:
```yaml
- name: Llm
  data: $response
  dependency: Hard
  internalService:
    serviceName: llm-svc
    config:
      endpoint: /v1/chat/completions
      LLM_MODEL_SERVER: vllm
      LLM_MODEL_SERVER_ENDPOINT: example.com
      LLM_MODEL_NAME: model-name
```

This supports two types of authentication:
- OAuth
- API key
Refer to the llm-usvc-readme for configuration.
Each pipeline uses three configuration files:
- Sample Path (reference-cpu.yaml): Defines pipeline structure and service connections
- Resources Path (resources-reference-cpu.yaml): Sets CPU, memory, and storage limits
- Model Config Path (resources-model-cpu.yaml): Specifies model loading parameters
Note
To reduce vLLM resource usage, you can reduce the number of CPUs in your inventory configuration. Keep in mind that vLLM needs to be within a single NUMA node for optimal performance.
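Before lowering the CPU count, it helps to know the host's NUMA layout; the commands below use standard util-linux tools and are not project-specific:

```shell
# Show the NUMA layout and total CPU count, so the vLLM CPU allocation
# in the resources file can be sized to fit inside a single NUMA node.
lscpu | grep -i 'numa'
nproc
```

The `NUMA nodeN CPU(s)` lines show which CPUs belong to each node; keep the vLLM CPU request at or below the size of one node.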
Note
Storage configuration is part of infrastructure setup and should be configured before application deployment.
Default: install_csi: "local-path-provisioner" (single-node only)
For multi-node Kubernetes clusters, you need storage supporting ReadWriteMany (RWX) access mode.
You can use any CSI driver that supports storageClass with RWX. If you don't have such a CSI driver on your K8s cluster, you can install it by following the Infrastructure Components Guide, which provides options for NFS and NetApp Trident storage drivers.
Critical: Intel® AI for Enterprise RAG only works if your chosen storage class is set as the default. Verify this before deployment:
```bash
# Check current default storage class
kubectl get storageclass
```

Look for the one marked with (default):

```text
NAME                PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path          rancher.io/local-path   Delete          WaitForFirstConsumer   false                  5d
nfs-csi (default)   nfs.csi.k8s.io          Delete          Immediate              false                  2d
```

```bash
# If your desired storage class is not default, set it:
kubectl patch storageclass <your-storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Remove default from other storage classes if needed:
kubectl patch storageclass <other-storage-class> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
```

NFS:

```yaml
install_csi: "nfs"   # Default: "local-path-provisioner"
nfs_node_name: "master-1"
nfs_host_path: "/opt/nfs-data"
```

NetApp Trident:

```yaml
install_csi: "netapp-trident"
# Configure ONTAP backend settings
ontap_management_lif: "10.0.0.100"
ontap_svm: "your-svm-name"
# ... additional NetApp configuration
```

For detailed NetApp Trident configuration instructions, see: NetApp Trident Integration Guide

Local path provisioner (default):

```yaml
install_csi: "local-path-provisioner"   # Default
```

Default: ReadWriteOnce
If you are working on a multi-node cluster, change accessMode to ReadWriteMany:
```yaml
# Optional: Customize PVC settings
gmc:
  pvc:
    accessMode: "ReadWriteMany"   # Default: uses storage class default
    models:
      llm_model:
        name: "llm-pvc"
        storage: "50Gi"
        accessMode: "ReadWriteMany"
      embedding_model:
        name: "embedding-pvc"
        storage: "20Gi"
```

Important: PVCs will only work correctly if your storage class supports the required access mode and is set as default.
Intel® AI for Enterprise RAG provides the ability to change LLM, embedding, and reranking models to suit your specific requirements and performance needs.
Default: casperhansen/llama-3-8b-instruct-awq
For CPU deployments, the list of supported models is available in: deployment/pipelines/chatqa/resources-model-cpu.yaml
For Gaudi deployments, the list of supported models is available in: deployment/pipelines/chatqa/resources-model-hpu.yaml
```yaml
# Default for CPU deployments
llm_model: "casperhansen/llama-3-8b-instruct-awq"

# Default for Gaudi deployments
llm_model_gaudi: "mistralai/Mixtral-8x7B-Instruct-v0.1"
```

Important:
- Ensure your Hugging Face token has access to gated models like Llama
- When changing LLM models, you might need to update the prompt template, as each model may have its own suggested prompt format, described on its Hugging Face model page. Changing the prompt might affect multilingual models, as the prompt template is written in English.
Default: BAAI/bge-base-en-v1.5
```yaml
# Default
embedding_model_name: "BAAI/bge-base-en-v1.5"
```

Important:
- Different embedding models have different vector dimensions
- Check the vector dimensions length in the embedding model description on Hugging Face
- Update the vector_dims setting in the vector_databases section of your inventory configuration to match the embedding model's output dimensions. The value of vector_dims must match hidden_size in the model's config.json, e.g. https://huggingface.co/BAAI/bge-base-en-v1.5/blob/main/config.json#L11
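This match can be sanity-checked before deployment; the sketch below uses a hypothetical helper, check_vector_dims, and sample values mirroring BAAI/bge-base-en-v1.5:

```python
import json

def check_vector_dims(model_config: dict, vector_dims: int) -> bool:
    """Return True if the inventory vector_dims matches the model's hidden_size."""
    return model_config.get("hidden_size") == vector_dims

# Example: BAAI/bge-base-en-v1.5 reports hidden_size 768 in its config.json
sample_config = json.loads('{"hidden_size": 768, "num_hidden_layers": 12}')
print(check_vector_dims(sample_config, 768))   # True
print(check_vector_dims(sample_config, 1024))  # False
```

In practice, load the downloaded config.json with json.load and compare it against the vector_dims value from your inventory file.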
Example configuration update:
```yaml
# In your inventory config file
vector_databases:
  enabled: true
  namespace: vdb
  vector_store: redis-cluster
  vector_datatype: FLOAT32
  vector_dims: 768   # Update this to match your embedding model's dimensions
```

Default: BAAI/bge-reranker-base
```yaml
# Default
reranking_model_name: "BAAI/bge-reranker-base"
```

When changing models, consider:
- Resource requirements: Larger models need more CPU/memory/storage
- Vector dimensions: Embedding models must match your vector database configuration
- Prompt templates: LLM models may require specific prompt formats
- Licensing: Some models require acceptance of license agreements on Hugging Face
- Performance: Balance between accuracy and inference speed
For comprehensive performance optimization guidance, see: Performance Tuning Guide
For advanced techniques to improve retrieval and answer accuracy, see: Accuracy Tuning Guide
Default: Enabled (balloons.enabled: true)
Balloons enables CPU pinning and NUMA-aware scheduling for optimal performance:
```yaml
# Default configuration
balloons:
  enabled: true
  namespace: kube-system          # alternatively, set custom namespace for balloons
  wait_timeout: 300               # timeout in seconds to wait for nri-plugin to be in ready state
  throughput_mode: true           # set to true to optimize for horizontal scaling
  memory_overcommit_buffer: 0.1   # buffer (% of total memory) for pods using more memory than initially requested
  # vllm_custom_name: "kserve-container"  # Optional: Custom container name for external vLLM
```

Benefits:
- CPU pinning for consistent performance
- NUMA-aware scheduling
- Reduced context switching
- Better cache locality
External vLLM Support:
The vllm_custom_name option allows you to pin CPU cores to external vLLM instances running within the same Kubernetes cluster. This is particularly useful when integrating with third-party AI platforms that deploy their own vLLM containers.
For example, Nutanix AI uses the container name kserve-container. To find the correct container name for your external vLLM deployment:
```bash
# Describe the pod running vLLM
kubectl describe pod <vllm-pod-name> -n <namespace>
# Look for the container name under spec.containers[].name
```

When configured, the NRI balloons policy will manage CPU resources for the external vLLM instances specified by vllm_custom_name.
Important Deployment Considerations:
External vLLM instances typically require more CPU resources than other containers managed by the balloons policy. To ensure optimal performance and proper NUMA node isolation:
- Preview available resources: Before deploying external vLLM, run the topology preview to see available CPU resources on each node:
```bash
ansible-playbook -u $USER -K playbooks/application.yaml \
  --tags topology-preview \
  -e @inventory/sample/config.yaml
```
Example output showing available resources per node:
```text
Node: localhost
  Inference Groups: 2
  Adjusted VLLM Size: 16
  Calculation Method: throughput_mode_adjustment
  Gaudi: False
  AMX Supported: True
  Inference memory request: 14.3%
  VLLM Replicas: 2
  VLLM CPU Size: 16
  Embedding Replicas: 2
  Embedding CPU Size: 4
  Reranking Replicas: 2
  Reranking CPU Size: 4
```
Use this information to determine the optimal number of vLLM replicas and their CPU allocation. The maximum pool available for vLLM is VLLM Replicas multiplied by VLLM CPU Size; in this case, 32 vCPUs.
- Deploy external vLLM first: Deploy your external vLLM instances with the proper number of replicas before deploying Intel® AI for Enterprise RAG
- Configure replicas appropriately: Ensure vLLM replicas are distributed to allow each instance to fit within a single NUMA node
- Deploy Intel® AI for Enterprise RAG second: After vLLM is running, deploy the RAG solution so the balloons policy can allocate remaining resources efficiently
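The pool calculation from the example topology preview can be written out directly; the numbers below come from that sample output:

```python
# Values from the example topology preview output
vllm_replicas = 2    # "VLLM Replicas"
vllm_cpu_size = 16   # "VLLM CPU Size"

# Maximum vCPU pool available for external vLLM instances
max_vllm_pool = vllm_replicas * vllm_cpu_size
print(max_vllm_pool)  # 32
```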
Note: During the Intel® AI for Enterprise RAG deployment, the external vLLM containers will be automatically restarted to apply the balloons policy and ensure proper CPU pinning. This is a necessary step to enable NUMA-aware resource allocation for the external vLLM instances.
This deployment order ensures that the resource-intensive vLLM containers are properly isolated on NUMA nodes, and the balloons policy can then optimally allocate the remaining CPU resources to Intel® AI for Enterprise RAG components.
For detailed information, refer to: Balloons Policy Overview
Default: Enabled (hpaEnabled: true)
HPA automatically scales pods based on CPU/memory utilization:
```yaml
# Default: enabled
hpaEnabled: true
```

What it does:
- Monitors resource usage across pods
- Automatically scales replicas up during high load
- Scales down when load decreases
- Ensures optimal performance during varying workloads
When to disable:
- Fixed workload scenarios
- When you prefer manual scaling control
- Resource-constrained environments where scaling isn't beneficial
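If any of these apply, autoscaling can be switched off in the inventory configuration:

```yaml
# Disable autoscaling and manage replica counts manually
hpaEnabled: false   # Default: true
```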
The Document Summarization pipeline provides capabilities to generate summaries of documents. The pipeline processes documents through a sequence of microservices including TextExtractor, TextCompression, TextSplitter, and generates summaries using LLM services.
Configuration File: deployment/inventory/sample/config_docsum.yaml
Pipeline Definition: deployment/pipelines/docsum/
To test the Docsum pipeline after deployment:
```bash
./scripts/test_docsum.sh
```

For more details about the Docsum pipeline architecture and available configurations, refer to the Docsum Pipeline README.
Note
Preview Status – not integrated into UI. This is a preview pipeline and is currently in active development. While core functionality is in place, it is not yet integrated into the RAG UI, and development and validation efforts are still ongoing.
This pipeline provides language translation capabilities using advanced Language Models from the ALMA family, where:
- ALMA-7B-R model - recommended for CPU-based execution
- ALMA-13B-R model - recommended for Gaudi-based (Habana) acceleration
To test the translation pipeline, first deploy it by following the instructions in Deployment Options → Installation, using a configuration file based on deployment/inventory/sample/config_language_translation.yaml.
Once deployed, run the provided shell script:
```bash
./scripts/test_translation.sh
```

AudioQnA is a solution that enables voice-based question answering, combining automatic speech recognition (ASR) and text-to-speech (TTS) capabilities with the ChatQnA pipeline. This allows users to ask questions using voice input and receive audio responses.
Enabling AudioQnA:
To enable AudioQnA functionality for testing with ChatQnA solutions, you need to set audio.enabled to true in your configuration file:
```yaml
audio:
  enabled: true                         # Default: false
  namespace: audio                      # Default: audio
  asr_model: "openai/whisper-small"     # Automatic Speech Recognition model
  tts_model: "microsoft/speecht5_tts"   # Text-to-Speech model
```

Configuration Options:
- enabled: Set to true to deploy AudioQnA components alongside ChatQnA
- namespace: Kubernetes namespace where audio services will be deployed
- asr_model: Model used for converting speech to text (Automatic Speech Recognition). Check out the microservice page for more information.
- tts_model: Model used for converting text to speech (Text-to-Speech). Check out the microservice page for more information.
Requirements:
- AudioQnA works with ChatQnA pipelines (type: chatqa)
- Audio services are deployed in a separate namespace but integrate with your ChatQnA solution
- Both ASR and TTS models are downloaded and deployed when audio is enabled
Note
AudioQnA adds additional resource requirements to your deployment. Ensure your cluster has sufficient resources for both the ChatQnA pipeline and the audio processing services.
Intel® AI for Enterprise RAG supports multiple vector database backends for storing and retrieving embeddings. You can select the appropriate database based on your deployment scale and requirements.
Available Options:
- redis-cluster - Multi-node Redis cluster with distributed vector search
  - Best for: Production deployments with large-scale data (1M+ vectors)
  - Features: High availability, horizontal scalability, distributed hash slots
- mssql - Microsoft SQL Server 2025 Express Edition with vector support
  - Best for: Organizations already using the Microsoft SQL Server ecosystem
  - Features: SQL-based vector operations, familiar SQL interface
  - Important: Requires accepting the Microsoft SQL Server EULA during deployment
  - Limitations: Role-based access control (RBAC) for vector databases is not supported with mssql, so be sure to set edp.rbac.enabled to false in config.yaml
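For example, a minimal config.yaml fragment selecting mssql pairs the vector store choice with disabled EDP RBAC (other vector_databases fields are omitted here):

```yaml
vector_databases:
  vector_store: mssql   # Requires EULA acceptance during deployment
edp:
  rbac:
    enabled: false      # RBAC is not supported with mssql
```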
Configuration:
Modify the vector_store parameter in your inventory configuration file (config.yaml):
```yaml
vector_databases:
  enabled: true
  namespace: vdb
  vector_store: redis-cluster   # Options: redis, redis-cluster, mssql
  vector_datatype: FLOAT32
  vector_dims: 768              # Must match your embedding model dimensions
```

Microsoft SQL Server EULA Acceptance:
If you select mssql as your vector store, the deployment will pause and prompt you to accept the Microsoft SQL Server terms:
```text
[vector_databases : Ask the operator to accept the EULA]
Do you accept the Microsoft SQL Server 2025 Express Edition EULA? [Y/N]
Type Y to accept, N to decline. Press ENTER to confirm.
```

Press Y and then ENTER to accept and continue. The deployment will not proceed without EULA acceptance.
Additional Resources:
For detailed information about each vector database implementation, storage configuration, monitoring, and operational guidance, refer to:
Note
The default settings are suitable for smaller deployments only (by default, approximately 5GB of data).
You can expand the storage configuration for both the Vector Store and SeaweedFS deployments by modifying their respective configurations:
If using EDP, update the deployment/edp/values.yaml file to increase the storage size under the persistence section. For example, set size: 100Gi to allocate 100GB of storage.
Similarly, for the selected Vector Store, you can increase the persistent storage size. This configuration is available in deployment/components/vector_databases/values.yaml. For example, set persistence.size: 100Gi to allocate 100GB of storage for Vector Store database data.
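Applied together, the two changes look like this; the persistence key layout is inferred from the descriptions above, so verify it against the actual values.yaml files:

```yaml
# deployment/edp/values.yaml
persistence:
  size: 100Gi   # File storage (SeaweedFS)
```

```yaml
# deployment/components/vector_databases/values.yaml
persistence:
  size: 100Gi   # Vector Store data; size this larger than the file storage
```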
Note
The Vector Store should be allocated more storage than the file storage, since it holds both the extracted text and the vector embeddings for that data.
By default, the EDP storage type is set to SeaweedFS, which deploys SeaweedFS in-cluster. For additional options, refer to the EDP documentation.
When using an external S3-compatible storage backend (e.g. NetApp ONTAP with the netapp-trident CSI driver), web browsers cannot reach the ONTAP data_lif directly because it is an internal network interface. The reverse_proxy_storage option instructs the ingress component to create a reverse proxy that routes browser file-upload traffic through the cluster ingress to the storage backend.
```yaml
# Default: false
reverse_proxy_storage: true   # Enable ingress reverse proxy for external S3 storage
```

When reverse_proxy_storage: true, the ingress exposes the storage endpoint at s3.<FQDN> (e.g. s3.erag.com). You must also configure the EDP component to use this ingress hostname as its external URL, because the ONTAP data_lif is not reachable from outside the cluster:
```yaml
edp:
  enabled: true
  rbac:
    enabled: false                            # Must be disabled when using ONTAP
  storageType: s3compatible
  s3compatible:
    internalUrl: "https://<ontap_data_lif>"   # Used by in-cluster pods
    externalUrl: "https://s3.<your-fqdn>"     # Used by web browsers via ingress
```

Note
The s3.<FQDN> hostname must be registered in your DNS and added to your inventory
as an ingress host. The deployment will print a warning during pre-install if
reverse_proxy_storage is enabled but ontap_data_lif is not set.
For full NetApp ONTAP setup instructions see the NetApp Trident CSI Integration guide.
Intel® AI for Enterprise RAG includes the installation of a telemetry stack by default, which requires setting the number of iwatch open descriptors on each cluster host. For more information, follow the instructions in Number of iwatch open descriptors.
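On Linux hosts, these limits typically correspond to the kernel's inotify sysctls; the commands below use standard procfs paths (not project-specific) to show the current values so you can compare them against the requirements in the linked instructions:

```shell
# Inspect the current inotify limits on a cluster host
cat /proc/sys/fs/inotify/max_user_instances
cat /proc/sys/fs/inotify/max_user_watches
```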
Pod Security Standards (PSS) and certificate configuration for secure deployments:
```yaml
# Defaults
enforcePSS: true        # Default: true (Pod Security Standards enabled)
certs:
  autoGenerated: true   # Default: true (self-signed certificates)
  pathToCert: ""        # Default: empty (auto-generated)
  pathToKey: ""         # Default: empty (auto-generated)
```

Pod Security Standards (PSS):
- When enforcePSS: true, namespaces are automatically labeled as "restricted" or "privileged" based on their security requirements
- Restricted namespaces enforce stricter security policies (no privileged containers, restricted capabilities)
- Privileged namespaces allow containers with elevated permissions when necessary
- This helps maintain security compliance across the Kubernetes cluster
Certificate Configuration:
- autoGenerated: true: Uses self-signed certificates for HTTPS endpoints
- pathToCert and pathToKey: Specify custom SSL certificate and private key paths for production deployments
- Custom certificates are recommended for production environments to avoid browser security warnings
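For a production deployment with custom certificates, the defaults would change along these lines; the file paths are illustrative only:

```yaml
certs:
  autoGenerated: false                      # Disable self-signed certificates
  pathToCert: "/etc/ssl/certs/erag.crt"     # Example path to your certificate
  pathToKey: "/etc/ssl/private/erag.key"    # Example path to your private key
```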
Intel® Trust Domain Extensions (Intel® TDX) provides hardware-based trusted execution environments for confidential computing:
```yaml
# Default: disabled (experimental feature)
tdx:
  enabled: false       # Default: false
  td_type: "one-td"    # Default: "one-td"
  attestation:
    enabled: false     # Default: false
```

Configuration Options:
- enabled: Enables Intel TDX protection for microservices
- td_type: Deployment type - "one-td" (single Trust Domain) or "coco" (Confidential Containers)
- attestation.enabled: Enables TDX-based remote attestation for verification
Requirements:
- 4th Gen Intel® Xeon® Scalable processors or later
- Ubuntu 24.04 with TDX enabled
- Compatible Kubernetes version (1.31+)
Only enable TDX if you have compatible Intel hardware and understand the experimental nature of this feature. For detailed TDX deployment instructions, see: TDX Deployment Guide
```yaml
# Defaults
registry: "docker.io/opea"   # Default: public OPEA registry
tag: "1.5.0"                 # Default: current release tag
local_registry: false        # Default: false (use public registry)
```

Intel® AI for Enterprise RAG provides the ability to build Docker images locally instead of using pre-built images from public registries. This is particularly useful for:
- Custom modifications to microservices or components
- Security requirements that mandate locally built and verified images
- Development and testing of custom pipeline modifications
For single node clusters, you can use the --setup-registry option in the update_images.sh script described in the Building Images Guide.
```yaml
local_registry: false   # Use script-based registry setup
```

For multi-node clusters, if you want to build your own images, you need to set up a local registry accessible from the cluster. This can be done by setting:
```yaml
local_registry: true                     # Enable local registry pod
insecure_registry: "<node-name>:32000"   # Registry endpoint accessible from the cluster
```

Where <node-name> is the Kubernetes node name where you want to deploy the registry pod. You can check available node names with:

```bash
kubectl get nodes
```

This option creates a Kubernetes pod that runs a registry and configures Docker and containerd so that images can be pushed to and pulled from it.
Note
Installation of the local registry pod is performed by running the infrastructure.yaml playbook. For detailed instructions, see the Infrastructure Components Guide.
For detailed instructions on building images locally, including prerequisites, build processes, and troubleshooting, refer to the Building Images Guide.