This guide provides recommendations for optimizing the performance of your Intel® AI for Enterprise RAG deployment.
- System Configuration Tips
- Component Scaling
- Runtime Parameter Tuning
- Horizontal Pod Autoscaling
- Balloons Policy
- Monitoring and Validation
## System Configuration Tips

- To modify the LLM model, change the `llm_model` in config.yaml before deploying the pipeline.
- All supported LLM models are listed here.

```yaml
# Example configuration
llm_model: "casperhansen/llama-3-8b-instruct-awq"
```

- Intel® AI for Enterprise RAG supports the following vector database backends. Choose based on your deployment scale:
- redis-cluster: Multi-node cluster for production and large-scale deployments (1M+ vectors)
- mssql: Microsoft SQL Server 2025 Express Edition for SQL-based vector operations
- Modify the `vector_store` parameter in config.yaml:

```yaml
vector_databases:
  enabled: true
  namespace: vdb
  vector_store: redis-cluster # Options: redis-cluster, mssql
```

> **Note**
> Selecting mssql requires accepting the Microsoft SQL Server EULA during deployment. Also, role-based access control (RBAC) for vector databases is not supported when using mssql. For detailed information about vector database options, see Vector Database Selection.
Starting with Redis 8.2, use SVS-VAMANA as the recommended default vector index backend for Enterprise RAG deployments, especially for medium/large datasets. It is optimized for better memory efficiency and query throughput while keeping high recall.
This is configurable via deployment/inventory/**/config.yaml as follows:
```yaml
edp:
  ingestion:
    config:
      vector_algorithm: "SVS-VAMANA"
      vector_datatype: "FLOAT32"
      vector_distance_metric: "COSINE"
```

If your priority is faster index build time or compatibility with existing tuning profiles, HNSW remains a valid alternative.
For detailed trade-offs and parameter tuning, see Redis documentation:
- Vector indexes overview: https://redis.io/docs/latest/develop/ai/search-and-query/vectors/
- SVS-VAMANA reference and parameters: https://redis.io/docs/latest/develop/ai/search-and-query/vectors/#svs-vamana-index
- SVS compression and tuning options: https://redis.io/docs/latest/develop/ai/search-and-query/vectors/svs-compression/
- SVS-VAMANA perf comparison details: https://redis.io/blog/tech-dive-comprehensive-compression-leveraging-quantization-and-dimensionality-reduction/
Note that changing index settings can require additional RAM and storage for the vector database, since new indexes may be created before old ones are removed. This operation might be time-consuming for large datasets.
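To get a feel for how much RAM an index needs, you can estimate from the vector count, dimensionality, and datatype. The sketch below is a rough back-of-the-envelope helper, not an official sizing tool; the 2x `overhead` multiplier is an assumption standing in for index structures (graph edges, metadata), and it also approximates the transient doubling described above when a new index is built before the old one is removed.

```python
def estimate_vector_memory_gib(num_vectors, dims, dtype_bytes=4, overhead=2.0):
    """Rough RAM estimate for a vector index.

    dtype_bytes=4 corresponds to FLOAT32. `overhead` is an assumed
    multiplier for index structures; actual overhead depends on the
    index type (SVS-VAMANA vs. HNSW) and its parameters.
    """
    raw_bytes = num_vectors * dims * dtype_bytes
    return raw_bytes * overhead / (1024 ** 3)

# 1M vectors of dimension 768 stored as FLOAT32:
# raw data alone is ~2.86 GiB; with the assumed 2x overhead, ~5.7 GiB.
print(round(estimate_vector_memory_gib(1_000_000, 768), 1))
```

Numbers like these are only a starting point; validate against actual Redis memory usage after ingestion.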
Ensure the Redis instances have enough resources assigned, both compute and storage. This is configurable via deployment/inventory/**/config.yaml as follows:

```yaml
vector_databases:
  enabled: true
  namespace: vdb
  vector_store: redis-cluster
  redis-cluster:
    persistence:
      size: "30Gi"
    resources:
      requests:
        cpu: 8
        memory: 16Gi
      limits:
        cpu: 16
        memory: 128Gi
```

In the case of redis-cluster, all of the above settings apply to each cluster node.
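Because the per-node settings are multiplied by the number of cluster nodes, it is worth sanity-checking the total cluster footprint before deploying. A minimal arithmetic sketch, where the node count of 6 is purely an assumption (e.g., 3 masters plus 3 replicas) and not a value from this guide:

```python
def cluster_totals(nodes, cpu_request, mem_request_gib, storage_gib):
    """Total resource requests for a redis-cluster where every node
    receives the same per-node values, as in the config above.
    nodes=6 in the example call is an assumed cluster size."""
    return {
        "cpu": nodes * cpu_request,
        "memory_gib": nodes * mem_request_gib,
        "storage_gib": nodes * storage_gib,
    }

# Per-node requests of 8 CPU / 16Gi memory / 30Gi storage across 6 nodes:
print(cluster_totals(6, 8, 16, 30))
```

Compare the totals against the allocatable resources of the nodes hosting the vdb namespace.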
## Component Scaling

- Match the number of TeiRerank replicas to the number of CPU sockets on your machine for optimal performance.
- Adjust parameters in resources-reference-cpu.yaml.

```yaml
# Example for a 2-socket system
teirerank:
  replicas: 2 # Set to number of CPU sockets
```

> **Note**
> Automatic Configuration: When Balloons Policy is enabled (balloons.enabled: true in config.yaml), the system automatically discovers node topology and calculates the optimal vLLM replica distribution. Manual configuration is only required when balloons.enabled: false.
Manual Configuration:
- For machines with ≤64 physical cores per socket: use 1 replica per socket
- For machines with >64 physical cores per socket (e.g., 96 or 128): use 2 replicas per socket
- Adjust in resources-reference-cpu.yaml.

```yaml
# Example for a 2-socket system with ≤64 cores per socket
vllm:
  replicas: 2 # 1 replica per socket × 2 sockets
```

```yaml
# Example for a 2-socket system with >64 cores per socket
vllm:
  replicas: 4 # 2 replicas per socket × 2 sockets
```

- Additionally, if your machine has fewer than 32 physical cores per NUMA node, you need to reduce the number of CPU cores for vLLM:

```yaml
# Example for a system with only 24 cores per NUMA node
vllm:
  replicas: 1
  resources:
    requests:
      cpu: 24
      memory: 64Gi
    limits:
      cpu: 24
      memory: 100Gi
```

> **Note**
> Performance Tip: Consider enabling Sub-NUMA Clustering (SNC) in BIOS for better vLLM performance. This helps optimize memory access patterns across NUMA nodes.
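The manual replica rules above reduce to a one-line calculation. A minimal sketch (the function name is illustrative, not part of the deployment tooling):

```python
def vllm_replicas(sockets, cores_per_socket):
    """Manual-configuration rule from this guide:
    1 replica per socket for <=64 physical cores per socket,
    2 replicas per socket above 64."""
    per_socket = 2 if cores_per_socket > 64 else 1
    return sockets * per_socket

print(vllm_replicas(2, 64))  # 2-socket system, <=64 cores/socket -> 2
print(vllm_replicas(2, 96))  # 2-socket system, >64 cores/socket -> 4
```

With balloons.enabled: true this calculation is performed automatically, so the helper only mirrors the manual path.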
- When running more than one vLLM instance and the system is accessed by multiple concurrent users (e.g., 64+ users), use at least 2 replicas of llm-usvc.
- Adjust parameters in resources-reference-cpu.yaml.

```yaml
llm-usvc:
  replicas: 2
```

## Runtime Parameter Tuning

You can adjust microservice parameters (e.g., top_k for the reranker, k for the retriever, max_new_tokens for the LLM) using one of these methods:
- Using the Admin Panel UI:
  - Navigate to the Admin Panel section in the UI
  - Find detailed instructions in UI features
- Using Configuration Scripts:
  - Utilize the helper scripts

> **Warning**
> Only parameters that don't require a microservice restart can be adjusted at runtime.
## Horizontal Pod Autoscaling

- Consider enabling HPA to allow the system to dynamically scale the required resources in the cluster.
- HPA can be enabled in config.yaml:

```yaml
hpaEnabled: true
```

## Balloons Policy

- Balloons Policy is responsible for assigning optimal resources to inference pods such as vLLM, embedding, and reranking, and it is crucial to the performance of the whole deployment.
- It can be enabled in config.yaml:

```yaml
balloons:
  enabled: true
  namespace: kube-system # alternatively, set custom namespace for balloons
  wait_timeout: 300 # timeout in seconds to wait for nri-plugin to be in ready state
  throughput_mode: true # set to true to optimize for horizontal scaling
  memory_overcommit_buffer: 0.1 # buffer (% of total memory) for pods using more memory than initially requested
```

## Monitoring and Validation

After making performance tuning changes, monitor system performance using:
- The built-in metrics dashboard
- Load testing with sample queries
- Memory and CPU utilization metrics
This will help validate that your changes have had the desired effect.
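For load testing with sample queries, a small concurrent harness is often enough to compare latencies before and after a tuning change. The sketch below is generic and assumes nothing about the deployment's API: `send_query` is a placeholder for whatever client call your pipeline exposes (e.g., an HTTP POST to the chat endpoint), and the stub in the example stands in for it.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(send_query, queries, concurrency=8):
    """Send `queries` through `send_query` using `concurrency` worker
    threads and return simple latency statistics in seconds."""
    def timed(query):
        start = time.perf_counter()
        send_query(query)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, queries))
    return {
        "requests": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
    }

# Example run with a stub in place of a real endpoint call:
stats = run_load_test(lambda q: time.sleep(0.01), ["sample query"] * 20, concurrency=4)
print(stats["requests"])
```

Run the same query set before and after a change (e.g., switching the vector index or adding vLLM replicas) and compare the mean and p95 values alongside the CPU and memory metrics from the dashboard.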