This guide provides recommendations for optimizing the performance of your Intel® AI for Enterprise RAG deployment.
- System Configuration Tips
- Component Scaling
- Runtime Parameter Tuning
- Horizontal Pod Autoscaling
- Balloons Policy
- Monitoring and Validation
## System Configuration Tips

- To modify the LLM model, change the `llm_model` in config.yaml before deploying the pipeline.
- All supported LLM models are listed here.

```yaml
# Example configuration
llm_model: "casperhansen/llama-3-8b-instruct-awq"
```

- Intel® AI for Enterprise RAG supports the following vector database backends. Choose based on your deployment scale:
- redis-cluster: Multi-node cluster for production and large-scale deployments (1M+ vectors)
- mssql: Microsoft SQL Server 2025 Express Edition for SQL-based vector operations
- Modify the `vector_store` parameter in config.yaml:

```yaml
vector_databases:
  enabled: true
  namespace: vdb
  vector_store: redis-cluster # Options: redis-cluster, mssql
```

> **Note**
> Selecting mssql requires accepting the Microsoft SQL Server EULA during deployment. Also, role-based access control (RBAC) for vector databases is not supported when using mssql. For detailed information about vector database options, see Vector Database Selection.
Starting with Redis 8.2, use SVS-VAMANA as the recommended default vector index backend for Enterprise RAG deployments, especially for medium/large datasets. It is optimized for better memory efficiency and query throughput while keeping high recall.
This is configurable via deployment/inventory/**/config.yaml as follows:
```yaml
edp:
  ingestion:
    config:
      vector_algorithm: "SVS-VAMANA"
      vector_datatype: "FLOAT32"
      vector_distance_metric: "COSINE"
```

If your priority is faster index build time or compatibility with existing tuning profiles, HNSW remains a valid alternative.
For detailed trade-offs and parameter tuning, see Redis documentation:
- Vector indexes overview: https://redis.io/docs/latest/develop/ai/search-and-query/vectors/
- SVS-VAMANA reference and parameters: https://redis.io/docs/latest/develop/ai/search-and-query/vectors/#svs-vamana-index
- SVS compression and tuning options: https://redis.io/docs/latest/develop/ai/search-and-query/vectors/svs-compression/
- SVS-VAMANA perf comparison details: https://redis.io/blog/tech-dive-comprehensive-compression-leveraging-quantization-and-dimensionality-reduction/
Note that changing index settings can require additional RAM and storage for the vector database, since new indexes may be created before old ones are removed. This operation might be time-consuming for large datasets.
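To get a feel for how much RAM an index needs, you can estimate from the vector count, dimensionality, and datatype. The sketch below is a rough back-of-the-envelope helper, not an official sizing tool; the 2x `overhead` multiplier is an assumption standing in for index structures (graph edges, metadata), and it also approximates the transient doubling described above when a new index is built before the old one is removed.

```python
def estimate_vector_memory_gib(num_vectors, dims, dtype_bytes=4, overhead=2.0):
    """Rough RAM estimate for a vector index.

    dtype_bytes=4 corresponds to FLOAT32. `overhead` is an assumed
    multiplier for index structures; actual overhead depends on the
    index type (SVS-VAMANA vs. HNSW) and its parameters.
    """
    raw_bytes = num_vectors * dims * dtype_bytes
    return raw_bytes * overhead / (1024 ** 3)

# 1M vectors of dimension 768 stored as FLOAT32:
# raw data alone is ~2.86 GiB; with the assumed 2x overhead, ~5.7 GiB.
print(round(estimate_vector_memory_gib(1_000_000, 768), 1))
```

Numbers like these are only a starting point; validate against actual Redis memory usage after ingestion.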
Ensure the Redis instances have enough resources assigned, both compute and storage. This is configurable via deployment/inventory/**/config.yaml as follows:

```yaml
vector_databases:
  enabled: true
  namespace: vdb
  vector_store: redis-cluster
  redis-cluster:
    persistence:
      size: "30Gi"
    resources:
      requests:
        cpu: 8
        memory: 16Gi
      limits:
        cpu: 16
        memory: 128Gi
```

In the case of redis-cluster, all of the above settings apply to each cluster node.
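Because the per-node settings are multiplied by the number of cluster nodes, it is worth sanity-checking the total cluster footprint before deploying. A minimal arithmetic sketch, where the node count of 6 is purely an assumption (e.g., 3 masters plus 3 replicas) and not a value from this guide:

```python
def cluster_totals(nodes, cpu_request, mem_request_gib, storage_gib):
    """Total resource requests for a redis-cluster where every node
    receives the same per-node values, as in the config above.
    nodes=6 in the example call is an assumed cluster size."""
    return {
        "cpu": nodes * cpu_request,
        "memory_gib": nodes * mem_request_gib,
        "storage_gib": nodes * storage_gib,
    }

# Per-node requests of 8 CPU / 16Gi memory / 30Gi storage across 6 nodes:
print(cluster_totals(6, 8, 16, 30))
```

Compare the totals against the allocatable resources of the nodes hosting the vdb namespace.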
## Component Scaling

- Match the number of TeiRerank replicas to the number of CPU sockets on your machine for optimal performance.
- Adjust parameters in resources-reference-cpu.yaml.

```yaml
# Example for a 2-socket system
teirerank:
  replicas: 2 # Set to number of CPU sockets
```

> **Note**
> Automatic Configuration: When Balloons Policy is enabled (balloons.enabled: true in config.yaml), the system automatically discovers node topology and calculates the optimal vLLM replica distribution. Manual configuration is only required when balloons.enabled: false.
Manual Configuration:
- For machines with ≤64 physical cores per socket: use 1 replica per socket
- For machines with >64 physical cores per socket (e.g., 96 or 128): use 2 replicas per socket
- Adjust in resources-reference-cpu.yaml.

```yaml
# Example for a 2-socket system with ≤64 cores per socket
vllm:
  replicas: 2 # 1 replica per socket × 2 sockets
```

```yaml
# Example for a 2-socket system with >64 cores per socket
vllm:
  replicas: 4 # 2 replicas per socket × 2 sockets
```

- Additionally, if your machine has fewer than 32 physical cores per NUMA node, you need to reduce the number of CPU cores for vLLM:

```yaml
# Example for a system with only 24 cores per NUMA node
vllm:
  replicas: 1
  resources:
    requests:
      cpu: 24
      memory: 64Gi
    limits:
      cpu: 24
      memory: 100Gi
```

> **Note**
> Performance Tip: Consider enabling Sub-NUMA Clustering (SNC) in BIOS for better vLLM performance. This helps optimize memory access patterns across NUMA nodes.
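The manual replica rules above reduce to a one-line calculation. A minimal sketch (the function name is illustrative, not part of the deployment tooling):

```python
def vllm_replicas(sockets, cores_per_socket):
    """Manual-configuration rule from this guide:
    1 replica per socket for <=64 physical cores per socket,
    2 replicas per socket above 64."""
    per_socket = 2 if cores_per_socket > 64 else 1
    return sockets * per_socket

print(vllm_replicas(2, 64))  # 2-socket system, <=64 cores/socket -> 2
print(vllm_replicas(2, 96))  # 2-socket system, >64 cores/socket -> 4
```

With balloons.enabled: true this calculation is performed automatically, so the helper only mirrors the manual path.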
- When running more than one vLLM instance and the system is accessed by multiple concurrent users (e.g., 64+ users), use at least 2 replicas of llm-usvc.
- Adjust parameters in resources-reference-cpu.yaml.

```yaml
llm-usvc:
  replicas: 2
```

## Runtime Parameter Tuning

You can adjust microservice parameters (e.g., top_k for the reranker, k for the retriever, max_new_tokens for the LLM) using one of these methods:
- Using the Admin Panel UI:
  - Navigate to the Admin Panel section in the UI
  - Find detailed instructions in UI features
- Using Configuration Scripts:
  - Utilize the helper scripts

> **Warning**
> Only parameters that don't require a microservice restart can be adjusted at runtime.
## Horizontal Pod Autoscaling

- Consider enabling HPA to allow the system to dynamically scale the required resources in the cluster.
- HPA can be enabled in config.yaml:

```yaml
hpaEnabled: true
```

## Balloons Policy

- Balloons Policy is responsible for assigning optimal resources to inference pods such as vLLM, embedding, and reranking, and it is crucial to the performance of the whole deployment.
- It can be enabled in config.yaml:

```yaml
balloons:
  enabled: true
  namespace: kube-system # alternatively, set custom namespace for balloons
  wait_timeout: 300 # timeout in seconds to wait for nri-plugin to be in ready state
  throughput_mode: true # set to true to optimize for horizontal scaling
  memory_overcommit_buffer: 0.1 # buffer (% of total memory) for pods using more memory than initially requested
```

## Monitoring and Validation

After making performance tuning changes, monitor system performance using:
- The built-in metrics dashboard
- Load testing with sample queries
- Memory and CPU utilization metrics
This will help validate that your changes have had the desired effect.
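For load testing with sample queries, a small concurrent harness is often enough to compare latencies before and after a tuning change. The sketch below is generic and assumes nothing about the deployment's API: `send_query` is a placeholder for whatever client call your pipeline exposes (e.g., an HTTP POST to the chat endpoint), and the stub in the example stands in for it.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(send_query, queries, concurrency=8):
    """Send `queries` through `send_query` using `concurrency` worker
    threads and return simple latency statistics in seconds."""
    def timed(query):
        start = time.perf_counter()
        send_query(query)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, queries))
    return {
        "requests": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
    }

# Example run with a stub in place of a real endpoint call:
stats = run_load_test(lambda q: time.sleep(0.01), ["sample query"] * 20, concurrency=4)
print(stats["requests"])
```

Run the same query set before and after a change (e.g., switching the vector index or adding vLLM replicas) and compare the mean and p95 values alongside the CPU and memory metrics from the dashboard.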