Releases: opea-project/Enterprise-RAG
2.1.1: Intel® AI for Enterprise RAG - patch release
Getting Started
To deploy the Intel® AI for Enterprise RAG application, follow the instructions.
Highlights
- Enhanced Reliability of NetApp Trident Integration: Improved installation and cleanup processes for ONTAP deployment, ensuring consistent creation of backup objects and better stability when managing Trident drivers
- New Reverse Proxy Support for File Uploads: Introduced the reverse_proxy_storage option in config.yaml, with automation that enables file uploads directly through the web interface for ONTAP deployments.
- TDX Validation Complete for All Enterprise RAG 2.1 Solutions: Validated TDX compatibility for ChatQnA, AudioQnA, and DocSum, with documentation updates included
- Improved Nutanix Documentation: Enhanced clarity and completeness of guidance under docs/nutanix.
- Critical Vulnerabilities Resolved
Publications
- Give Your RAG a Voice: Building an Audio Q&A Experience with Intel® AI for Enterprise RAG
- Accelerate AI Value Creation with Nutanix and Intel® AI for Enterprise RAG
- Converging Paradigms: Architecting a Hybrid and Open Platform for Unified HPC and AI Workloads
- Starting With the End in Mind: Intel and Nutanix’s Blueprint for an Enterprise-Grade RAG Chatbot
Detailed changes
Deployment:
- fixed creation of the backup object when installing the Trident driver
- added tests that verify connectivity from worker nodes to the ONTAP data and management LIFs
- corrected cleanup when removing NetApp Trident drivers
- added the reverse_proxy_storage option in config.yaml and automation that allows uploading files via a web browser (see the sketch after this list)
- validated TDX with Enterprise RAG 2.1 for all the solutions (ChatQnA, AudioQnA, DocSum) and updated the documentation
- improved the documentation in docs/nutanix
- aligned the default embedding model server to vLLM in pipelines with an external endpoint
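For illustration, the new upload option might look like this in config.yaml (a minimal sketch; only the option name comes from this release, while its placement and boolean form are assumptions):

```yaml
# config.yaml - hedged sketch of the new reverse proxy upload option.
# Only the key name reverse_proxy_storage is from the release notes;
# its nesting and value type are assumptions.
reverse_proxy_storage: true   # route web-UI file uploads through the reverse proxy (ONTAP deployments)
```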
2.1.0: Intel® AI for Enterprise RAG
Getting Started
To deploy the Intel® AI for Enterprise RAG application, follow the instructions.
Highlights:
- New solution integrated! You can use the AudioQnA pipeline to transcribe audio prompts, have the chatbot's output read aloud, ingest audio data, and use a dedicated UI.
- New default embedding model server: vLLM.
- New storage layer: MinIO was replaced with SeaweedFS as the primary file‑storage backend.
- Extended Upgradability: full version tracking, automatic upgrade detection, and unified version metadata.
- Improved UI: chat pinning & search, bulk ingestion actions, DocSum strategy selection, and reduced bundle size via validation/markdown library refactors.
- Text Extractor upgrades: better PDF parsing, deeper PPTX/DOCX extraction, and audio file support.
- Document processing performance boosts: faster embedding with Celery parallelization and faster uploads with new upload‑optimized mode.
Detailed changes
AI / Development
New AudioQnA Solution
- ASR microservice using vLLM model server for transcription.
- TTS microservice built with FastAPI, enabling audio responses.
- Namespace Status Watcher microservice for AudioQnA health validation.
- Text Extractor extended to parse MP3/WAV for transcription.
New embedding model server: vLLM
- vLLM Embedding is now the default embedding backend.
- Removed LLM_OPENAI_FORMAT_STREAMING & LLM_CONNECTOR; only OpenAI‑style streaming is supported now.
Enhanced Dataprep Pipeline
- PDF text quality boosted via pymupdf4llm.
- Improved data extraction from PPT/PPTX/DOC/DOCX: full extraction of comments, SmartArt, notes, diagrams, and embedded Excel sheets.
- Added MP3/WAV ingestion (AudioQnA only).
- [preview] MS SQL Server 2025 added as a new alternative Vector Database
Upgradability & Versioning
- Full deployment lifecycle tracked via ConfigMap.
- Automatic detection of upgrade vs install vs refresh.
- Prevents unsupported downgrades or mismatched deployments.
- Unified version source at deployment/version.yaml (see the sketch after this list).
- Improved update_charts.py for automated chart and pyproject.toml updates.
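As a rough illustration of the unified version source, deployment/version.yaml might hold entries like the following (field names are assumptions for illustration, not the actual schema; consult the file in the release):

```yaml
# deployment/version.yaml - hypothetical sketch; the real schema and
# component list may differ.
appVersion: "2.1.0"
components:
  ui: "2.1.0"
  edp: "2.1.0"
  gmc: "2.1.0"
```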
Additional Features
- Cancellation wrapper in microservices (TTS, DocSum) to stop processing when user aborts.
- Accuracy Evaluator
- Added query‑type filtering.
- Bucket‑based filtering for simulating accuracy across bucket distributions.
Deployment
Storage Layer Update
- Replaced MinIO with SeaweedFS as the primary file‑storage backend.
- Added a fully managed Ansible deployment workflow for SeaweedFS.
- Updated EDP/UI logic to support SeaweedFS Advanced IAM using Bearer Token authentication, provided as an alternative to MinIO’s authentication model.
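A hedged sketch of what such a storage section could express (all key names are hypothetical; only the SeaweedFS backend and Bearer Token authentication come from the notes):

```yaml
# Hypothetical EDP storage configuration - illustration only.
storage:
  backend: seaweedfs
  auth:
    type: bearer   # SeaweedFS Advanced IAM token instead of MinIO-style access keys
```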
Document Processing Pipeline Enhancements
- Optimized the document embedding and ingestion workflow by replacing sequential batch processing with a high‑throughput parallel pipeline architecture.
- Improved overall performance and increased utilization of embedding/ingestion services.
[preview] Document Upload Mode
- This release introduces a new mechanism allowing the system to switch between two pipeline modes:
- ChatQnA Mode
- Standard operational mode enabling chat interactions.
- Full access to chat UI and EDP pipeline with default resource allocation.
- Document Upload Mode (Enhanced EDP Resource Allocation)
- Pipeline switches to a document‑upload–optimized configuration, allocating more resources to EDP components.
- Chat UI is automatically disabled to ensure system capacity and stability.
- Admin Area remains fully accessible, allowing operations, monitoring, and management tasks.
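A minimal sketch of how such a mode switch could be expressed in config.yaml (the key name pipeline_mode and its values are hypothetical; the release does not document the exact option here):

```yaml
# Hypothetical config.yaml toggle - illustration only.
pipeline_mode: document-upload   # shifts resources to EDP and disables the chat UI
# pipeline_mode: chatqna         # default mode with full chat access
```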
Additional Features
- Updated infrastructure automation scripts to support new Kubernetes versions
- Tested Kubernetes versions: 1.32.9 and 1.33.5
- Updated Gaudi stack to: 1.22.2-32
- NRI Balloons Controller - A Kubernetes mutating webhook was added that ensures selected pods wait for the NRI balloons DaemonSet to be ready.
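For context, a mutating admission webhook registers for pod creation and rewrites matching pods before they are scheduled. A hedged sketch of what such a registration could look like (all names, labels, and service wiring are assumptions, not the actual ERAG manifests):

```yaml
# Hypothetical sketch of a mutating webhook registration; the real chart
# may use different names, selectors, and namespaces.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: nri-balloons-guard
webhooks:
  - name: nri-balloons-guard.erag.local
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: nri-balloons-guard   # hypothetical webhook service name
        namespace: kube-system
        path: /mutate
    admissionReviewVersions: ["v1"]
    sideEffects: None
    objectSelector:
      matchLabels:
        wait-for-balloons: "true"  # only selected pods are mutated to wait for the DaemonSet
```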
User Interface
AudioQnA Application
- Introduced a standalone UI application for the AudioQnA pipeline.
- Users can record messages using microphone input.
- Users can play back a single response message (playback only; no pause functionality).
- Control Plane view now displays additional statuses for the Automatic Speech Recognition and Text‑to‑Speech microservices.
- Added support for uploading audio files via Data Ingestion (MP3 and WAV formats).
Additional Features
- Chat History now allows users to pin selected items to the top of the list.
- A search bar has been added to Chat History, enabling users to quickly find specific items.
- The Control Plane side panel can now be shown or hidden using a dedicated toggle button.
- Users can now perform Retry or Delete actions on multiple selected files and links within the Data Ingestion view.
- In the DocSum UI application, users can select a specific strategy before generating a summary.
Refactors
- Replaced the `yup` library with `zod` for input validation.
- Replaced the `react-markdown` library with `marked` for Markdown parsing.
- UI image build process has been optimized by removing redundant steps.
Telemetry
- vLLM dashboard in Grafana improved
- Dashboard for AudioQnA solution added
Known issues
- The default embedding model server has been updated to vLLM. However, late chunking is currently supported only when using TorchServe. vLLM does not support late chunking at this time.
- Late Chunking with similarity_search_with_siblings may exceed context. Using late chunking with search_type="similarity_search_with_siblings" may cause context overflow. It is recommended to use late chunking with the default search type, which does not include neighboring chunks.
- When playing back audio via Text‑to‑Speech (TTS) in the UI, the Stop action is not functional. Playback can only be interrupted by refreshing the page.
- Processing large TTS requests may cause the service to crash. A permanent fix is planned for a future release.
2.0.1: Intel® AI for Enterprise RAG
Getting Started
To deploy the Intel® AI for Enterprise RAG application, follow the instructions.
Highlights:
- Late Chunking Enhancements: Improved document ingestion performance in late chunking mode by up to 6×, ensured chunk alignment with original text for better accuracy, and added full telemetry for the late chunking microservice and its logs in Grafana.
- Two new publications on Enterprise RAG:
- vLLM CPU updated to v0.11.2
Detailed changes
AI / Development
- Improved late chunking ingestion performance by reducing TorchServe serialization overhead, resulting in up to 6× faster document ingestion.
- Updated chunk extraction in late chunking mode to pull text directly from the original document, improving accuracy and consistency.
- Upgraded vLLM CPU to v0.11.2 and added high-concurrency handling by increasing connection limits and keep-alive duration for long-running requests
Deployment
- Improved file upload speed by:
- Dynamically calculating Celery BATCH_SIZE.
- Increasing resources for the extractor pod.
- Upgradability Improvements:
- Implemented version tracking for all ERAG components and enabled UI to display deployment version.
- Added a post-upgrade integrity check to verify data retention. (Currently requires manual execution).
- Added a pre-upgrade health check to ensure upgrades occur on healthy deployments. (Currently requires manual execution).
User Interface
ChatQnA
Chat
- Enhanced chat conversation feed UX:
- Chat feed no longer auto-scrolls when a historical chat item is selected; the conversation now starts from the beginning. A Scroll to Bottom button is available for quick navigation.
- When a user sends a new message, the chat scrolls down instantly, ensuring the message is visible at the top of the feed. Remaining space is preserved for streamed responses.
- Chat feed no longer scrolls automatically during response streaming.
- These changes also resolve issues where users could not scroll up during long streamed answers.
- Fixed an issue with Chat History streaming across different chats. All history items and related data (messages, sources, user input, etc.) are now stored separately, preventing previous conflicts.
Admin Panel
- Fixed an issue where the vLLM service node in the Admin Panel’s Control Plane tab was sporadically marked red. StatefulSet state is now interpreted correctly.
- Fixed an issue where email addresses and URLs enclosed in angle brackets (e.g., `<email@example.com>`) were removed from UI output.
Telemetry
- Extended Redis telemetry probe timeouts to improve stability.
- Added late chunking microservice metrics and logs, visible in Grafana.
- Extended the Accuracy Evaluator with configurable paths for setup configuration and cluster credentials, allowing non-default locations.
Known issues
- Late Chunking with similarity_search_with_siblings may exceed context. Using late chunking with search_type="similarity_search_with_siblings" may cause context overflow. It is recommended to use late chunking with the default search type, which does not include neighboring chunks.
- Empty references and source indexes. The chatbot randomly provides answers with empty references and source indexes when the casperhansen/llama-3-8b-instruct-awq LLM model is used.
2.0.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Intel® AI for Enterprise RAG application, please follow the instructions.
Highlights:
- New use case added! You can now use Intel® AI for Enterprise RAG Document Summarization, with a separate pipeline and UI for text- and file-based summaries.
- Replaced Bitnami images with custom Helm charts for Redis, MongoDB, Postgres, Apisix, and Keycloak to limit third-party dependencies.
- Added automated balloon sizing and reboot-survivability features (Istio streamlining, RAG refresh CronJob) to maximize hardware utilization and improve automatic recovery.
- Added Active Directory support for enterprise authentication.
- Enabled external inference endpoint support for flexible hybrid deployments with remote LLM services.
- Introduced PLLuM models with Polish prompt templates.
Detailed changes
AI / Development
- Document Summarization pipeline integrated
- Added Active Directory support for seamless integration with enterprise applications
- Added support for an external inference endpoint for vLLM
- PLLuM models were integrated into the pipeline, together with automatic support for Polish prompt templates
- [preview] Introduced Late Chunking as a preview feature, an advanced text-processing technique that improves embedding quality by preserving more semantic context across chunk boundaries
- Added a fallback option for generating presigned URLs if the storage endpoint is not configured or not capable of token credential validation
- Parallelized LoadPdf in Text Extractor
- Made HF_TOKEN optional – if the model is not gated/restricted, you no longer need to pass HF_TOKEN
- Aligned the LLM microservice with the OpenAI API – it can now be easily used in third-party chains and pipelines
- vLLM HPU updated to v0.9.0.1+Gaudi-1.22.0
- Added `docs/accuracy_tuning_tips.md` with guidance for tuning accuracy with Late Chunking and other techniques
- Added `src/comps/vectorstores/CONTRIBUTING.md` with instructions on how to enable a new vector database in the pipeline
Deployment
- Replaced Bitnami images and Helm charts with self-created solutions for:
- Redis (vdb)
- Fingerprint and Chat history (MongoDB)
- EDP (PostgreSQL)
- Apisix
- Keycloak
- Added automated calculation of balloon sizes.
- Added balloons for torchserve-embedding component
- TDX with the One TD approach has been promoted to a production-ready feature
- [preview] Created an `installer.sh` script that allows deploying the entire solution on pre-configured software
- A series of features has been added for the pipeline to survive a cluster reboot:
- Istio streamlined – Istio is now applied at the beginning of deployment
- Added a `rag-watcher` CronJob to refresh RAG services after node reboot, ensuring clean startup and operation
- Upgradability:
- Metadata pre-upgrade verification implemented – compares metadata in the deployed pipeline with the metadata shipped with an upgrade
- Data consistency report added – reports the volume of user data in components of the deployed pipeline
User Interface
- Document Summarization UI added
- Users can summarize plain text or content from a document file (supported file extensions: DOC, DOCX, PDF, MD).
- Generated summaries are stored in client-side history (retained until the page is refreshed or the session ends).
- Admin Panel Tabs:
- Control Plane – Displays pipeline status.
- Telemetry & Authentication – Provides links to Grafana and Keycloak.
- ChatQnA UI - Admin Panel: Added support for filtering and sorting columns in data tables within the Data Ingestion tab.
Telemetry
- Introduced a new `enabled` flag for telemetry traces, allowing users to control whether traces are deployed (default: false; see the sketch after this list)
- Migrated the OpenTelemetry Collector base image from Ubuntu to Debian
- Upgraded telemetry components, including Grafana and associated Helm charts
- Updated instructions and behavior for accessing logs in Grafana's Explore view, reflecting changes in newer Grafana versions
- Added a new monitor for the DocSum pipeline
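For the new traces flag, the toggle might look roughly like this (the exact chart path to the value is an assumption; only the `enabled` flag and its false default come from the notes):

```yaml
# Hedged sketch - the chart key layout is assumed, not verified.
telemetry:
  traces:
    enabled: false   # default; set to true to deploy tracing components
```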
Known issues
- A performance regression was observed during data ingestion in the Enhanced Dataprep Pipeline. The pipeline is currently optimized for chat, which can slow down file uploads. If you have a lot to upload, consider a workaround: install the pipeline with balloons.enabled: False so that HPA scales the embedding services; after uploading the files, install-on-install with balloons.enabled: True for best chat performance (see the sketch after this list).
- It was observed that telemetry tracing might fail sporadically during deployment; tracing is therefore disabled for the moment.
- When telemetry tracing is enabled, only one component's spans are visible in Tempo. Expected behavior is to see spans for all eRAG microservices in the distributed trace.
- During late chunking, text decoding performed by the tokenizer introduces formatting changes compared to the original source (e.g., lowercase conversion, added separators). As a result, retrieved chunks may not fully match the original document.
- For the ChatQnA pipeline, the vLLM service node in the Admin Panel's Control Plane tab may sporadically be colored red, as a Not Ready state is read from the API for its StatefulSet
- Document Summarization drag and drop file upload doesn't work. Please use Browse Files.
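A sketch of the balloons workaround described in the first item above (the key comes from the note; its exact nesting in config.yaml may differ):

```yaml
# Before a bulk upload - let HPA scale the embedding services:
balloons:
  enabled: false
# After uploading, reinstall (install-on-install) with:
# balloons:
#   enabled: true   # restores best chat performance
```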
1.5.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Chat Q&A RAG application, please follow the instructions.
Highlights:
- Added EDP PostgreSQL migration strategy (default-enabled) for smoother upgrades
- Included source chunk text in guardrail / LLM output payloads for better traceability
- Simplified guardrails: system prompt template removed; only user prompt validated by default
- Implemented automatic MinIO–Keycloak OIDC self-healing cron job
- Added TorchServe balloon policies and Gaudi performance optimizations (incl. reranker pinning & auto vLLM scaling)
- Replaced TEI reranker with TorchServe reranker for improved efficiency
- Added Terraform scripts for AWS deployment plus configurable vector DB type & dimensions
- Enhanced Chat UI: source chunk dialog, stable history saving, Firefox interrupt fix
Detailed changes
AI / Development
- Implemented EDP database (PostgreSQL) migration strategy (enabled by default) to simplify upgrades
- Included chunk text in source metadata (LLM / output guard responses now return chunk content)
- Removed system prompt template from guardrails (only user prompt checked; reranked_docs and past answers still optional via Dataprep / output guardrails when enabled)
- Implemented cron job to auto-verify and reconfigure MinIO OIDC linkage with Keycloak (fixes stale presigned URL issues without admin action)
- Integrated latest GenAIComps core changes to accelerate microservice prototyping
Deployment
- Implemented balloon policies for TorchServe on Gaudi
- Implemented performance optimizations:
- Replaced TEI reranker with TorchServe reranker
- Added CPU pinning for TorchServe reranker
- Enabled automatic scaling of vLLM instances
- Added Terraform scripts to deploy ERAG on AWS
- Added configuration options for vector database type and vector dimensions to streamline embedding / reranker model changes
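A hedged sketch of how the vector database options in the last item might look (key names are illustrative assumptions, not the actual schema):

```yaml
# Hypothetical configuration - illustration only.
vectorStore:
  type: redis        # vector database type
  dimensions: 1024   # must match the embedding model's output size
```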
User Interface
Chat
- Added clickable source buttons that open a dialog showing retrieved chunks used to generate the answer
- Moved file download / external link actions to dialog footer (contextual buttons)
- Fixed Firefox error handling when interrupting streamed responses
- Set chat rename character limit to 250 (aligned with API constraint)
- Refactored chat history saving: background /save call now avoids unnecessary UI refresh and screen blinking unless a non-guardrails error occurs
Admin Panel
Control Plane
- Fixed sentiment scanner threshold argument range
- Added input validation and tooltip for Code Scanner supported languages
- Removed "Edit Service Arguments" button; "Confirm Changes" and "Cancel" now remain disabled until a modification is made
Data Ingestion
- Updated Processing Time column to display "N/A" for Uploaded state or zero start time
- Added UI performance optimizations to reduce unnecessary re-renders and screen blinking on data refetch
Telemetry
- Renamed GMC router metrics prefix from "llm" to "router" for clarity
- Added Grafana dashboards: E2E Time to First Token, E2E Pipeline Latency, Pre-LLM Pipeline Latency
- Fixed log visibility issue in Grafana when deploying pipeline via Kubespray
Known issues
- User can ask a question exceeding the word limit, resulting in a general error
- Random issue of the chatbot not providing a context-sensitive answer to a specific prompt although relevant content was provided
- Post-install Gaudi operator installation fails in slow network conditions
- Grafana Logs Drilldown fails with `grafana-lokiexplore-app` plugin version 1.0.27: opening "Explore → Logs → Show Logs" in Grafana may crash with the error `Error: Minified React error #130 ...`. This occurs with grafana-lokiexplore-app v1.0.27 (released 2025-09-17). As a workaround, downgrade the plugin to v1.0.26: edit the `telemetry-grafana` ConfigMap to pin version 1.0.26 (see the sketch below), then restart the `monitoring/telemetry-grafana-xxx-xxx` Kubernetes pod for the change to take effect. To verify, go to Grafana → Administration → Plugins, search for "Grafana Logs Drilldown", and confirm that the installed version is 1.0.26.
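A hypothetical sketch of the ConfigMap edit (the data key carrying the plugin pin may be named differently in the deployed telemetry-grafana ConfigMap; verify against your cluster):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: telemetry-grafana
  namespace: monitoring
data:
  # hypothetical key - pin the plugin to the known-good version
  plugins: "grafana-lokiexplore-app 1.0.26"
```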
1.4.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Chat Q&A RAG application, please follow the instructions.
Highlights:
Major new features and improvements:
- Chat History: Users can now save, rename, export, and delete chats.
- Source Attribution in UI: RAG sources used in responses are now visible and downloadable.
- Accuracy Evaluation: Integrated GenAIEvals scripts for RAG performance testing.
- Multi-node Deployment Support: Includes node discovery and NUMA-aware vLLM sizing.
- Velero Backup Integration: Automated backup/restore is now an optional part of the cluster lifecycle (if enabled in config.yaml).
- Detailed Ingestion Timing: Users can inspect time breakdowns for each ingestion stage.
- Large File Deletion Bug Fixed: Files with >10,000 chunks now fully deleted.
Detailed changes
AI/Development
- Introduced Chat History: Endpoint details in src/comps/chat_history.
- Ported Accuracy Evaluation scripts from OPEA's GenAIEvals to Enterprise RAG (src/tests/e2e/evals/evaluation/rag_eval).
- RAG Source Attribution: UI now displays which ingested documents contributed to answers; files are downloadable.
- Detailed EDP Timing: Clicking ingestion time reveals breakdown (text extraction, splitting, etc.).
- Translation Pipeline (Preview): API-accessible, not yet in UI. Details in deployment/README.md#additional-pipelines.
- Large File Deletion Fix: Files with >10,000 chunks now properly deleted.
Deployment
- Added multi-node deployment support.
- Introduced node discovery mechanism.
- Created balloons policy and HPA support for torchserve-reranker.
- Enabled NUMA-aware vLLM sizing and inventory-based configuration.
- Moved PCV section and model definitions to inventory.
- Automated NFS server installation in infrastructure.yaml post-install tasks.
- Added automated backup/restore playbooks.
- Moved Velero installation to infrastructure.
- Added Terraform deployment for Gaudi 3 node on IBM Cloud.
- Changed default to use HPA with balloons policy.
User Interface
Chat
- Chats saved in left panel; users can rename, export (JSON), or delete.
- If ingested data was used, sources appear below responses:
- Links open in new tab.
- Files are downloaded directly.
Admin Panel
Control Plane
- Configurable services marked with cog icon; only these are clickable.
Data Ingestion
- Clicking Processing Time shows stage durations:
- Standard: 00:00:06.239
- Compact: 6s 239ms
- Auto-refresh every 10s until final status (Error, Ingested, etc.); toggleable in settings.
- Bulk ingestion via .txt file: URLs separated by commas, spaces, or new lines (see the example after this list).
- Bucket Synchronization Dialog: Review and sync S3 discrepancies via UI.
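For example, a links file for bulk ingestion could look like this (URLs are placeholders):

```
https://example.com/docs/guide.html, https://example.com/docs/faq.html
https://example.com/whitepaper.pdf
```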
Known issues
- [API-only] Deleting >70 documents at once may result in incomplete deletion.
- [input guards] After enabling input guards, using a forbidden word will cause the next three consecutive user queries to be blocked due to chat history enforcement (N+3)
- [vllm-gaudi] When running Enterprise RAG on Gaudi with the default Mixtral 8x7B model, only a single HPU device will be utilized
1.3.2: Intel® AI for Enterprise RAG - patch release
Release Notes
Detailed Changes
AI/Development
- Fix for Header/Footer stripper in TextCompressor microservice
- Enhanced documentation for Performance Tuning Tips
Known issues
- For Qwen models, it's possible to see artifacts in the response.
1.3.1: Intel® AI for Enterprise RAG - patch release
Release Notes
Highlights:
- Enhanced model support with six additional LLMs including Meta-Llama-3.1, Qwen3, and Mistral variants
- Upgraded vLLM version to 0.9.2
- Expanded testing capabilities with PubMed dataset support and fixes for e2e performance tests
Publications:
Detailed Changes
AI/Development
- Added support for the following models:
  - hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
  - meta-llama/Llama-3.1-8B-Instruct
  - Qwen/Qwen3-14B-AWQ
  - Qwen/Qwen3-14B
  - solidrust/Mistral-7B-Instruct-v0.3-AWQ
  - mistralai/Mistral-7B-Instruct-v0.3
- Upgraded vLLM version to 0.9.2
- Updated default resources for the standard redis and text-splitter microservices to avoid OOM errors
- Added support for custom templates in resources-model-cpu.yaml
- Added support for the PubMed dataset and fixed input token length in e2e performance tests
- Added a "Performance Tuning Guide" for Xeon deployment
Known issues
- For Qwen models, it's possible to see artifacts in the response.
1.3.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Chat Q&A RAG application, please follow the instructions.
Highlights:
- Retriever RBAC support: Document filtering based on user's access privileges to underlying S3 storage, enhancing security and data access control.
- Enhanced text extraction: Improved extraction for PDF, DOC, DOCX, and images including better hyperlink, table, and image text processing.
- Microservice architecture improvements: Split Dataprep into separate TextExtractor and TextSplitter services with new TextCompression microservice for cleaner document processing.
- Advanced retrieval algorithms: Added similarity_search_with_siblings algorithm to improve response accuracy by including adjacent chunks.
- Improved Redis implementation: Migrated to standalone namespace with Helm chart support for both single node and cluster setups for better performance.
- Backup/restore functionality: Added Velero-based backup and restore capabilities for Keycloak, EDP, and vector store database.
- UI Accessibility: Enhanced accessibility with React ARIA components and added syntax highlighting for code snippets.
Detailed changes
AI/Development
- Added Retriever RBAC support - document filtering based on user's access privileges to underlying S3 storage.
- Enhanced text extraction for PDF, DOC, DOCX, and images - improved hyperlink extraction, table text extraction, and image text extraction.
- Migrated text extraction from custom loader classes to Markitdown for ADOC, TXT, JSON, JSONL, CSV, XLSX, XLS, HTML, MD, XML, and YAML file formats.
- Introduced MarkdownSplitter for ADOC, MD, and HTML files to split text by sections and add this information to metadata.
- Added filename/URL and Section information to prompt template, improving responses to questions about document names.
- Split Dataprep microservice into separate TextExtractor and TextSplitter services.
- Introduced TextCompression microservice between TextExtractor and TextSplitter to clean and compress document text. More details here.
- Added similarity_search_with_siblings algorithm to retriever, configurable in Admin Panel, which improves response accuracy by including adjacent chunks.
- Enabled semantic chunking in Ansible and in the debug feature, with fixes for large files.
- Introduced Hierarchical Indexing for PDF files as an experimental feature, configurable via `config.yaml`. Learn more here.
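A hedged sketch of enabling the experimental feature (the actual key name in config.yaml is covered in the linked guide; this one is hypothetical):

```yaml
# Hypothetical config.yaml entry - illustration only.
hierarchical_indexing:
  enabled: true   # experimental; applies to PDF files
```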
User Interface
- Improved accessibility by refactoring UI components with React ARIA.
- Added syntax highlighting for code snippets in Chat.
- Implemented automatic scaling of ChatQnA pipeline graph size in Admin Panel - Control Plane.
Deployment
- Migrated Redis vector database from ChatQnA pipeline to standalone namespace.
- Deployed Redis via Helm chart - supporting both single node Redis and Redis-cluster for improved performance.
- Implemented balloons policy as an alternative method of pinning vLLM resources.
- Created backup/restore functionality using Velero for Keycloak, EDP, and the vector store database. Installation steps and the update and restore procedures are described in the documentation.
- Added support for deployment under user-defined domain names.
- Created Ansible scripts for simplified Kubernetes deployment.
- Added Ansible scripts for deploying Gaudi via operator.
Security
- Removed non-functional scanners from guardrails.
- Enabled remaining input guardrails in UI.
- Fixed and enhanced guardrails end-to-end tests.
- Enabled fingerprint capability for dataprep guardrail.
- Upgraded LLM Guard package to version 3.16.
Known issues
- When using Redis as a vector database, the default resource settings are not optimized, causing Redis to start with configurations that are unsuitable for production environments or intensive testing. To address this, remove the existing resource and persistence node configurations from here. Update it with the following settings:
```yaml
redis:
  (...)
  master:
    persistence:
      enabled: true
      size: "10Gi"
    resources:
      requests:
        cpu: 2
        memory: 4Gi
      limits:
        cpu: 16
        memory: 16Gi
  replica:
    persistence:
      enabled: true
      size: "10Gi"
    resources:
      requests:
        cpu: 2
        memory: 4Gi
      limits:
        cpu: 16
        memory: 16Gi
```
Note: The resource configuration for redis-cluster is not affected and is correctly set up by default.
1.2.1: Intel® AI for Enterprise RAG - patch release
Release Notes
Highlights:
- Enhanced Performance: Improved hardware support with Habana Gaudi 1.21.0 and implemented core pinning for vLLM pods, resulting in better inference performance.
- Optimized Model Deployment: Added pre-configured optimizations for LLM models and set a default quantized model (`llama-3-8b-instruct-awq`) for efficient CPU inference.
- Improved Infrastructure Flexibility: Added support for user-defined domain names and S3-compatible storage backends, with smarter resource management that prevents unnecessary MinIO service activation.
- Enhanced Data Processing: Improved Dataprep capabilities with extended link parsing for supported file types and added safeguards to prevent service hangs.
- Extended Hardware Support: Added TDX support in deployment scripts and fixed installation paths for Gaudi-based deployments.
Detailed Changes
AI/Development
- Updated Habana Gaudi to 1.21.0
- Dataprep – enabled parsing of links that target files (only extensions that are already supported), not only HTML
- Fixed parsing of the no_proxy parameter in EDP
- Added a timeout to Dataprep microservices to avoid indefinite hangs
- Fixed sticky sessions for the generic connector in the LLM microservice to enable load balancing across multiple replicas
Deployment
- Created a file with optimized configurations for running LLM models
- Set `casperhansen/llama-3-8b-instruct-awq` as the default quantized model for CPU inference
- Implemented a core pinning mechanism for vLLM pods to improve performance
- Enabled user-defined domain name configuration
- Added support for TDX in Ansible deployment scripts
- Documentation update - added detailed instructions on setting up S3 or S3-compatible storage as a backend in EDP
- MinIO service is no longer started when a different storage backend (e.g., S3 or S3-compatible) is configured in EDP, preventing unnecessary resource usage
- Resolved an issue with incorrect file paths in `install_chatqna.sh` for Gaudi-based installations – the script now uses "hpu" as expected
Known issues
- GMC can update variables passed in ConfigMaps or as environment variables. The scripts cannot propagate changes that do not apply to these objects.