paralect/rag-demo

AI-Powered Document Chat Application

A RAG (Retrieval Augmented Generation) system for question answering over your own documents. Upload PDF and DOCX files, ask questions, and get answers based solely on the content of those documents.

🎯 Key Features

  • Document Upload & Processing: Automatic extraction and processing of PDF and DOCX files
  • Semantic Search: Vector-based search using embeddings to find relevant information by meaning, not just keywords
  • RAG Implementation: Retrieval Augmented Generation ensures answers are based only on uploaded documents
  • Real-time Status Updates: Live file processing status via Socket.IO
  • Multi-file Support: Select and search across multiple documents simultaneously
  • Transparent Responses: System explicitly states when information isn't available in documents

🏗️ Architecture

Tech Stack

  • AI/ML:
    • Google Gemini API (@google/genai) for content generation
    • Gemini Embeddings (gemini-embedding-001) for vector embeddings
  • File Processing:
    • pdf-parse for PDF text extraction
    • mammoth for DOCX text extraction
  • Storage: Local file storage (with abstraction for future cloud migration)

🔄 How It Works

1. Document Upload & Processing

  1. File Upload: User uploads PDF or DOCX file via drag-and-drop
  2. Raw Storage: File saved to local storage with UUID-based naming
  3. Text Extraction:
    • PDF: Extracted using pdf-parse
    • DOCX: Extracted using mammoth
  4. Chunking: Text split at paragraph boundaries into chunks of roughly 2,000 characters
  5. Embedding Generation: Each chunk converted to vector embeddings using Gemini Embeddings API (batched, 15 chunks at a time)
  6. Storage:
    • Raw files stored locally
    • Chunks stored in MongoDB chunks collection with embeddings
    • File metadata stored in MongoDB files collection
  7. Status Updates: Real-time status updates (processing → ready/error) via Socket.IO
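
The chunking and batching steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `chunkByParagraphs` and `toBatches`, the blank-line paragraph rule, and the choice to keep oversized paragraphs whole are all assumptions. Only the ~2000-character target and the batch size of 15 come from the description above.

```typescript
// Hypothetical helpers: names and the exact splitting rule are assumptions.

/** Split text into chunks of roughly `maxChars`, breaking only at blank lines. */
function chunkByParagraphs(text: string, maxChars = 2000): string[] {
  const paragraphs = text
    .split(/\n\s*\n/) // a paragraph boundary is one or more blank lines
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars || current === "") {
      // Either the paragraph still fits, or it alone exceeds maxChars and is
      // kept whole rather than being cut mid-sentence.
      current = candidate;
    } else {
      chunks.push(current);
      current = paragraph;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

/** Group items into arrays of at most `size` elements (15 per embedding request). */
function toBatches<T>(items: T[], size = 15): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch returned by `toBatches(chunkByParagraphs(text))` would then be sent to the embeddings API in a single request, which keeps the number of API calls proportional to the document length divided by 15.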

2. Query Processing (RAG Flow)

  1. File Selection: User selects files via checkboxes
  2. Query Embedding: User's question converted to embedding vector
  3. Vector Search: MongoDB $vectorSearch finds top 3-10 most relevant chunks from selected files
  4. Context Injection: Relevant chunks injected into prompt as context
  5. Response Generation: Gemini model generates answer using the provided context
  6. Transparency: If information isn't available, system explicitly states so
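
The context injection and transparency steps above might look roughly like the sketch below. The `RetrievedChunk` shape, the `buildRagPrompt` name, and the exact instruction wording are assumptions made for illustration; the project's real prompt may differ.

```typescript
// Hypothetical shape of a chunk returned by vector search (field names assumed).
interface RetrievedChunk {
  fileName: string;
  text: string;
  score: number;
}

/** Build a prompt that restricts the model to the retrieved context. */
function buildRagPrompt(question: string, chunks: RetrievedChunk[]): string {
  // Number each chunk so the model (and the reader) can tell sources apart.
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.fileName})\n${chunk.text}`)
    .join("\n\n");

  return [
    "Answer the question using only the context below.",
    "If the context does not contain the answer, say that the information",
    "is not available in the selected documents.",
    "",
    "Context:",
    context || "(no relevant chunks found)",
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

The resulting string would be passed to the Gemini model as the request content. Putting the fallback instruction in the prompt itself is one simple way to get the explicit "not available" behavior described in step 6.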

3. Vector Search Implementation

  • Index: MongoDB vector search index on embedding field
  • Search Method: Semantic similarity using cosine distance
  • Filtering: Results filtered by selected fileIds
  • Scoring: Results include vectorSearchScore for relevance ranking
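
As a rough illustration of the points above, the aggregation pipeline could be assembled like this. The index name `vector_index`, the `text` and `fileId` field names, the `numCandidates` multiplier, and the `buildVectorSearchPipeline` helper are assumptions; the `embedding` path, the filter on selected file IDs, the 3-10 result range, and the `vectorSearchScore` metadata come from the description.

```typescript
type Stage = Record<string, unknown>;

/** Build a $vectorSearch pipeline restricted to the selected files. */
function buildVectorSearchPipeline(
  queryVector: number[],
  fileIds: string[],
  limit = 5,
): Stage[] {
  // Keep the number of returned chunks within the 3-10 range described above.
  const k = Math.min(10, Math.max(3, limit));

  return [
    {
      // $vectorSearch must be the first stage of the pipeline.
      $vectorSearch: {
        index: "vector_index", // assumed index name
        path: "embedding",
        queryVector,
        numCandidates: k * 20, // assumed oversampling factor
        limit: k,
        // The filtered field must be declared as a filter field in the index.
        filter: { fileId: { $in: fileIds } },
      },
    },
    {
      $project: {
        _id: 0,
        text: 1,
        fileId: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}
```

With the official MongoDB driver, the result would be passed to `collection.aggregate(...)` on the chunks collection. The similarity function (cosine here) is configured in the vector index definition rather than in the query.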

Prerequisites

This project requires specific versions of Node.js and pnpm. Please check the engines and packageManager fields in package.json for the required versions.

Node.js

If you're using nvm, you can automatically switch to the correct Node.js version by running:

nvm use

This will read the version from the .nvmrc file and switch to it automatically.

pnpm

If you have Corepack enabled (included with Node.js 16.10+), pnpm will automatically use the version specified in the packageManager field of package.json. You can enable Corepack by running:

corepack enable

Starting Application with Turborepo 🚀

To run the infrastructure and all services, run:

pnpm start

Running Infra and Services Separately with Turborepo

  1. Start base infrastructure services in Docker containers:
    pnpm run infra
  2. Run the services with Turborepo:
    pnpm run turbo-start

Using Ship with Docker

To run the infrastructure and all services, execute:

pnpm run docker

Running Infra and Services Separately with Docker

  1. Start base infrastructure services in Docker containers:
    pnpm run infra
  2. Run the services you need:
    ./bin/start.sh api web

You can also run infrastructure services separately using the ./bin/start.sh bash script.
