A RAG (Retrieval Augmented Generation) system that enables intelligent conversations with documents. Upload PDF and DOCX files, ask questions, and get accurate answers based solely on your document content.
- Document Upload & Processing: Automatic extraction and processing of PDF and DOCX files
- Semantic Search: Vector-based search using embeddings to find relevant information by meaning, not just keywords
- RAG Implementation: Retrieval Augmented Generation ensures answers are based only on uploaded documents
- Real-time Status Updates: Live file processing status via Socket.IO
- Multi-file Support: Select and search across multiple documents simultaneously
- Transparent Responses: System explicitly states when information isn't available in documents
- AI/ML:
  - Google Gemini API (`@google/genai`) for content generation
  - Gemini Embeddings (`gemini-embedding-001`) for vector embeddings
- File Processing:
  - `pdf-parse` for PDF text extraction
  - `mammoth` for DOCX text extraction
- Storage: Local file storage (with abstraction for future cloud migration)
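To illustrate the embeddings dependency, here is a minimal sketch of calling `gemini-embedding-001` through `@google/genai`. The `GEMINI_API_KEY` environment variable is this example's assumption, not necessarily the project's convention:

```ts
import { GoogleGenAI } from "@google/genai";

// Assumes the API key is provided via GEMINI_API_KEY (this example's
// convention, not necessarily the project's).
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function main() {
  // Embed one piece of text with the same model the app uses.
  const res = await ai.models.embedContent({
    model: "gemini-embedding-001",
    contents: "What is retrieval augmented generation?",
  });
  // One embedding per input; `values` is the raw vector.
  console.log(res.embeddings?.[0]?.values?.length);
}

main();
```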
- File Upload: User uploads a PDF or DOCX file via drag-and-drop
- Raw Storage: File saved to local storage with UUID-based naming
- Text Extraction:
  - PDF: Extracted using `pdf-parse`
  - DOCX: Extracted using `mammoth`
- Chunking: Text split into semantic chunks (~2000 chars) by paragraphs
- Embedding Generation: Each chunk converted to a vector embedding using the Gemini Embeddings API (batched, 15 chunks at a time; sketched below)
- Storage:
  - Raw files stored locally
  - Chunks stored in the MongoDB `chunks` collection with embeddings
  - File metadata stored in the MongoDB `files` collection
- Status Updates: Real-time status updates (`processing` → `ready`/`error`) via Socket.IO
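A minimal sketch of the per-file pipeline under these assumptions; helper names and the `file:status` Socket.IO event are illustrative, not the project's actual identifiers:

```ts
import { readFile } from "node:fs/promises";
import pdfParse from "pdf-parse";
import mammoth from "mammoth";
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Extract plain text from the stored raw file based on its type.
async function extractText(path: string, type: "pdf" | "docx"): Promise<string> {
  const buffer = await readFile(path);
  if (type === "pdf") return (await pdfParse(buffer)).text;
  return (await mammoth.extractRawText({ buffer })).value;
}

// Split text into ~2000-character chunks along paragraph boundaries.
function chunkText(text: string, maxLen = 2000): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of text.split(/\n\s*\n/)) {
    if (current && current.length + para.length > maxLen) {
      chunks.push(current);
      current = "";
    }
    current = current ? current + "\n\n" + para : para;
  }
  if (current) chunks.push(current);
  return chunks;
}

// Embed chunks in batches of 15, as described in the pipeline above.
async function embedChunks(chunks: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += 15) {
    const res = await ai.models.embedContent({
      model: "gemini-embedding-001",
      contents: chunks.slice(i, i + 15),
    });
    vectors.push(...(res.embeddings ?? []).map((e) => e.values ?? []));
  }
  return vectors;
}

// After chunks and embeddings are persisted, a status event could be
// emitted, e.g.: io.emit("file:status", { fileId, status: "ready" });
```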
- File Selection: User selects files via checkboxes
- Query Embedding: User's question converted to embedding vector
- Vector Search: MongoDB `$vectorSearch` finds the top 3-10 most relevant chunks from the selected files
- Context Injection: Relevant chunks injected into the prompt as context
- Response Generation: Gemini model generates an answer using the provided context (see the query-time sketch after this list)
- Transparency: If the information isn't available, the system explicitly says so
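A query-time sketch under the same assumptions; `searchChunks` stands in for the `$vectorSearch` aggregation detailed in the next section, and the model name is an illustrative choice:

```ts
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Stand-in for the MongoDB $vectorSearch query (see the next sketch).
declare function searchChunks(
  queryVector: number[],
  fileIds: string[],
): Promise<{ text: string }[]>;

async function answerQuestion(question: string, fileIds: string[]): Promise<string> {
  // 1. Convert the user's question into an embedding vector.
  const embedRes = await ai.models.embedContent({
    model: "gemini-embedding-001",
    contents: question,
  });
  const queryVector = embedRes.embeddings?.[0]?.values ?? [];

  // 2. Retrieve the most relevant chunks from the selected files.
  const chunks = await searchChunks(queryVector, fileIds);

  // 3. Inject the chunks as context and generate a grounded answer.
  const context = chunks.map((c) => c.text).join("\n---\n");
  const res = await ai.models.generateContent({
    model: "gemini-2.5-flash", // illustrative model choice
    contents:
      "Answer using ONLY the context below. If the answer is not in the " +
      "context, say the information is not available in the documents.\n\n" +
      `Context:\n${context}\n\nQuestion: ${question}`,
  });
  return res.text ?? "";
}
```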
- Index: MongoDB vector search index on the `embedding` field
- Search Method: Semantic similarity using cosine distance
- Filtering: Results filtered by the selected `fileIds`
- Scoring: Results include `vectorSearchScore` for relevance ranking (an example aggregation follows this list)
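Concretely, the retrieval step can be expressed as a MongoDB Atlas aggregation along these lines. The index name, database name, and `fileId` filter field are assumptions of this sketch:

```ts
import { MongoClient } from "mongodb";

const client = new MongoClient(process.env.MONGODB_URI ?? "");
const chunks = client.db("rag").collection("chunks"); // db name is illustrative

// Find the most relevant chunks among the selected files.
async function searchChunks(queryVector: number[], fileIds: string[]) {
  return chunks
    .aggregate([
      {
        $vectorSearch: {
          index: "vector_index", // assumed index name
          path: "embedding",     // field holding the chunk embedding
          queryVector,
          numCandidates: 100,    // candidates scored before ranking
          limit: 10,             // top-k chunks returned
          filter: { fileId: { $in: fileIds } }, // restrict to selected files
        },
      },
      // Surface the similarity score used for relevance ranking.
      { $project: { text: 1, fileId: 1, score: { $meta: "vectorSearchScore" } } },
    ])
    .toArray();
}
```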
This project requires specific versions of Node.js and pnpm. Please check the `engines` and `packageManager` fields in `package.json` for the required versions.
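For orientation, those fields look something like the following; the version numbers here are placeholders, so use the values actually in `package.json`:

```json
{
  "engines": {
    "node": ">=20"
  },
  "packageManager": "pnpm@9.0.0"
}
```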
If you're using nvm, you can automatically switch to the correct Node.js version by running `nvm use`, which reads the version from the `.nvmrc` file and switches to it automatically.
If you have Corepack enabled (included with Node.js 16.10+), pnpm will automatically use the version specified in the `packageManager` field of `package.json`. You can enable Corepack by running `corepack enable`.

To run the infrastructure and all services, just run `pnpm start`. Under the hood this will:

- Start base infrastructure services in Docker containers: `pnpm run infra`
- Run the services with Turborepo: `pnpm run turbo-start`
Alternatively, to run the infrastructure and all services, execute `pnpm run docker`, or run the steps manually:

- Start base infrastructure services in Docker containers: `pnpm run infra`
- Run the services you need: `./bin/start.sh api web`
You can also run infrastructure services separately using the `./bin/start.sh` bash script.