krishnakoushik225/DocuMind


DocuMind 📄

Explainable RAG Document Assistant · GPT-4 + Pinecone + FastAPI + React

Python FastAPI React Pinecone OpenAI LlamaIndex License: MIT

DocuMind is a production-grade explainable RAG system: upload any PDF, ask natural-language questions, and get answers grounded strictly in your document content, with full retrieval transparency, per-document scoped queries, and LlamaIndex-powered response evaluation.


🎬 Demo

Home Page

[Screenshot: Home Page]

Citation-Grounded Response

[Screenshot: Sample Query]

Trustworthy AI Output

[Screenshot: Trustworthy Response]

Explainability & Retrieval Transparency

[Screenshot: Explainability]


🎯 What Makes DocuMind Different from a Basic RAG Chatbot

Most RAG apps do: question → embed → top-k chunks → generate

DocuMind does:

PDF upload
  → PyPDF2 extraction + regex cleaning
  → CharacterTextSplitter (configurable chunk_size + overlap)
  → text-embedding-ada-002 → Pinecone upsert (with doc_id metadata)
  → per-document scoped query (doc_id + chunk_id filtering)
  → context-constrained GPT-4 prompt (no hallucination beyond retrieved evidence)
  → answer + retrieved_chunks returned together
  → LlamaIndex RelevancyEvaluator scoring (no labeled dataset needed)

| Dimension | Basic RAG | DocuMind |
| --- | --- | --- |
| Retrieval scope | Global index, all documents | Per-document scoped via doc_id filter |
| Hallucination prevention | None | LLM prompt hard-constrained to retrieved context |
| Empty retrieval handling | Hallucinates or crashes | Returns "No relevant information found." explicitly |
| Transparency | Answer only | Answer + retrieved chunks returned together |
| Evaluation | None | LlamaIndex RelevancyEvaluator, no ground truth needed |
| Document management | No | GET /api/list_documents, full document registry |
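
The cleaning and chunking stages of the pipeline can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names, the character allow-list in the regex, and the default chunk_size and overlap are all assumptions, and the splitter is a simplified fixed-window stand-in for LangChain's CharacterTextSplitter (which splits on a separator first).

```python
import re


def clean_text(raw: str) -> str:
    """Drop characters outside a small allow-list, then collapse whitespace.

    The allow-list is an assumption made for this sketch.
    """
    kept = re.sub(r"[^A-Za-z0-9\s.,;:!?$%()-]", " ", raw)
    return re.sub(r"\s+", " ", kept).strip()


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Slice text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of embedding some text twice.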

🏗️ Architecture

graph TD
    A[PDF Upload<br/>POST /api/upload] --> B[PyPDF2 Text Extraction]
    B --> C[Text Cleaning<br/>regex: strip non-alphabetic]
    C --> D[CharacterTextSplitter<br/>separator=newline · chunk_size · overlap]
    D --> E[text-embedding-ada-002<br/>OpenAI Embeddings]
    E --> F[(Pinecone Vector Store<br/>doc_id metadata per chunk)]

    G[User Query<br/>POST /api/query] --> H[Embed Query<br/>get_embedding_model]
    H --> I[Pinecone Similarity Search<br/>top_k · doc_id filter · chunk_id filter]
    F --> I
    I --> J{Chunks retrieved?}
    J -->|No| K[Return: No relevant information found<br/>no hallucination]
    J -->|Yes| L[Context Formation<br/>join retrieved chunks]
    L --> M[GPT-4 Generation<br/>constrained to retrieved context only]
    M --> N[QueryResponse<br/>answer + retrieved_chunks]

    N --> O[LlamaIndex RelevancyEvaluator<br/>response quality scoring]

    P[GET /api/list_documents] --> Q[Pinecone Document Registry<br/>all uploaded doc_ids]

API Surface

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/upload | POST | Upload PDF → extract → chunk → embed → store in Pinecone |
| /api/query | POST | Query with optional doc_id + chunk_id scoping → grounded answer |
| /api/list_documents | GET | List all documents currently indexed in Pinecone |
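
To make the /api/query contract concrete, here is a hedged sketch of the request and response shapes. The README names QueryRequest and QueryResponse as Pydantic models and says the response carries answer and retrieved_chunks; everything else here is an assumption: the question field name, the types of doc_id and chunk_id, and the top_k default. Standard-library dataclasses are used in place of Pydantic to keep the sketch dependency-free.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class QueryRequest:
    question: str                    # hypothetical field name
    doc_id: Optional[str] = None     # optional: restrict retrieval to one document
    chunk_id: Optional[str] = None   # optional: restrict retrieval to one chunk
    top_k: int = 5                   # assumed default


@dataclass
class QueryResponse:
    answer: str
    retrieved_chunks: list[str] = field(default_factory=list)


def to_json_body(request: QueryRequest) -> str:
    """Serialize a request, omitting optional filters that were not set."""
    payload = {k: v for k, v in asdict(request).items() if v is not None}
    return json.dumps(payload)
```

Leaving unset filters out of the body keeps an unscoped query distinguishable from one that explicitly targets a document.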

🔬 Worked Example

Upload: financial_report_2024.pdf → 847 chunks extracted and indexed

Query: "What was the net revenue in Q3?"

Pipeline trace:

[Upload]     → PyPDF2 extracts 42 pages → cleaned → split into 847 chunks
             → each chunk embedded via text-embedding-ada-002
             → upserted to Pinecone with doc_id="financial_report_2024.pdf"

[Query]      → "What was the net revenue in Q3?" embedded
             → Pinecone similarity search: top_k=5, doc_id filter active
             → 5 chunks retrieved from financial_report_2024.pdf only

[Prompt]     → "Using the following document content, provide a concise
               and accurate answer... Do not speculate beyond the provided content."
             → GPT-4 constrained strictly to retrieved context

[Response]   → answer: "Q3 net revenue was $4.2B, up 12% YoY per page 18."
             → retrieved_chunks: [chunk_1, chunk_2, ..., chunk_5] returned alongside

[Evaluation] → LlamaIndex RelevancyEvaluator scores response against retrieved context

What happens on a question the document can't answer:

[Query]      → "Who is the CEO of Apple?"
[Retrieval]  → 0 relevant chunks found for doc_id="financial_report_2024.pdf"
[Response]   → "No relevant information found."   ← never hallucinates
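
The empty-retrieval path in the trace above can be sketched as a small guard around the generation call. This is an illustration under assumptions, not the code in query.py: answer_query and its retrieve and generate callables are hypothetical names standing in for the Pinecone search and the GPT-4 call.

```python
from typing import Callable

NO_RESULT_MESSAGE = "No relevant information found."


def answer_query(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> dict:
    """Return an answer plus its evidence, or the fallback when retrieval is empty."""
    chunks = retrieve(question)
    if not chunks:
        # Short-circuit: the LLM is never called without evidence.
        return {"answer": NO_RESULT_MESSAGE, "retrieved_chunks": []}
    return {"answer": generate(question, chunks), "retrieved_chunks": chunks}
```

Because the fallback returns before generate is invoked, an empty retrieval costs no LLM call and cannot produce a generated answer.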

⚙️ Engineering Decisions

Per-document query scoping via Pinecone metadata filtering

Naively querying a shared Pinecone index returns chunks from all uploaded documents, so answers bleed across files. DocuMind stores doc_id as metadata on every chunk at upsert time and applies it as a filter on every similarity search. Optional chunk_id filtering enables pinpoint retrieval of specific sections. This makes multi-document deployments correct by design rather than an afterthought.
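
A minimal sketch of how such a scoping filter might be assembled, using Pinecone's metadata filter syntax ($eq, with multiple keys combined as a logical AND). The helper name build_metadata_filter is hypothetical, and the metadata keys are taken from the README's description rather than from the repository code.

```python
from typing import Optional


def build_metadata_filter(doc_id: Optional[str] = None,
                          chunk_id: Optional[str] = None) -> Optional[dict]:
    """Build a Pinecone-style metadata filter; None means an unfiltered query."""
    clauses = {}
    if doc_id is not None:
        clauses["doc_id"] = {"$eq": doc_id}
    if chunk_id is not None:
        clauses["chunk_id"] = {"$eq": chunk_id}
    return clauses or None
```

The resulting dict would be passed as the filter argument of the Pinecone client's index.query call, alongside the query vector and top_k.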

Hard-constrained prompt to eliminate hallucination

Standard RAG prompts say "use this context to answer." DocuMind's prompt explicitly says:

"If the document does not contain relevant information, state that explicitly.
Do not speculate beyond the provided content."

Combined with the graceful empty-retrieval path (retrieved_chunks == [] → return "No relevant information found."), the system has two independent hallucination guards: one at the retrieval layer and one at the generation layer.
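
The generation-layer guard amounts to a prompt template that places the instructions ahead of the retrieved chunks. The sketch below reuses only the sentences quoted in this README; the full prompt wording is not shown here, so the layout, the chunk numbering, and the build_prompt name are assumptions.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a context-constrained prompt from retrieved chunks."""
    # Number the chunks so an answer can point back at its evidence.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Using the following document content, provide a concise and accurate answer.\n"
        "If the document does not contain relevant information, state that explicitly.\n"
        "Do not speculate beyond the provided content.\n\n"
        f"Document content:\n{context}\n\n"
        f"Question: {question}"
    )
```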

Three-router modular FastAPI architecture

Upload, query, and document management are split into independent routers (upload.py, query.py, list_documents.py), each registered under /api. Each pipeline stage can therefore be tested, extended, or replaced independently: adding a new embedding model or chunking strategy touches only one module, not the entire backend.

Configurable CORS with environment-driven origins

Frontend URL and allowed origins are read from config.py (ALLOWED_ORIGINS, FRONTEND_URL) rather than hardcoded, so the same backend runs in local dev, staging, and production without code changes. allow_credentials=True with allow_methods=["*"] accepts credentialed cross-origin requests for every HTTP method, including the multipart POST used for file uploads.
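
One way the origin list could be derived from the environment is sketched below. The comma-separated format for ALLOWED_ORIGINS, the fallback to FRONTEND_URL, and the load_allowed_origins name are assumptions for illustration; the README only states that both values come from config.py.

```python
import os
from typing import Mapping


def load_allowed_origins(env: Mapping[str, str] = os.environ) -> list[str]:
    """Read comma-separated origins, falling back to the frontend URL."""
    raw = env.get("ALLOWED_ORIGINS", "")
    # Trim whitespace and trailing slashes so origins match the browser's Origin header.
    origins = [o.strip().rstrip("/") for o in raw.split(",") if o.strip()]
    if not origins:
        origins = [env.get("FRONTEND_URL", "http://localhost:5173").rstrip("/")]
    return origins
```

The returned list would then be handed to FastAPI's CORSMiddleware as allow_origins. An explicit list matters here because browsers reject a wildcard origin on credentialed requests.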

LlamaIndex evaluation without labeled ground truth

Standard RAG evaluation requires a labeled QA dataset to measure answer correctness. DocuMind uses LlamaIndex's RelevancyEvaluator, an LLM-judged check of whether the generated response and its retrieved context are relevant to the query, so no labeled dataset is required. This makes evaluation practical for any arbitrary PDF without upfront annotation work.


🔒 Trustworthy AI by Design

DocuMind was built with four explainability principles as first-class requirements:

Evidence-backed answers: GPT-4 is prompted to answer only from retrieved chunks. retrieved_chunks are returned in the API response alongside the answer so every claim is traceable to a source.

Retrieval transparency: The query pipeline surfaces which chunks were retrieved, in what order, and from which document. The user sees the evidence, not just the conclusion.

Graceful uncertainty: When retrieval finds nothing relevant, the system says so explicitly rather than generating a plausible-sounding but unsupported answer.

Scoped context: Per-document filtering ensures answers are drawn from the correct source document, not contaminated by other uploaded files in the shared index.


🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • OpenAI API key
  • Pinecone API key (free tier available)

Backend Setup

git clone https://github.com/krishnakoushik225/DocuMind
cd DocuMind/backend
pip install -r requirements.txt

Create .env:

OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX=documind
FRONTEND_URL=http://localhost:5173

Run:

uvicorn main:app --reload
# API running at http://127.0.0.1:8000
# Docs at http://127.0.0.1:8000/docs

Frontend Setup

cd ../ui-v1
npm install
npm run dev
# Running at http://localhost:5173

🛠️ Tech Stack

| Layer | Technology | Role |
| --- | --- | --- |
| Frontend | React + Vite | Chat-based document interface |
| Backend | FastAPI (async) | Three-router API: upload / query / documents |
| PDF Processing | PyPDF2 | Structured text extraction from PDFs |
| Text Splitting | LangChain CharacterTextSplitter | Configurable chunking with overlap |
| Embedding | OpenAI text-embedding-ada-002 | Semantic vector representation |
| Vector DB | Pinecone | Similarity search with metadata filtering |
| LLM | OpenAI GPT-4 | Context-constrained answer generation |
| Retrieval | LlamaIndex | Similarity search + RelevancyEvaluator |
| Config | Environment-driven (config.py) | CORS origins, chunk params, upload dir |

📁 Project Structure

DocuMind/
├── backend/
│   ├── main.py                    # FastAPI entrypoint: 3 routers, CORS config
│   ├── config.py                  # Env-driven: API keys, chunk_size, overlap, CORS
│   ├── routes/
│   │   ├── upload.py              # POST /api/upload: extract, chunk, embed, store
│   │   ├── query.py               # POST /api/query: retrieve, constrain, generate
│   │   └── list_documents.py      # GET /api/list_documents: Pinecone doc registry
│   ├── services/
│   │   ├── vector_store.py        # Pinecone upsert, similarity search, doc listing
│   │   ├── embedding_service.py   # text-embedding-ada-002 wrapper
│   │   └── llm_service.py         # GPT-4 client wrapper
│   ├── models/
│   │   └── document.py            # QueryRequest, QueryResponse Pydantic models
│   └── requirements.txt
├── ui-v1/
│   ├── src/
│   │   ├── App.jsx                # Chat interface
│   │   └── App.css
│   ├── vite.config.ts
│   └── package.json
├── Home Page.png
├── Sample Query.png
├── Picture1.png
├── Picture2.png
└── README.md

🔭 Roadmap

  • Multi-document cross-reference queries (single question, multiple PDFs)
  • Conversation memory: follow-up questions with session context
  • Streaming token output to the frontend in real time
  • OCR support for scanned PDFs (Tesseract / AWS Textract)
  • Agentic upgrade: multi-hop reasoning across document sections
  • Document comparison mode: diff two PDFs via natural language
  • Export annotated answers as PDF with highlighted source passages
  • Hybrid retrieval: BM25 + semantic search for keyword-dense documents

📄 License

MIT: free to use and build on.


Built by Krishna Koushik Unnam
