Transform unstructured documents into interactive, searchable knowledge systems
IntelliDoc is an intelligent document understanding system that extracts, analyzes, and queries information from PDFs, scanned documents, images, and multilingual reports using OCR, AI, and Retrieval-Augmented Generation (RAG).
Traditional document systems fail with:
- ❌ Scanned and handwritten documents
- ❌ Multilingual content (especially Indian languages)
- ❌ Context-aware search
- ❌ Answers grounded in the source document
IntelliDoc solves these with:
- ✅ Smart OCR with auto-detection
- ✅ Support for 10+ languages
- ✅ Document-grounded RAG pipeline
- ✅ 3x faster parallel processing
# Quick Start
streamlit run app.py

👤 User: "Summarize this document in Kannada"
🤖 IntelliDoc: [Provides Kannada summary with page references]
👤 User: "What are the key dates mentioned?"
🤖 IntelliDoc: [Lists dates with exact page numbers]
👤 User: "Translate the main points to Hindi"
🤖 IntelliDoc: [Translates with source context]
- Python 3.8+
- Tesseract OCR
- Groq API Key
git clone https://github.com/ramyadjoshi/IntelliDoc-AI-Powered-Intelligent-Document-Analysis-System.git
cd IntelliDoc-AI-Powered-Intelligent-Document-Analysis-System
pip install -r requirements.txt

Windows:
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Install to: C:\Program Files\Tesseract-OCR

Linux:
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-hin tesseract-ocr-kan tesseract-ocr-tam

macOS:
brew install tesseract
brew install tesseract-lang

Create a .env file:
GROQ_API_KEY=your_groq_api_key_here
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe # Windows
# TESSERACT_CMD=/usr/bin/tesseract # Linux (on macOS, use the path printed by `which tesseract`)

streamlit run app.py

Visit http://localhost:8501 in your browser.
- PDFs - Regular or scanned
- Images - JPG, PNG, TIFF, BMP
- Processing Modes:
  - 🚀 Smart Mode (auto-detects OCR need)
  - 🔍 Force OCR (for handwritten/scanned docs)
- ☑️ Extract Tables
- ☑️ Detect Charts/Diagrams
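
Smart Mode decides per page whether OCR is needed. The README does not show the actual rule, so the following is only a minimal sketch of one common heuristic: treat a page as scanned when its embedded text layer yields very little text. The function name `needs_ocr` and the `min_chars` threshold are assumptions for illustration, not IntelliDoc's code.

```python
def needs_ocr(embedded_text: str, force_ocr: bool = False, min_chars: int = 50) -> bool:
    """Return True when a page should be sent to OCR.

    embedded_text: text already read from the page's text layer.
    force_ocr:     mirrors the "Force OCR" mode and skips the heuristic.
    min_chars:     hypothetical threshold, not taken from IntelliDoc.
    """
    if force_ocr:
        return True
    # Count only non-whitespace characters so blank layout text is ignored.
    visible = sum(1 for ch in embedded_text if not ch.isspace())
    return visible < min_chars


pages = [
    "",        # scanned page, no text layer
    "   \n ",  # text layer containing only whitespace
    "Quarterly report for FY 2024 with revenue figures and commentary on growth.",
]
decisions = [needs_ocr(text) for text in pages]
```

A page that already carries enough embedded text skips OCR entirely, which is where the time saving of Smart Mode would come from.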
Examples:
- "Summarize in Hindi"
- "What is the main topic?"
- "List all dates mentioned"
- "Translate key points to Telugu"
- "Extract numerical values"
- 📥 Chat history (TXT/JSON)
- 📄 Extracted text
- 📊 Extracted tables
┌─────────────────────────────────────────────────────────────┐
│ INDEXING PHASE │
├─────────────────────────────────────────────────────────────┤
│ PDF/Image → OCR → Text Extraction → Chunking → TF-IDF │
│ ↓ ↓ │
│ Tables/Charts Detection FAISS Vector Store │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ QUERY PHASE │
├─────────────────────────────────────────────────────────────┤
│ User Query → TF-IDF Vector → FAISS Search → Top-K Chunks │
│ ↓ ↓ │
│ Language Detection Context + Query → LLM │
│ ↓ ↓ │
│ Translation Request? Grounded Answer │
└─────────────────────────────────────────────────────────────┘
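
The query phase above can be sketched without any dependencies. This example is an assumption-laden stand-in: it computes TF-IDF weights and cosine similarity in plain Python rather than using scikit-learn and FAISS as the project does, and the `min_score` threshold and all function names are hypothetical. Only `top_k=8` comes from this README.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer; real multilingual text needs script-aware handling.
    return re.findall(r"\w+", text.lower())


def normalize(vec):
    # Scale a sparse vector to unit length so a dot product equals cosine similarity.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(chunks):
    docs = [Counter(tokenize(c)) for c in chunks]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + f)) + 1.0 for t, f in df.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in d.items()}) for d in docs]
    return idf, vectors


def search(query, idf, vectors, top_k=8, min_score=0.1):
    counts = Counter(tokenize(query))
    q = normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})
    scored = []
    for i, v in enumerate(vectors):
        score = sum(w * v.get(t, 0.0) for t, w in q.items())
        if score >= min_score:  # similarity threshold filtering
            scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in scored[:top_k]]


chunks = [
    "The contract was signed on 12 March 2021.",
    "Payment is due within 30 days of the invoice.",
    "The warranty covers parts and labour for two years.",
]
idf, vectors = build_index(chunks)
results = search("When was the contract signed?", idf, vectors)
```

`results` holds `(chunk_index, score)` pairs, best match first. Those chunks, together with the query, are what gets passed to the LLM.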
| Component | Purpose | Technology |
|---|---|---|
| Frontend | User interface | Streamlit |
| PDF Processing | Text/image extraction | PyMuPDF (Fitz) |
| OCR Engine | Scanned document reading | Tesseract OCR |
| Image Processing | Preprocessing & chart detection | OpenCV, Pillow |
| Vectorization | Text to numerical embeddings | TF-IDF |
| Search Index | Fast similarity search | FAISS |
| LLM | Question answering | Llama 3.3 (Groq API) |
| Language Detection | Auto-detect query language | langdetect |
streamlit>=1.28.0 # Web UI
PyMuPDF>=1.23.0 # PDF processing
pytesseract>=0.3.10 # OCR
opencv-python>=4.8.0 # Image processing
Pillow>=10.0.0 # Image handling
scikit-learn>=1.3.0 # TF-IDF vectorization
faiss-cpu>=1.7.4 # Vector search
groq>=0.4.0 # LLM API
langdetect>=1.0.9 # Language detection
pandas>=2.0.0 # Table handling

🌐 English • Kannada • Hindi • Tamil • Telugu • Marathi • Malayalam • Gujarati • Bengali • Punjabi
| Metric | Value |
|---|---|
| Processing Speed | 3x faster than sequential |
| OCR Languages | 10 (English + 9 Indian languages) |
| Max PDF Pages | No fixed limit (pages processed in parallel) |
| Chunk Size | 1500 chars (300 overlap) |
| Retrieval Method | FAISS + TF-IDF |
| Context Window | 4500 chars |
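
To make the chunk size and overlap figures concrete, here is a minimal sketch of fixed-size character chunking with the values from the table. The function name `chunk_text` is hypothetical, and the project's own chunking may also respect sentence or paragraph boundaries, which this sketch does not.

```python
def chunk_text(text, chunk_size=1500, overlap=300):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at
    one boundary still appears whole in the neighbouring chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(4000))
chunks = chunk_text(sample)
```

With these settings each new chunk starts 1200 characters after the previous one.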
- ⚡ Parallel page processing (4 workers)
- 🎯 Smart OCR detection (avoids unnecessary OCR)
- 🧩 Intelligent text chunking
- 🔍 Top-K retrieval (K=8)
- 📈 Similarity threshold filtering
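
Parallel page processing with four workers can be sketched with the standard library's `ThreadPoolExecutor`. The `process_page` function below is a hypothetical placeholder for the real per-page extraction and OCR step, and any speedup depends on the machine and the documents.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # value taken from this README


def process_page(page_number):
    # Placeholder: the real step would extract text or run OCR on the page.
    return page_number, f"text of page {page_number}"


def process_document(page_count, max_workers=MAX_WORKERS):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so pages stay in sequence
        # even when they finish out of order.
        return list(pool.map(process_page, range(1, page_count + 1)))


results = process_document(10)
```

Threads suit this workload when the heavy step runs outside the Python interpreter, as it does when pytesseract launches the Tesseract binary as a subprocess.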
Developers: Ramya D Joshi • Shakuntala K Pawar
Mentor: Dr. R. H. Goudar
Institution: Visvesvaraya Technological University, Belagavi, Karnataka.
Project Phase: Major Project
- Smart OCR Detection - Reduces processing time by auto-detecting OCR need
- Multilingual RAG - Document QA system focused on Indian languages
- Parallel Processing - 3x speedup using ThreadPoolExecutor
- Document-Grounded Answers - Reduces hallucination by restricting the LLM to retrieved document context
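
Document grounding usually comes down to how the prompt is assembled. The sketch below packs retrieved chunks with their page numbers into a context of at most 4500 characters (the context window listed in this README) and instructs the model to answer only from it. The function name `build_prompt` and the prompt wording are assumptions, and no LLM call is shown.

```python
MAX_CONTEXT_CHARS = 4500  # context window listed in this README


def build_prompt(question, chunks, max_chars=MAX_CONTEXT_CHARS):
    """Build a grounded prompt.

    chunks: list of (page_number, text) pairs, most relevant first.
    The budget counts the chunk entries only, not the separators between them.
    """
    parts = []
    used = 0
    for page, text in chunks:
        entry = f"[Page {page}] {text.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return (
        "Answer using only the context below and cite page numbers. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


prompt = build_prompt(
    "What is the due date?",
    [(3, "Payment is due on 1 May."), (7, "x" * 5000)],
)
```

Keeping page labels next to each chunk is what lets the answer cite page numbers, as in the demo conversation.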
GROQ_API_KEY=your_api_key # Required: Groq API for LLM
TESSERACT_CMD=path/to/tesseract # Required: Tesseract binary path
TESSDATA_PREFIX=path/to/tessdata # Optional: Tesseract language data

OCR_LANGS = "eng+kan+hin+tam+tel+mar+mal+guj+ben+pan"
TOP_K = 8 # Number of chunks to retrieve
CHUNK_SIZE = 1500 # Characters per chunk
CHUNK_OVERLAP = 300 # Overlap between chunks
MAX_WORKERS = 4 # Parallel processing threads

1. Tesseract not found
# Windows: Install from https://github.com/UB-Mannheim/tesseract/wiki
# Linux: sudo apt-get install tesseract-ocr tesseract-ocr-[lang]
# Mac: brew install tesseract tesseract-lang

2. FAISS installation error
pip install faiss-cpu # For CPU
# pip install faiss-gpu # For GPU (if available)

3. Groq API errors
# Check API key in .env file
# Verify API quota at https://console.groq.com

4. Language data missing
# Download language data from:
# https://github.com/tesseract-ocr/tessdata
# Place in TESSDATA_PREFIX directory

- Support for more document formats (DOCX, PPTX, XLSX)
- Advanced chart/table understanding with vision models
- Real-time collaborative document analysis
- Cloud deployment (AWS/Azure/GCP)
- Mobile app version
- Batch processing API
- Enhanced handwriting recognition
- Custom fine-tuned models for domain-specific docs
This project is licensed under the MIT License - see the LICENSE file for details.
- Tesseract OCR - Open-source OCR engine, long sponsored by Google
- FAISS - Meta's similarity search library
- Groq - Ultra-fast LLM inference
- Streamlit - Rapid web app framework
- PyMuPDF - Fast PDF processing
Ramya D Joshi - GitHub Profile
Shakuntala K Pawar - GitHub Profile
Project Repository: IntelliDoc