Skip to content

ramyadjoshi/IntelliDoc-AI-Powered-Intelligent-Document-Analysis-System

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

📄 IntelliDoc: AI-Powered Multilingual Document Analysis System

Python Streamlit License OCR AI

Transform unstructured documents into interactive, searchable knowledge systems

FeaturesDemoInstallationUsageArchitectureTechnologies


🎯 Overview

IntelliDoc is an intelligent document understanding system that extracts, analyzes, and queries information from PDFs, scanned documents, images, and multilingual reports using OCR, AI, and Retrieval-Augmented Generation (RAG).

Why IntelliDoc?

Traditional document systems fail with:

  • ❌ Scanned and handwritten documents
  • ❌ Multilingual content (especially Indian languages)
  • ❌ Context-aware search
  • ❌ Hallucination-free answers

IntelliDoc solves these with:

  • ✅ Smart OCR with auto-detection
  • ✅ Support for 10+ languages
  • ✅ Document-grounded RAG pipeline
  • ✅ 3x faster parallel processing

✨ Features

🚀 Core Capabilities

  • Smart OCR Detection - Auto-detects when OCR is needed
  • Parallel Processing - 3x faster with multi-threaded extraction
  • Multilingual Support - English, Kannada, Hindi, Tamil, Telugu, Marathi, Malayalam, Gujarati, Bengali, Punjabi
  • RAG Pipeline - Document-grounded answers (no hallucinations)

🔥 Advanced Features

  • Table Extraction - Automatic table detection and parsing
  • Chart Detection - Identify flowcharts, diagrams, graphs
  • Audio I/O - Voice input and text-to-speech output
  • Real-time Translation - Translate queries and responses
  • Page References - Pinpoint source pages for answers

🎬 Demo

# Quick Start
streamlit run app.py

Sample Interactions

👤 User: "Summarize this document in Kannada"
🤖 IntelliDoc: [Provides Kannada summary with page references]

👤 User: "What are the key dates mentioned?"
🤖 IntelliDoc: [Lists dates with exact page numbers]

👤 User: "Translate the main points to Hindi"
🤖 IntelliDoc: [Translates with source context]

🛠️ Installation

Prerequisites

  • Python 3.8+
  • Tesseract OCR
  • Groq API Key

Step 1: Clone Repository

git clone https://github.com/ramyadjoshi/IntelliDoc-AI-Powered-Intelligent-Document-Analysis-System.git
cd IntelliDoc-AI-Powered-Intelligent-Document-Analysis-System

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Install Tesseract OCR

Windows:

# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Install to: C:\Program Files\Tesseract-OCR

Linux:

sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-hin tesseract-ocr-kan tesseract-ocr-tam

macOS:

brew install tesseract
brew install tesseract-lang

Step 4: Configure Environment

Create a .env file:

GROQ_API_KEY=your_groq_api_key_here
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe  # Windows
# TESSERACT_CMD=/usr/bin/tesseract  # Linux/Mac

Step 5: Run Application

streamlit run app.py

Visit http://localhost:8501 in your browser.


📖 Usage

1. Upload Documents

  • PDFs - Regular or scanned
  • Images - JPG, PNG, TIFF, BMP
  • Processing Modes:
    • 🚀 Smart Mode (auto-detects OCR need)
    • 🔍 Force OCR (for handwritten/scanned docs)

2. Advanced Options

  • ☑️ Extract Tables
  • ☑️ Detect Charts/Diagrams

3. Ask Questions

Examples:
- "Summarize in Hindi"
- "What is the main topic?"
- "List all dates mentioned"
- "Translate key points to Telugu"
- "Extract numerical values"

4. Download Results

  • 📥 Chat history (TXT/JSON)
  • 📄 Extracted text
  • 📊 Extracted tables

🏗️ Architecture

RAG Pipeline

┌─────────────────────────────────────────────────────────────┐
│                     INDEXING PHASE                          │
├─────────────────────────────────────────────────────────────┤
│  PDF/Image → OCR → Text Extraction → Chunking → TF-IDF      │
│       ↓                                            ↓        │
│  Tables/Charts Detection              FAISS Vector Store    │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      QUERY PHASE                            │
├─────────────────────────────────────────────────────────────┤
│  User Query → TF-IDF Vector → FAISS Search → Top-K Chunks   │
│       ↓                                            ↓        │
│  Language Detection              Context + Query → LLM      │
│       ↓                                            ↓        │
│  Translation Request?                    Grounded Answer    │
└─────────────────────────────────────────────────────────────┘

Key Components

Component Purpose Technology
Frontend User interface Streamlit
PDF Processing Text/image extraction PyMuPDF (Fitz)
OCR Engine Scanned document reading Tesseract OCR
Image Processing Preprocessing & chart detection OpenCV, Pillow
Vectorization Text to numerical embeddings TF-IDF
Search Index Fast similarity search FAISS
LLM Question answering Llama 3.3 (Groq API)
Language Detection Auto-detect query language langdetect

🔧 Technologies

Core Stack

streamlit>=1.28.0      # Web UI
PyMuPDF>=1.23.0        # PDF processing
pytesseract>=0.3.10    # OCR
opencv-python>=4.8.0   # Image processing
Pillow>=10.0.0         # Image handling
scikit-learn>=1.3.0    # TF-IDF vectorization
faiss-cpu>=1.7.4       # Vector search
groq>=0.4.0            # LLM API
langdetect>=1.0.9      # Language detection
pandas>=2.0.0          # Table handling

Supported Languages

🌐 EnglishKannadaHindiTamilTeluguMarathiMalayalamGujaratiBengaliPunjabi


📊 Performance

Metric Value
Processing Speed 3x faster than sequential
OCR Languages 10+ Indian languages
Max PDF Pages Unlimited (parallel processing)
Chunk Size 1500 chars (300 overlap)
Retrieval Method FAISS + TF-IDF
Context Window 4500 chars

Optimization Features

  • ⚡ Parallel page processing (4 workers)
  • 🎯 Smart OCR detection (avoids unnecessary OCR)
  • 🧩 Intelligent text chunking
  • 🔍 Top-K retrieval (K=8)
  • 📈 Similarity threshold filtering

🎓 Academic Project

Developers: Ramya D Joshi • Shakuntala K Pawar
Mentor: Dr. R. H. Goudar
Institution: Visvesvaraya Technological University, Belagavi , Karnataka.
Project Phase: Major Project

Research Contributions

  1. Smart OCR Detection - Reduces processing time by auto-detecting OCR need
  2. Multilingual RAG - First Indian language-focused document QA system
  3. Parallel Processing - 3x speedup using ThreadPoolExecutor
  4. Document-Grounded Answers - Zero hallucination through RAG

📝 Configuration

Environment Variables

GROQ_API_KEY=your_api_key          # Required: Groq API for LLM
TESSERACT_CMD=path/to/tesseract    # Required: Tesseract binary path
TESSDATA_PREFIX=path/to/tessdata   # Optional: Tesseract language data

Customizable Parameters

OCR_LANGS = "eng+kan+hin+tam+tel+mar+mal+guj+ben+pan"
TOP_K = 8                  # Number of chunks to retrieve
CHUNK_SIZE = 1500          # Characters per chunk
CHUNK_OVERLAP = 300        # Overlap between chunks
MAX_WORKERS = 4            # Parallel processing threads

🐛 Troubleshooting

Common Issues

1. Tesseract not found

# Windows: Install from https://github.com/UB-Mannheim/tesseract/wiki
# Linux: sudo apt-get install tesseract-ocr tesseract-ocr-[lang]
# Mac: brew install tesseract tesseract-lang

2. FAISS installation error

pip install faiss-cpu  # For CPU
# pip install faiss-gpu  # For GPU (if available)

3. Groq API errors

# Check API key in .env file
# Verify API quota at https://console.groq.com

4. Language data missing

# Download language data from:
# https://github.com/tesseract-ocr/tessdata
# Place in TESSDATA_PREFIX directory

🚀 Future Enhancements

  • Support for more document formats (DOCX, PPTX, XLSX)
  • Advanced chart/table understanding with vision models
  • Real-time collaborative document analysis
  • Cloud deployment (AWS/Azure/GCP)
  • Mobile app version
  • Batch processing API
  • Enhanced handwriting recognition
  • Custom fine-tuned models for domain-specific docs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • Tesseract OCR - Google's open-source OCR engine
  • FAISS - Meta's similarity search library
  • Groq - Ultra-fast LLM inference
  • Streamlit - Rapid web app framework
  • PyMuPDF - Fast PDF processing

📧 Contact

Ramya D Joshi - GitHub Profile
Shakuntala K Pawar - GitHub Profile

Project Repository: IntelliDoc


⭐ If you find IntelliDoc useful, please consider giving it a star!

Visitors

About

IntelliDoc is an intelligent document understanding system that helps users extract, analyze, and query information from PDFs, scanned documents, images, and multilingual reports using OCR, AI, and Retrieval-Augmented Generation (RAG)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages