📄 IntelliDoc: AI-Powered Multilingual Document Analysis System

Transform unstructured documents into interactive, searchable knowledge systems

Features • Demo • Installation • Usage • Architecture • Technologies

🎯 Overview

IntelliDoc is an intelligent document understanding system that extracts, analyzes, and queries information from PDFs, scanned documents, images, and multilingual reports using OCR, AI, and Retrieval-Augmented Generation (RAG).

Why IntelliDoc?

Traditional document systems fail with:

❌ Scanned and handwritten documents
❌ Multilingual content (especially Indian languages)
❌ Context-aware search
❌ Hallucination-free answers

IntelliDoc solves these with:

✅ Smart OCR with auto-detection
✅ Support for 10+ languages
✅ Document-grounded RAG pipeline
✅ 3x faster parallel processing

✨ Features

🚀 Core Capabilities

Smart OCR Detection - Auto-detects when OCR is needed
Parallel Processing - 3x faster with multi-threaded extraction
Multilingual Support - English, Kannada, Hindi, Tamil, Telugu, Marathi, Malayalam, Gujarati, Bengali, Punjabi
RAG Pipeline - Document-grounded answers (no hallucinations)

🔥 Advanced Features

Table Extraction - Automatic table detection and parsing
Chart Detection - Identify flowcharts, diagrams, graphs
Audio I/O - Voice input and text-to-speech output
Real-time Translation - Translate queries and responses
Page References - Pinpoint source pages for answers

🎬 Demo

# Quick Start
streamlit run app.py

Sample Interactions

👤 User: "Summarize this document in Kannada"
🤖 IntelliDoc: [Provides Kannada summary with page references]

👤 User: "What are the key dates mentioned?"
🤖 IntelliDoc: [Lists dates with exact page numbers]

👤 User: "Translate the main points to Hindi"
🤖 IntelliDoc: [Translates with source context]

🛠️ Installation

Prerequisites

Python 3.8+
Tesseract OCR
Groq API Key

Step 1: Clone Repository

git clone https://github.com/ramyadjoshi/IntelliDoc-AI-Powered-Intelligent-Document-Analysis-System.git
cd IntelliDoc-AI-Powered-Intelligent-Document-Analysis-System

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Install Tesseract OCR

Windows:

# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Install to: C:\Program Files\Tesseract-OCR

Linux:

sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-hin tesseract-ocr-kan tesseract-ocr-tam

macOS:

brew install tesseract
brew install tesseract-lang

Step 4: Configure Environment

Create a .env file:

GROQ_API_KEY=your_groq_api_key_here
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe  # Windows
# TESSERACT_CMD=/usr/bin/tesseract  # Linux/Mac

Step 5: Run Application

streamlit run app.py

Visit http://localhost:8501 in your browser.

📖 Usage

1. Upload Documents

PDFs - Regular or scanned
Images - JPG, PNG, TIFF, BMP
Processing Modes:
- 🚀 Smart Mode (auto-detects OCR need)
- 🔍 Force OCR (for handwritten/scanned docs)

2. Advanced Options

☑️ Extract Tables
☑️ Detect Charts/Diagrams

3. Ask Questions

Examples:
- "Summarize in Hindi"
- "What is the main topic?"
- "List all dates mentioned"
- "Translate key points to Telugu"
- "Extract numerical values"

4. Download Results

📥 Chat history (TXT/JSON)
📄 Extracted text
📊 Extracted tables

🏗️ Architecture

RAG Pipeline

┌─────────────────────────────────────────────────────────────┐
│                     INDEXING PHASE                          │
├─────────────────────────────────────────────────────────────┤
│  PDF/Image → OCR → Text Extraction → Chunking → TF-IDF      │
│       ↓                                            ↓        │
│  Tables/Charts Detection              FAISS Vector Store    │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      QUERY PHASE                            │
├─────────────────────────────────────────────────────────────┤
│  User Query → TF-IDF Vector → FAISS Search → Top-K Chunks   │
│       ↓                                            ↓        │
│  Language Detection              Context + Query → LLM      │
│       ↓                                            ↓        │
│  Translation Request?                    Grounded Answer    │
└─────────────────────────────────────────────────────────────┘

Key Components

Component	Purpose	Technology
Frontend	User interface	Streamlit
PDF Processing	Text/image extraction	PyMuPDF (Fitz)
OCR Engine	Scanned document reading	Tesseract OCR
Image Processing	Preprocessing & chart detection	OpenCV, Pillow
Vectorization	Text to numerical embeddings	TF-IDF
Search Index	Fast similarity search	FAISS
LLM	Question answering	Llama 3.3 (Groq API)
Language Detection	Auto-detect query language	langdetect

🔧 Technologies

Core Stack

streamlit>=1.28.0      # Web UI
PyMuPDF>=1.23.0        # PDF processing
pytesseract>=0.3.10    # OCR
opencv-python>=4.8.0   # Image processing
Pillow>=10.0.0         # Image handling
scikit-learn>=1.3.0    # TF-IDF vectorization
faiss-cpu>=1.7.4       # Vector search
groq>=0.4.0            # LLM API
langdetect>=1.0.9      # Language detection
pandas>=2.0.0          # Table handling

Supported Languages

🌐 English • Kannada • Hindi • Tamil • Telugu • Marathi • Malayalam • Gujarati • Bengali • Punjabi

📊 Performance

Metric	Value
Processing Speed	3x faster than sequential
OCR Languages	10+ Indian languages
Max PDF Pages	Unlimited (parallel processing)
Chunk Size	1500 chars (300 overlap)
Retrieval Method	FAISS + TF-IDF
Context Window	4500 chars

Optimization Features

⚡ Parallel page processing (4 workers)
🎯 Smart OCR detection (avoids unnecessary OCR)
🧩 Intelligent text chunking
🔍 Top-K retrieval (K=8)
📈 Similarity threshold filtering

🎓 Academic Project

Developers: Ramya D Joshi • Shakuntala K Pawar
Mentor: Dr. R. H. Goudar
Institution: Visvesvaraya Technological University, Belagavi , Karnataka.
Project Phase: Major Project

Research Contributions

Smart OCR Detection - Reduces processing time by auto-detecting OCR need
Multilingual RAG - First Indian language-focused document QA system
Parallel Processing - 3x speedup using ThreadPoolExecutor
Document-Grounded Answers - Zero hallucination through RAG

📝 Configuration

Environment Variables

GROQ_API_KEY=your_api_key          # Required: Groq API for LLM
TESSERACT_CMD=path/to/tesseract    # Required: Tesseract binary path
TESSDATA_PREFIX=path/to/tessdata   # Optional: Tesseract language data

Customizable Parameters

OCR_LANGS = "eng+kan+hin+tam+tel+mar+mal+guj+ben+pan"
TOP_K = 8                  # Number of chunks to retrieve
CHUNK_SIZE = 1500          # Characters per chunk
CHUNK_OVERLAP = 300        # Overlap between chunks
MAX_WORKERS = 4            # Parallel processing threads

🐛 Troubleshooting

Common Issues

1. Tesseract not found

# Windows: Install from https://github.com/UB-Mannheim/tesseract/wiki
# Linux: sudo apt-get install tesseract-ocr tesseract-ocr-[lang]
# Mac: brew install tesseract tesseract-lang

2. FAISS installation error

pip install faiss-cpu  # For CPU
# pip install faiss-gpu  # For GPU (if available)

3. Groq API errors

# Check API key in .env file
# Verify API quota at https://console.groq.com

4. Language data missing

# Download language data from:
# https://github.com/tesseract-ocr/tessdata
# Place in TESSDATA_PREFIX directory

🚀 Future Enhancements

Support for more document formats (DOCX, PPTX, XLSX)
Advanced chart/table understanding with vision models
Real-time collaborative document analysis
Cloud deployment (AWS/Azure/GCP)
Mobile app version
Batch processing API
Enhanced handwriting recognition
Custom fine-tuned models for domain-specific docs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Tesseract OCR - Google's open-source OCR engine
FAISS - Meta's similarity search library
Groq - Ultra-fast LLM inference
Streamlit - Rapid web app framework
PyMuPDF - Fast PDF processing

📧 Contact

Ramya D Joshi - GitHub Profile
Shakuntala K Pawar - GitHub Profile

Project Repository: IntelliDoc

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
docs		docs
.gitignore		.gitignore
.python-version		.python-version
app.py		app.py
pyrag.py		pyrag.py
readme.md		readme.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

📄 IntelliDoc: AI-Powered Multilingual Document Analysis System

🎯 Overview

Why IntelliDoc?

✨ Features

🚀 Core Capabilities

🔥 Advanced Features

🎬 Demo

Sample Interactions

🛠️ Installation

Prerequisites

Step 1: Clone Repository

Step 2: Install Dependencies

Step 3: Install Tesseract OCR

Step 4: Configure Environment

Step 5: Run Application

📖 Usage

1. Upload Documents

2. Advanced Options

3. Ask Questions

4. Download Results

🏗️ Architecture

RAG Pipeline

Key Components

🔧 Technologies

Core Stack

Supported Languages

📊 Performance

Optimization Features

🎓 Academic Project

Research Contributions

📝 Configuration

Environment Variables

Customizable Parameters

🐛 Troubleshooting

Common Issues

🚀 Future Enhancements

📄 License

🙏 Acknowledgments

📧 Contact

⭐ If you find IntelliDoc useful, please consider giving it a star!

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages