AI-powered document intelligence platform built on top of Microsoft's MarkItDown.
Convert PDFs and documents into clean Markdown while preserving tables, repairing Indic text, validating BOQs, and automatically applying OCR to scanned pages.
- High-quality PDF to Markdown conversion
- Layout-aware text extraction
- Deterministic reading order
- Multi-page document support
- Automatic OCR for scanned PDFs
- OpenAI Vision integration
- Corruption detection before OCR execution
- Cost-optimized OCR triggering
- Table detection using PyMuPDF and pdfplumber
- Multi-page table reconstruction
- Broken row repair
- Markdown table serialization
- Missing serial number detection
- Duplicate item detection
- Quantity × Rate = Amount verification
- UOM validation
- Quality scoring
- Hindi text correction
- Gujarati text correction
- Unicode corruption detection
- Matra repair engine
- Document quality score
- OCR usage reporting
- Table extraction score
- Unicode quality score
- BOQ validation score
- React frontend
- FastAPI backend
- Markdown preview
- Quality dashboard
- Download support
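The BOQ checks in the list above (missing serial numbers, duplicate items, and Quantity × Rate = Amount) can be illustrated with a minimal sketch. The `BoqRow` type, the `validate_boq` function, and the rounding tolerance are hypothetical names and values chosen for illustration; they are not taken from `boq_validator.py`.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class BoqRow:
    # Hypothetical row shape; the real validator may use different fields.
    serial: int
    description: str
    quantity: Decimal
    rate: Decimal
    amount: Decimal


def validate_boq(rows, tolerance=Decimal("0.01")):
    """Return a list of human-readable issues found in the given rows."""
    issues = []

    # Missing serial numbers: gaps in the expected 1..max sequence.
    serials = [row.serial for row in rows]
    if serials:
        missing = sorted(set(range(1, max(serials) + 1)) - set(serials))
        for number in missing:
            issues.append(f"missing serial {number}")

    # Duplicate items: same description after case and whitespace folding.
    seen = set()
    for row in rows:
        key = " ".join(row.description.lower().split())
        if key in seen:
            issues.append(f"duplicate item at serial {row.serial}")
        seen.add(key)

    # Quantity x Rate = Amount, allowing a small rounding tolerance.
    for row in rows:
        expected = row.quantity * row.rate
        if abs(expected - row.amount) > tolerance:
            issues.append(
                f"amount mismatch at serial {row.serial}: "
                f"expected {expected}, found {row.amount}"
            )
    return issues
```

`Decimal` is used instead of `float` so that currency arithmetic does not pick up binary rounding noise; an empty result list means no issues were found.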
Upload Document
│
▼
FastAPI API
│
▼
Document Router
│
├── Microsoft MarkItDown
│   └── DOCX/XLSX/PPTX/HTML/etc
│
└── Custom PDF Pipeline
    │
    ▼
    Text Extraction
    │
    ▼
    Corruption Detection
    │
    ├── Clean Text
    │
    └── OCR Fallback
    │
    ▼
    Table Reconstruction
    │
    ▼
    BOQ Validation
    │
    ▼
    Markdown Output
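The Corruption Detection step in the flow above decides whether a page keeps its extracted text or falls back to OCR. One possible heuristic is sketched below, assuming that damage shows up as replacement characters, private-use or unassigned code points, and combining marks (such as Devanagari or Gujarati matras) with no base character before them. The function names and the 5% threshold are assumptions for illustration, not the logic in `corruption_detector.py`.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"


def corruption_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction damage."""
    total = 0
    suspicious = 0
    previous = " "  # treat the start of the text like a word boundary
    for ch in text:
        if ch.isspace():
            previous = ch
            continue
        total += 1
        category = unicodedata.category(ch)
        if ch == REPLACEMENT_CHAR or category in ("Cc", "Co", "Cn"):
            # Replacement, control, private-use, or unassigned code point.
            suspicious += 1
        elif category in ("Mn", "Mc") and previous.isspace():
            # Combining mark (e.g. a matra) with no base character before it.
            suspicious += 1
        previous = ch
    # An empty text layer is treated as fully unusable.
    return suspicious / total if total else 1.0


def needs_ocr(text: str, threshold: float = 0.05) -> bool:
    """Return True when the extracted text is damaged enough to justify OCR."""
    return corruption_ratio(text) > threshold
```

Running a cheap check like this before any OCR call is what keeps OCR cost-optimized: clean pages never reach the vision model.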
- FastAPI
- Microsoft MarkItDown
- PyMuPDF
- pdfplumber
- OpenAI API
- React
- Vite
- JavaScript
- PyMuPDF
- pdfplumber
- OCR Vision Models
| Format | Support |
|---|---|
| PDF | ✅ |
| DOCX | ✅ |
| XLSX | ✅ |
| PPTX | ✅ |
| HTML | ✅ |
| TXT | ✅ |
| Images | ✅ |
| Audio | ✅ |
| ZIP | ✅ |
All non-PDF formats are processed directly through Microsoft MarkItDown.
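The routing rule stated above can be sketched as follows. `choose_converter` and `convert_with_markitdown` are hypothetical helper names; the only library calls used are `MarkItDown().convert(...)` and the `text_content` attribute of its result, as documented by MarkItDown.

```python
from pathlib import Path


def choose_converter(filename: str) -> str:
    """Return "pdf_pipeline" for PDFs and "markitdown" for everything else."""
    is_pdf = Path(filename).suffix.lower() == ".pdf"
    return "pdf_pipeline" if is_pdf else "markitdown"


def convert_with_markitdown(path: str) -> str:
    """Convert a non-PDF file to Markdown text using MarkItDown."""
    # Imported lazily so the routing helper works without the package installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(path)
    return result.text_content
```

Routing on the file extension is the simplest option; a stricter router might also inspect the file's magic bytes, since uploads can carry a misleading extension.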
cd backend
python -m venv venv
# Windows
venv\Scripts\activate
pip install -r requirements.txt
Create .env:
OPENAI_API_KEY=your_key_here
Run:
uvicorn app:app --reload
cd frontend
npm install
npm run dev
POST /api/convert
Form Data:
file=<document>
use_ocr=true
Response:
{
  "markdown": "...",
  "quality_report": {
    "overall_score": 98,
    "unicode_score": 100,
    "table_score": 95,
    "boq_score": 100
  }
}
- Concurrent request support
- Thread-safe processing
- Single pdfplumber instance per document
- Upload size protection
- OCR timeout protection
- Safe resource cleanup
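Two of the protections listed above, upload size limits and OCR timeouts, can be sketched with the standard library alone. The 25 MB limit, the 30-second default, and the function names are assumptions for illustration; `ocr_call` stands in for whatever coroutine actually performs the OCR request.

```python
import asyncio

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed limit, not the project's setting


def check_upload_size(data: bytes, limit: int = MAX_UPLOAD_BYTES) -> None:
    """Reject an upload before any parsing work is done on it."""
    if len(data) > limit:
        raise ValueError(f"upload of {len(data)} bytes exceeds limit of {limit}")


async def ocr_with_timeout(ocr_call, page_image: bytes, timeout_s: float = 30.0):
    """Await ocr_call(page_image); return None if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(ocr_call(page_image), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The pending call is cancelled by wait_for; the caller can fall back
        # to the raw extracted text for this page.
        return None
```

Returning `None` instead of raising lets a single slow page degrade gracefully without failing the whole document.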
✅ 10/10 Automated Tests Passed
- Concurrency Testing
- Upload Limit Testing
- OCR Timeout Testing
- Resource Cleanup Testing
- Regression Validation
- BOQ Validation Testing
- BOQ extraction
- Tender analysis
- Rate validation
- Quantity verification
- Quantity surveying
- BOQ auditing
- Cost estimation
- OCR processing
- Table extraction
- Markdown conversion
- Knowledge ingestion
- RAG pipelines
- Vector databases
- LLM preprocessing
- Knowledge bases
backend/
├── app.py
├── pipeline.py
├── ocr_engine.py
├── unicode_handler.py
├── corruption_detector.py
├── table_reconstructor.py
├── boq_validator.py
└── validation_engine.py
frontend/
├── src/
│   ├── App.jsx
│   └── components/
│       └── PreviewPane.jsx
Microsoft MarkItDown is excellent for general document conversion.
This project extends it with:
- Advanced PDF processing
- OCR fallback
- Table reconstruction
- BOQ validation
- Hindi support
- Gujarati support
- Quality reporting
while maintaining compatibility with the original MarkItDown architecture.
- Multi-provider OCR support
- Batch document processing
- Excel export
- Advanced table detection
- Local OCR models
- Enterprise dashboard
MIT License
- Microsoft MarkItDown
- FastAPI
- PyMuPDF
- pdfplumber
- OpenAI