MarkItDown App

AI-powered document intelligence platform built on top of Microsoft's MarkItDown.

Convert PDFs and other documents into clean Markdown while preserving tables, repairing Indic text, validating bills of quantities (BOQs), and automatically applying OCR to scanned pages.

Features

PDF Extraction

High-quality PDF to Markdown conversion
Layout-aware text extraction
Deterministic reading order
Multi-page document support

OCR Fallback

Automatic OCR for scanned PDFs
OpenAI Vision integration
Corruption detection before OCR execution
Cost-optimized OCR triggering
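
The idea of checking for corruption before paying for OCR can be sketched as a small heuristic. This is a minimal illustration, not the project's actual detector: the function name `needs_ocr` and both thresholds are assumptions chosen for the example.

```python
def needs_ocr(page_text: str, min_chars: int = 20, max_bad_ratio: float = 0.1) -> bool:
    """Return True when extracted text looks empty or corrupted (hypothetical heuristic)."""
    stripped = page_text.strip()
    # Scanned pages usually yield little or no extractable text.
    if len(stripped) < min_chars:
        return True
    # U+FFFD and non-whitespace control characters hint at a broken text layer.
    bad = sum(
        1 for ch in stripped
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r")
    )
    return bad / len(stripped) > max_bad_ratio
```

Only pages for which a check like this returns True would be sent to the vision model, which is one way to keep OCR cost proportional to the number of genuinely unreadable pages.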

Table Extraction

Table detection using PyMuPDF and pdfplumber
Multi-page table reconstruction
Broken row repair
Markdown table serialization
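
Markdown table serialization can be illustrated with a short helper. This is a sketch under assumptions: the name `rows_to_markdown` is hypothetical, and the input is assumed to be a list of rows whose first row is the header, with `None` for empty cells (the shape that pdfplumber's table extraction typically produces).

```python
def rows_to_markdown(rows):
    """Serialize rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def clean(cell):
        # None becomes an empty cell; pipes are escaped and newlines flattened.
        text = "" if cell is None else str(cell)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    # Pad short rows so every line has the same number of columns.
    padded = [[clean(c) for c in r] + [""] * (width - len(r)) for r in rows]
    lines = [
        "| " + " | ".join(padded[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in padded[1:]]
    return "\n".join(lines)


print(rows_to_markdown([["Item", "Qty"], ["Cement", 10], ["Sand", None]]))
```

Padding short rows is a simple form of broken-row repair; rows split across lines or pages need more context than this sketch handles.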

BOQ Validation

Missing serial number detection
Duplicate item detection
Quantity × Rate = Amount verification
Unit of measure (UOM) validation
Quality scoring
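
The serial number, duplicate, and Quantity × Rate = Amount checks listed above can be sketched as follows. The function name `validate_boq`, the row keys (`sr_no`, `qty`, `rate`, `amount`), the tolerance, and the assumption that serial numbers are consecutive integers starting at 1 are all hypothetical; the real validator may work differently.

```python
from decimal import Decimal


def validate_boq(rows, tolerance=Decimal("0.01")):
    """Return a list of human-readable issues found in BOQ rows (illustrative only)."""
    issues = []
    seen = set()
    for row in rows:
        sr = row["sr_no"]
        if sr in seen:
            issues.append(f"duplicate serial number {sr}")
        seen.add(sr)
        # Decimal avoids binary floating-point drift when multiplying money values.
        expected = Decimal(str(row["qty"])) * Decimal(str(row["rate"]))
        if abs(expected - Decimal(str(row["amount"]))) > tolerance:
            issues.append(
                f"item {sr}: qty x rate = {expected}, but amount is {row['amount']}"
            )
    # Assumption: serial numbers are consecutive integers starting at 1.
    if seen:
        for missing in sorted(set(range(1, max(seen) + 1)) - seen):
            issues.append(f"missing serial number {missing}")
    return issues
```

A small tolerance is used instead of strict equality because tender documents often round amounts to two decimal places.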

Indic Language Support

Hindi text correction
Gujarati text correction
Unicode corruption detection
Matra repair engine
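
One common symptom of corrupted Hindi or Gujarati text is a matra (dependent vowel sign) that does not follow a consonant, for example when a PDF stores glyphs in visual rather than logical order. The sketch below flags such orphaned marks using Unicode general categories. It is an assumption about how detection might work, not the project's engine, and the name `find_orphan_marks` is hypothetical.

```python
import unicodedata


def find_orphan_marks(text):
    """Return indexes of combining marks that do not follow a letter or another mark."""
    orphans = []
    prev_category = "Zs"  # treat the start of the text like whitespace
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Matras and similar signs are in the Mark categories (Mn, Mc).
        if category.startswith("M") and prev_category[0] not in "LM":
            orphans.append(i)
        prev_category = category
    return orphans


# U+0915 is DEVANAGARI LETTER KA, U+093F is DEVANAGARI VOWEL SIGN I.
print(find_orphan_marks("\u0915\u093f"))  # logical order: []
print(find_orphan_marks("\u093f\u0915"))  # sign before consonant: [0]
```

This only catches marks at the start of a word or after punctuation. A misplaced sign in the middle of a word still follows some letter, so repairing it needs script-specific rules beyond this check.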

Quality Reporting

Document quality score
OCR usage reporting
Table extraction score
Unicode quality score
BOQ validation score

Modern Web Interface

React frontend
FastAPI backend
Markdown preview
Quality dashboard
Download support

Architecture

Upload Document
        │
        ▼
   FastAPI API
        │
        ▼
Document Router
        │
        ├── Microsoft MarkItDown
        │      └── DOCX/XLSX/PPTX/HTML/etc
        │
        └── Custom PDF Pipeline
                │
                ▼
          Text Extraction
                │
                ▼
       Corruption Detection
                │
                ├── Clean Text
                │
                └── OCR Fallback
                        │
                        ▼
              Table Reconstruction
                        │
                        ▼
                 BOQ Validation
                        │
                        ▼
                 Markdown Output
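
The routing step at the top of the diagram can be sketched as a dispatch on file extension. The function name `route_document` and the returned labels are hypothetical, used only to illustrate the split between the two paths.

```python
from pathlib import Path


def route_document(filename: str) -> str:
    """Pick a processing path by file extension (illustrative labels only)."""
    # PDFs go through the custom pipeline; everything else is handed to MarkItDown.
    if Path(filename).suffix.lower() == ".pdf":
        return "custom_pdf_pipeline"
    return "markitdown"


print(route_document("tender.PDF"))
print(route_document("estimate.xlsx"))
```

In the MarkItDown branch, the backend would presumably call the library's `MarkItDown().convert(path)` and read `text_content` from the result; the PDF branch would run the extraction, corruption detection, OCR, table, and BOQ stages shown above.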

Tech Stack

Backend

FastAPI
Microsoft MarkItDown
PyMuPDF
pdfplumber
OpenAI API

Frontend

React
Vite
JavaScript

PDF Processing

PyMuPDF
pdfplumber
OCR Vision Models

Supported Formats

Format	Support
PDF	✅
DOCX	✅
XLSX	✅
PPTX	✅
HTML	✅
TXT	✅
Images	✅
Audio	✅
ZIP	✅

All non-PDF formats are processed directly through Microsoft MarkItDown.

Installation

Backend

cd backend

python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

pip install -r requirements.txt

Create a .env file containing your OpenAI API key:

OPENAI_API_KEY=your_key_here

Run:

uvicorn app:app --reload

Frontend

cd frontend

npm install

npm run dev

API

Convert Document

POST /api/convert

Form Data:

file=<document>
use_ocr=true

Response:

{
  "markdown": "...",
  "quality_report": {
    "overall_score": 98,
    "unicode_score": 100,
    "table_score": 95,
    "boq_score": 100
  }
}
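
A client will usually want to act on the quality report rather than just display it. The sketch below parses a response body shaped like the example above and returns the scores that fall under a threshold. The helper name `low_scores` and the default threshold are assumptions for illustration.

```python
import json


def low_scores(response_body: str, threshold: int = 90) -> dict:
    """Return the quality_report entries that fall below the threshold."""
    report = json.loads(response_body).get("quality_report", {})
    return {name: score for name, score in report.items() if score < threshold}


body = """
{
  "markdown": "...",
  "quality_report": {
    "overall_score": 98,
    "unicode_score": 100,
    "table_score": 95,
    "boq_score": 100
  }
}
"""
print(low_scores(body, threshold=96))
```

A caller could use the result to route weak conversions to manual review, for example when the table score drops on a BOQ-heavy tender.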

Performance

Verified Improvements

Concurrent request support
Thread-safe processing
Single pdfplumber instance per document
Upload size protection
OCR timeout protection
Safe resource cleanup
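
Upload size protection is typically enforced while reading the request body, so an oversized file is rejected before it is held fully in memory. The following is a minimal sketch of that pattern, not the backend's actual code: `read_limited`, `UploadTooLarge`, and the chunk size are hypothetical names and values.

```python
import io


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured size limit."""


def read_limited(stream, max_bytes, chunk_size=64 * 1024):
    """Read a binary stream fully, raising as soon as it exceeds max_bytes."""
    buffer = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buffer)
        buffer.extend(chunk)
        # Stop early instead of buffering the whole oversized upload.
        if len(buffer) > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")


data = read_limited(io.BytesIO(b"small file"), max_bytes=1024)
print(len(data))
```

In a web handler, the exception would normally be translated into an HTTP 413 response.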

Validation Results

✅ 10/10 Automated Tests Passed

Concurrency Testing
Upload Limit Testing
OCR Timeout Testing
Resource Cleanup Testing
Regression Validation
BOQ Validation Testing

Use Cases

Government Tenders

BOQ extraction
Tender analysis
Rate validation
Quantity verification

Construction Industry

Quantity surveying
BOQ auditing
Cost estimation

Enterprise Documents

OCR processing
Table extraction
Markdown conversion
Knowledge ingestion

AI Workflows

RAG pipelines
Vector databases
LLM preprocessing
Knowledge bases

Project Structure

backend/
├── app.py
├── pipeline.py
├── ocr_engine.py
├── unicode_handler.py
├── corruption_detector.py
├── table_reconstructor.py
├── boq_validator.py
└── validation_engine.py

frontend/
├── src/
│   ├── App.jsx
│   └── components/
│       └── PreviewPane.jsx

Why This Project?

Microsoft MarkItDown is excellent for general document conversion.

This project extends it with:

Advanced PDF processing
OCR fallback
Table reconstruction
BOQ validation
Hindi support
Gujarati support
Quality reporting

while maintaining compatibility with the original MarkItDown architecture.

Roadmap

License

MIT License

Acknowledgements

Microsoft MarkItDown
FastAPI
PyMuPDF
pdfplumber
OpenAI
