j-a-y-e-s-h/markitdown-app

MarkItDown App

AI-powered document intelligence platform built on top of Microsoft's MarkItDown.

Convert PDFs and documents into clean Markdown while preserving tables, repairing Indic text, validating BOQs, and automatically applying OCR to scanned pages.


Features

PDF Extraction

  • High-quality PDF to Markdown conversion
  • Layout-aware text extraction
  • Deterministic reading order
  • Multi-page document support

OCR Fallback

  • Automatic OCR for scanned PDFs
  • OpenAI Vision integration
  • Corruption detection before OCR execution
  • Cost-optimized OCR triggering
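
As an illustration of how corruption detection can gate OCR so the vision call only runs when the extracted text looks unusable, here is a minimal sketch. The function name, the thresholds, and the heuristic itself (too little text, or too many replacement, private-use, or control characters) are assumptions for illustration, not the project's actual logic.

```python
import unicodedata


def looks_corrupted(text: str, max_bad_ratio: float = 0.05, min_chars: int = 20) -> bool:
    """Return True when a page's extracted text probably needs OCR (hypothetical helper)."""
    stripped = text.strip()
    # Scanned pages usually yield little or no extractable text.
    if len(stripped) < min_chars:
        return True
    bad = 0
    for ch in stripped:
        cat = unicodedata.category(ch)
        # Count U+FFFD, private-use (Co), unassigned (Cn), and non-whitespace control chars (Cc).
        if ch == "\ufffd" or cat in ("Co", "Cn") or (cat == "Cc" and ch not in "\n\t\r"):
            bad += 1
    return bad / len(stripped) > max_bad_ratio
```

A page would be sent to OCR only when `looks_corrupted(page_text)` is true, which keeps clean digital PDFs off the paid path.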

Table Extraction

  • Table detection using PyMuPDF and pdfplumber
  • Multi-page table reconstruction
  • Broken row repair
  • Markdown table serialization
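
The multi-page reconstruction and Markdown serialization steps can be sketched as follows. This assumes tables arrive as lists of rows whose cells are strings or `None` (the shape pdfplumber's `extract_table()` returns); the function names and the repeated-header rule are illustrative assumptions, not the project's implementation.

```python
from typing import Optional

Row = list[Optional[str]]


def merge_continuation(first: list[Row], nxt: list[Row]) -> list[Row]:
    """Append a next-page fragment, dropping its header row if it repeats the first one."""
    if first and nxt and nxt[0] == first[0]:
        nxt = nxt[1:]
    return first + nxt


def to_markdown(rows: list[Row]) -> str:
    """Serialize rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def fmt(row: Row) -> str:
        # Empty cells become "", pipes are escaped, and line breaks are flattened.
        cells = [(c or "").replace("|", "\\|").replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))  # pad short (broken) rows
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(r) for r in rows[1:]]
    return "\n".join(lines)
```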

BOQ Validation

  • Missing serial number detection
  • Duplicate item detection
  • Quantity × Rate = Amount verification
  • UOM validation
  • Quality scoring
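
To make the checks above concrete, here is a minimal sketch of serial-gap, duplicate, and Quantity × Rate = Amount validation. The `BoqRow` fields, the `validate_boq` name, the rounding to two decimals, and the tolerance are all assumptions for illustration; the project's `boq_validator.py` may work differently.

```python
from dataclasses import dataclass


@dataclass
class BoqRow:  # hypothetical row shape
    serial: int
    description: str
    quantity: float
    rate: float
    amount: float


def validate_boq(rows: list[BoqRow], tolerance: float = 0.01) -> list[str]:
    """Return human-readable issues found in a bill of quantities."""
    issues: list[str] = []
    serials = [r.serial for r in rows]
    if serials:
        # Any integer missing between the lowest and highest serial is a gap.
        expected_serials = set(range(min(serials), max(serials) + 1))
        for s in sorted(expected_serials - set(serials)):
            issues.append(f"missing serial {s}")
    seen: set[str] = set()
    for r in rows:
        key = r.description.strip().lower()
        if key in seen:
            issues.append(f"duplicate item at serial {r.serial}")
        seen.add(key)
        expected = round(r.quantity * r.rate, 2)
        if abs(expected - r.amount) > tolerance:
            issues.append(
                f"amount mismatch at serial {r.serial}: "
                f"expected {expected:.2f}, got {r.amount:.2f}"
            )
    return issues
```

An empty list means the rows passed all three checks; a quality score could then be derived from the number of issues relative to the number of rows.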

Indic Language Support

  • Hindi text correction
  • Gujarati text correction
  • Unicode corruption detection
  • Matra repair engine
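
One kind of corruption a matra repair step might target is a short-i vowel sign that text extraction places before its consonant instead of after it. The sketch below detects Devanagari and Gujarati combining marks that have no base letter and swaps a stranded short-i sign with the consonant that follows. The function names and this specific rule are assumptions for illustration; it handles only a single following consonant, not conjunct clusters.

```python
import unicodedata

# Devanagari and Gujarati short-i vowel signs.
I_MATRAS = {"\u093f", "\u0abf"}


def _is_indic_mark(ch: str) -> bool:
    in_block = "\u0900" <= ch <= "\u097f" or "\u0a80" <= ch <= "\u0aff"
    return in_block and unicodedata.category(ch).startswith("M")


def find_orphan_marks(text: str) -> list[int]:
    """Indexes of Devanagari/Gujarati combining marks not preceded by a letter or mark."""
    orphans = []
    for i, ch in enumerate(text):
        if not _is_indic_mark(ch):
            continue
        prev = text[i - 1] if i > 0 else ""
        if not prev or unicodedata.category(prev)[0] not in ("L", "M"):
            orphans.append(i)
    return orphans


def repair_leading_i_matra(text: str) -> str:
    """Move a stranded short-i sign behind the consonant that follows it."""
    chars = list(text)
    for i in find_orphan_marks(text):
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if chars[i] in I_MATRAS and nxt and unicodedata.category(nxt) == "Lo":
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

The count returned by `find_orphan_marks` could also feed a Unicode quality score, since well-formed text should contain no orphaned marks.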

Quality Reporting

  • Document quality score
  • OCR usage reporting
  • Table extraction score
  • Unicode quality score
  • BOQ validation score

Modern Web Interface

  • React frontend
  • FastAPI backend
  • Markdown preview
  • Quality dashboard
  • Download support

Architecture

Upload Document
        │
        ▼
   FastAPI API
        │
        ▼
Document Router
        │
        ├── Microsoft MarkItDown
        │      └── DOCX/XLSX/PPTX/HTML/etc
        │
        └── Custom PDF Pipeline
                │
                ▼
          Text Extraction
                │
                ▼
       Corruption Detection
                │
                ├── Clean Text
                │
                └── OCR Fallback
                        │
                        ▼
              Table Reconstruction
                        │
                        ▼
                 BOQ Validation
                        │
                        ▼
                 Markdown Output
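
The routing step at the top of the diagram can be sketched as below: PDFs go to the custom pipeline and everything else goes to MarkItDown. The `choose_pipeline` and `convert_with_markitdown` names are hypothetical; `MarkItDown().convert(...)` and its `text_content` attribute come from the `markitdown` package.

```python
from pathlib import Path


def choose_pipeline(filename: str) -> str:
    """Return which branch of the diagram handles the file (hypothetical helper)."""
    return "custom_pdf" if Path(filename).suffix.lower() == ".pdf" else "markitdown"


def convert_with_markitdown(path: str) -> str:
    """Convert a non-PDF document to Markdown text using Microsoft MarkItDown."""
    # Imported lazily so the routing helper works without the package installed.
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content
```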

Tech Stack

Backend

  • FastAPI
  • Microsoft MarkItDown
  • PyMuPDF
  • pdfplumber
  • OpenAI API

Frontend

  • React
  • Vite
  • JavaScript

PDF Processing

  • PyMuPDF
  • pdfplumber
  • OCR Vision Models

Supported Formats

  • PDF
  • DOCX
  • XLSX
  • PPTX
  • HTML
  • TXT
  • Images
  • Audio
  • ZIP

All non-PDF formats are processed directly through Microsoft MarkItDown.


Installation

Backend

cd backend

python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

pip install -r requirements.txt

Create a .env file:

OPENAI_API_KEY=your_key_here

Run:

uvicorn app:app --reload

Frontend

cd frontend

npm install

npm run dev

API

Convert Document

POST /api/convert

Form Data:

file=<document>
use_ocr=true

Response:

{
  "markdown": "...",
  "quality_report": {
    "overall_score": 98,
    "unicode_score": 100,
    "table_score": 95,
    "boq_score": 100
  }
}
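
A minimal client sketch for this endpoint, using only the Python standard library. It assumes the backend is running locally on uvicorn's default address (`http://127.0.0.1:8000`); the helper names are hypothetical, and the form fields `file` and `use_ocr` are taken from the request description above.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields: dict[str, str], file_name: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file as multipart/form-data; return (body, content type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert(path: str, base_url: str = "http://127.0.0.1:8000", use_ocr: bool = True) -> dict:
    """POST a document to /api/convert and return the decoded JSON response."""
    body, content_type = build_multipart(
        {"use_ocr": str(use_ocr).lower()}, Path(path).name, Path(path).read_bytes()
    )
    request = urllib.request.Request(
        f"{base_url}/api/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `convert("tender.pdf")["quality_report"]` would give the score dictionary shown in the response above.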

Performance

Verified Improvements

  • Concurrent request support
  • Thread-safe processing
  • Single pdfplumber instance per document
  • Upload size protection
  • OCR timeout protection
  • Safe resource cleanup
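
As an illustration of OCR timeout protection, the sketch below runs a blocking call in a worker thread and returns a fallback value if it does not finish in time. `run_with_timeout` and `fake_slow_ocr` are hypothetical names, and the approach is an assumption rather than the project's actual mechanism. Note that the worker thread is not killed on timeout; it simply finishes in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(fn, timeout_s: float, fallback, *args):
    """Call fn(*args) in a worker thread; return fallback if it exceeds timeout_s."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the request on a still-running worker.
        executor.shutdown(wait=False)


def fake_slow_ocr(delay_s: float) -> str:
    """Stand-in for a slow OCR call (hypothetical, for demonstration only)."""
    time.sleep(delay_s)
    return "ocr text"
```

Returning a fallback such as an empty string lets the pipeline keep whatever text it already extracted instead of failing the whole request.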

Validation Results

✅ 10/10 Automated Tests Passed

  • Concurrency Testing
  • Upload Limit Testing
  • OCR Timeout Testing
  • Resource Cleanup Testing
  • Regression Validation
  • BOQ Validation Testing

Use Cases

Government Tenders

  • BOQ extraction
  • Tender analysis
  • Rate validation
  • Quantity verification

Construction Industry

  • Quantity surveying
  • BOQ auditing
  • Cost estimation

Enterprise Documents

  • OCR processing
  • Table extraction
  • Markdown conversion
  • Knowledge ingestion

AI Workflows

  • RAG pipelines
  • Vector databases
  • LLM preprocessing
  • Knowledge bases

Project Structure

backend/
├── app.py
├── pipeline.py
├── ocr_engine.py
├── unicode_handler.py
├── corruption_detector.py
├── table_reconstructor.py
├── boq_validator.py
└── validation_engine.py

frontend/
├── src/
│   ├── App.jsx
│   └── components/
│       └── PreviewPane.jsx

Why This Project?

Microsoft MarkItDown is excellent for general document conversion.

This project extends it with:

  • Advanced PDF processing
  • OCR fallback
  • Table reconstruction
  • BOQ validation
  • Hindi support
  • Gujarati support
  • Quality reporting

while maintaining compatibility with the original MarkItDown architecture.


Roadmap

  • Multi-provider OCR support
  • Batch document processing
  • Excel export
  • Advanced table detection
  • Local OCR models
  • Enterprise dashboard

License

MIT License


Acknowledgements

  • Microsoft MarkItDown
  • FastAPI
  • PyMuPDF
  • pdfplumber
  • OpenAI

About

AI-powered PDF extraction and document intelligence platform with OCR, table extraction, BOQ validation, tender analysis, multilingual support, and Microsoft MarkItDown compatibility.
