ritukanchi/pdf-extraction

PDF Extraction


Overview

This project pairs a Next.js frontend with a Flask-based API that extracts text from PDFs. It supports both searchable and non-searchable (scanned) PDFs, using Tesseract OCR for the latter. Extraction runs asynchronously, and a status endpoint lets clients track progress.


1. Prerequisites

Before running the project, ensure you have the following installed:

  • Docker and Docker Compose
  • Python 3.10+ (if running without Docker)
  • Tesseract OCR (if running without Docker; the Docker setup already installs tesseract-ocr)

2. Running with Docker

2.1 Build and Start the Containers

From the project root directory, run:

docker-compose up --build

This will:

  • Build the backend (Flask) and frontend (Next.js) images
  • Start both services and expose them on:
    • Backend → http://localhost:5000
    • Frontend → http://localhost:3000

To run the backend only:

docker-compose up --build backend

2.2 Stopping the Containers

To stop all services, use:

docker-compose down

3. Running Locally (Without Docker)

3.1 Create a Virtual Environment

To isolate dependencies, create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3.2 Install Dependencies

Install the required Python packages:

pip install -r backend/requirements.txt

3.3 Run the Flask Server

Navigate to the backend directory and start the server:

cd backend
python app.py

Flask will start on http://127.0.0.1:5000.


4. API Endpoints

4.1 Extract Text from a PDF (Asynchronous)

Endpoint:

POST /extract

Request Body (JSON):

{
  "pdf_url": "https://example.com/sample.pdf"
}

Response (202 Accepted - Task Queued):

{
  "task_id": "123e4567-e89b-12d3-a456-426614174000"
}
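
As an illustration of the request above, here is a minimal client sketch using only the Python standard library. The base URL assumes the backend is running on the default address from the Docker section, and the function names (build_extract_request, submit_pdf) are hypothetical helpers, not part of this repository.

```python
import json
import urllib.request

# Assumption: the backend listens on the default address from the Docker section.
BASE_URL = "http://localhost:5000"


def build_extract_request(pdf_url, base_url=BASE_URL):
    """Build the POST /extract request without sending it."""
    body = json.dumps({"pdf_url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_pdf(pdf_url, base_url=BASE_URL, timeout=30):
    """Send the request and return the task_id from the JSON response."""
    request = build_extract_request(pdf_url, base_url)
    # urlopen treats any 2xx code (including 202 Accepted) as success.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["task_id"]
```

Calling submit_pdf("https://example.com/sample.pdf") against a running backend should return the task ID to use with the status endpoint.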

4.2 Check Task Status

Endpoint:

GET /status/{task_id}

Response (Still Processing - 202):

{
  "status": "processing",
  "progress": 25,
  "total_pages": 100
}

Response (Completed - 200):

{
  "task_id": "123e4567-e89b-12d3-a456-426614174000",
  "text_chunks": [
    {
      "text": "Extracted text from page",
      "bbox": [50, 100, 200, 150],
      "page": 0
    }
  ],
  "metadata": {
    "pages": 100,
    "size": 204800
  }
}
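
Since the status endpoint answers 202 while a task is running and 200 once it is done, a client can poll it until the result arrives. The sketch below is one way to do that with the standard library. The helper names, the polling interval, and the attempt limit are all assumptions for illustration, and the README does not document error responses, so anything other than 200 or 202 is simply treated as a failure.

```python
import json
import time
import urllib.request
from collections import defaultdict

# Assumption: the backend listens on the default address from the Docker section.
BASE_URL = "http://localhost:5000"


def fetch_status(task_id, base_url=BASE_URL, timeout=30):
    """GET /status/{task_id}; return (HTTP status code, decoded JSON body)."""
    url = f"{base_url}/status/{task_id}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


def wait_for_result(task_id, fetch=fetch_status, interval=2.0, max_attempts=150):
    """Poll until the API answers 200, then return the completed payload."""
    for _ in range(max_attempts):
        code, payload = fetch(task_id)
        if code == 200:
            return payload
        if code != 202:
            # Undocumented case: stop rather than guess what the code means.
            raise RuntimeError(f"unexpected status code {code}: {payload}")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")


def text_by_page(result):
    """Join chunk texts per page index, keeping the order chunks appear in."""
    pages = defaultdict(list)
    for chunk in result["text_chunks"]:
        pages[chunk["page"]].append(chunk["text"])
    return {page: "\n".join(parts) for page, parts in sorted(pages.items())}
```

The fetch parameter is injectable so the polling loop can be exercised without a running server. text_by_page shows one way to consume the completed payload, using the page field from the sample response as the grouping key.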

5. Debugging & Logs

5.1 View Docker Logs

docker logs -f pdf-extraction-backend-1

5.2 Check if Tesseract is Installed in Docker

docker exec -it pdf-extraction-backend-1 bash
tesseract -v

5.3 Kill and Restart Docker Services

docker-compose down
docker-compose up --build

6. Sample PDFs you can test with