GitHub - ritukanchi/pdf-extraction

PDF Extraction

Overview

This project pairs a Next.js frontend with a Flask-based API that extracts text from PDFs. It supports both searchable and non-searchable PDFs (the latter via OCR using Tesseract), and it processes requests asynchronously with a status tracking system.
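To make the asynchronous flow concrete, here is a minimal sketch of how a background job and a status record could fit together. It is not the project's actual implementation: the in-memory `_tasks` store and the `submit`, `get_status`, and `extract_page` names are all hypothetical, and it assumes `progress` counts processed pages.

```python
import threading
import uuid

# Hypothetical in-memory task store; the real backend may track tasks differently.
_tasks = {}
_lock = threading.Lock()


def submit(pages, extract_page):
    """Start extraction on a background thread and return (task_id, thread)."""
    task_id = str(uuid.uuid4())
    with _lock:
        _tasks[task_id] = {
            "status": "processing",
            "progress": 0,  # assumption: number of pages processed so far
            "total_pages": len(pages),
            "text_chunks": [],
        }

    def worker():
        for index, page in enumerate(pages):
            # extract_page is caller-supplied, e.g. text layer first, OCR as fallback.
            text = extract_page(page)
            with _lock:
                task = _tasks[task_id]
                task["text_chunks"].append({"text": text, "page": index})
                task["progress"] = index + 1
        with _lock:
            _tasks[task_id]["status"] = "completed"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return task_id, thread


def get_status(task_id):
    """Return a snapshot of the task record, or None for an unknown id."""
    with _lock:
        task = _tasks.get(task_id)
        if task is None:
            return None
        return {**task, "text_chunks": list(task["text_chunks"])}
```

A caller would keep the returned `task_id` and call `get_status(task_id)` until `status` becomes `"completed"`. A store like this lives in one process only, so it would not survive a restart or work across multiple workers.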

1. Prerequisites

Before running the project, ensure you have the following installed:

- Docker and Docker Compose
- Python 3.10+ (if running without Docker)
- Tesseract OCR (the Docker image already includes tesseract-ocr; only needed on your machine when running without Docker)
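When running without Docker, it can help to confirm that Tesseract is reachable before starting the server. The following is a small sketch using only the Python standard library. The `tesseract_version` helper is a hypothetical name and is not part of this repository.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract not found on PATH")
```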

2. Running with Docker

2.1 Build and Start the Containers

From the project root directory, run:

docker-compose up --build

This will:

- Build the backend (Flask) and frontend (Next.js) images
- Start both services and expose them on:
  - Backend → http://localhost:5000
  - Frontend → http://localhost:3000

To run the backend only:

docker-compose up --build backend

2.2 Stopping the Containers

To stop all services, use:

docker-compose down

3. Running Locally (Without Docker)

3.1 Create a Virtual Environment

To isolate dependencies, create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3.2 Install Dependencies

Install the required Python packages:

pip install -r backend/requirements.txt

3.3 Run the Flask Server

Navigate to the backend directory and start the server:

cd backend
python app.py

Flask will start on http://127.0.0.1:5000.

4. API Endpoints

4.1 Extract Text from a PDF (Asynchronous)

Endpoint:

POST /extract

Request Body (JSON):

{
  "pdf_url": "https://example.com/sample.pdf"
}

Response (202 Accepted - Task Queued):

{
  "task_id": "123e4567-e89b-12d3-a456-426614174000"
}
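As an illustration, the request above could be sent from Python with the standard library alone. This is a sketch: `build_extract_request` and `submit_pdf` are hypothetical helper names, and the base URL assumes the backend is running on port 5000 as described in section 2.1.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumed backend address


def build_extract_request(pdf_url, base_url=BASE_URL):
    """Build the POST /extract request with a JSON body."""
    body = json.dumps({"pdf_url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_pdf(pdf_url, base_url=BASE_URL):
    """Send the request and return the task_id from the 202 response."""
    request = build_extract_request(pdf_url, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["task_id"]
```

With the backend running, `submit_pdf("https://example.com/sample.pdf")` should return a task ID that can be passed to the status endpoint.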

4.2 Check Task Status

Endpoint:

GET /status/{task_id}

Response (Still Processing - 202):

{
  "status": "processing",
  "progress": 25,
  "total_pages": 100
}

Response (Completed - 200):

{
  "task_id": "123e4567-e89b-12d3-a456-426614174000",
  "text_chunks": [
    {
      "text": "Extracted text from page",
      "bbox": [50, 100, 200, 150],
      "page": 0
    }
  ],
  "metadata": {
    "pages": 100,
    "size": 204800
  }
}
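A client usually polls this endpoint until the 200 response arrives. The sketch below shows one way to do that and then group the extracted text by page. The function names, the polling interval, and the attempt limit are illustrative assumptions. The responses above do not document a failure state, so the sketch only distinguishes 200 from everything else.

```python
import json
import time
import urllib.request


def fetch_status(task_id, base_url="http://localhost:5000"):
    """GET /status/{task_id} and return (http_status, parsed_json)."""
    with urllib.request.urlopen(f"{base_url}/status/{task_id}") as response:
        return response.status, json.load(response)


def wait_for_result(task_id, fetch=fetch_status, interval=2.0, max_attempts=150):
    """Poll until the API answers 200, then return the completed payload."""
    for _ in range(max_attempts):
        code, payload = fetch(task_id)
        if code == 200:
            return payload
        # Any other code (202 in the example above) is treated as still processing.
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")


def text_by_page(result):
    """Group chunk text by page index (0-based in the sample response)."""
    pages = {}
    for chunk in result["text_chunks"]:
        pages.setdefault(chunk["page"], []).append(chunk["text"])
    return pages
```

Passing `fetch` as a parameter keeps the polling loop testable without a running server.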

5. Debugging & Logs

5.1 View Docker Logs

docker logs -f pdf-extraction-backend-1

5.2 Check if Tesseract is Installed in Docker

docker exec -it pdf-extraction-backend-1 bash
tesseract -v

5.3 Kill and Restart Docker Services

docker-compose down
docker-compose up --build

