PDF Extraction
This Next.js and Flask application extracts text from PDFs. It handles both searchable PDFs and non-searchable PDFs (the latter via OCR using Tesseract). Extraction runs asynchronously, and a status endpoint tracks the progress of each task.
Before running the project, ensure you have the following installed:
- Docker and Docker Compose
- Python 3.10+ (if running without Docker)
- Tesseract OCR (tesseract-ocr is installed in the Docker image)
From the project root directory, run:
docker-compose up --build
This will:
- Build the backend (Flask) and frontend (Next.js) images
- Start both services and expose them on:
  - Backend → http://localhost:5000
  - Frontend → http://localhost:3000
To run the backend only:
docker-compose up --build backend
To stop all services, use:
docker-compose down
To isolate dependencies, create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install the required Python packages:
pip install -r backend/requirements.txt
Navigate to the backend directory and start the server:
cd backend
python app.py
Flask will start on http://127.0.0.1:5000.
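To confirm that the backend is accepting connections, a small TCP check can help. This is a minimal sketch that assumes the default host and port shown above; `is_listening` is a hypothetical helper and not part of the project.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 5000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    The defaults mirror the address Flask reports on startup (an assumption
    about your local setup; adjust them if you changed the port).
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing is accepting connections on the target port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_listening()` after `python app.py` should return `True` once the server is up. It only checks that the port is open, not that the API itself is healthy.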
Endpoint:
POST /extract
Request Body (JSON):
{
"pdf_url": "https://example.com/sample.pdf"
}
Response (202 Accepted - Task Queued):
{
"task_id": "123e4567-e89b-12d3-a456-426614174000"
}
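The request above can be sent from Python using only the standard library. The following is a sketch, assuming the backend is reachable at http://localhost:5000 as in the Docker setup. `build_extract_request` and `submit_pdf` are hypothetical helper names, not functions provided by this project.

```python
import json
import urllib.request

# Assumption: the backend address used by the Docker setup in this README.
BASE_URL = "http://localhost:5000"


def build_extract_request(pdf_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /extract request without sending it."""
    body = json.dumps({"pdf_url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_pdf(pdf_url: str, base_url: str = BASE_URL) -> str:
    """Send the request and return the task_id from the JSON response."""
    request = build_extract_request(pdf_url, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["task_id"]
```

For example, `submit_pdf("https://example.com/sample.pdf")` would return the `task_id` string to use with the status endpoint. Building the request separately from sending it keeps the payload easy to inspect without a running server.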
Endpoint:
GET /status/{task_id}
Response (Still Processing - 202):
{
"status": "processing",
"progress": 25,
"total_pages": 100
}
Response (Completed - 200):
{
"task_id": "123e4567-e89b-12d3-a456-426614174000",
"text_chunks": [
{
"text": "Extracted text from page",
"bbox": [50, 100, 200, 150],
"page": 0
}
],
"metadata": {
"pages": 100,
"size": 204800
}
}
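Since a task returns 202 while processing and 200 when complete, a client can poll the status endpoint until the result arrives. The sketch below shows one way to do that and then group the extracted text by page. `wait_for_result` and `text_by_page` are hypothetical helpers. Treating any status code other than 200 or 202 as an error is an assumption, since error responses are not documented here.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


def wait_for_result(
    fetch_status: Callable[[], Tuple[int, dict]],
    interval: float = 2.0,
    max_attempts: int = 150,
) -> dict:
    """Call fetch_status until it reports HTTP 200, then return the payload.

    fetch_status is a caller-supplied function that performs
    GET /status/{task_id} and returns (status_code, json_body).
    """
    for _ in range(max_attempts):
        code, body = fetch_status()
        if code == 200:
            return body
        if code != 202:
            # Assumption: anything other than 200/202 means the task failed.
            raise RuntimeError(f"unexpected status code {code}: {body}")
        time.sleep(interval)
    raise TimeoutError("task did not finish within the allowed attempts")


def text_by_page(result: dict) -> Dict[int, str]:
    """Join the text of all chunks on each page, keyed by page index."""
    pages = defaultdict(list)
    for chunk in result["text_chunks"]:
        pages[chunk["page"]].append(chunk["text"])
    return {page: "\n".join(texts) for page, texts in sorted(pages.items())}
```

With `urllib`, `fetch_status` could wrap `urllib.request.urlopen` on the status URL and return `(response.status, json.load(response))`. Passing the fetch function in as an argument keeps the polling logic independent of any particular HTTP client.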
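To follow the backend logs:
docker logs -f pdf-extraction-backend-1
To open a shell in the backend container and check that Tesseract is installed:
docker exec -it pdf-extraction-backend-1 bash
tesseract -v
To rebuild and restart all services:
docker-compose down
docker-compose up --build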