purunep/pdfparser

PDF Parser FastAPI Service

The PDF Parser API is an intelligent and modular document processing service that converts unstructured PDF files into structured, machine-readable data. It automatically detects and extracts text blocks, tables, key-value pairs, and images, and returns the results in a hierarchical JSON format, complete with page-level and spatial metadata.

Designed with flexibility in mind, the API supports selective extraction modes: you can extract only text, only images, or both, depending on your use case.

Built using FastAPI and organized with a service-layer architecture, the project is cleanly structured for maintainability, scalability, and ease of integration.

This API is particularly useful in agentic AI systems and RAG (Retrieval-Augmented Generation) pipelines, where an accurate understanding of document layout and semantics is crucial for retrieval and contextual reasoning.

Features

  • Extracts text, tables, and key-value pairs from PDF documents
  • Extracts images (figures) from PDF documents
  • Returns structured JSON output with chunk types and page numbers
  • Modular codebase for easy extension and maintenance

Project Structure

pdfparser/
│
├── main.py                # FastAPI app and API routes
├── models.py              # Pydantic models for request/response
├── requirements.txt       # Python dependencies
└── services/
    ├── chunk_service.py   # Text chunk labeling, table detection, key-value extraction
    ├── image_service.py   # PDF image extraction logic
    └── pdf_service.py     # PDF text extraction logic

Requirements

  • Python 3.8+

Installation

  1. Create a virtual environment

python -m venv .venv

  2. Activate the virtual environment

On Windows:

.venv\Scripts\activate

On macOS/Linux:

source .venv/bin/activate

  3. Install dependencies

pip install -r requirements.txt

Running the Service

Start the FastAPI server using Uvicorn:

uvicorn main:app --reload

The API will be available at http://127.0.0.1:8000.

API Usage

Endpoint: /extract

Method: POST

Description: Extracts text and/or images from an uploaded PDF file.

Query Parameters:

  • mode: text, images, or both (default: both)
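How the mode parameter could select which extractors run can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the names ExtractMode and run_extractors are hypothetical, and the two extractor callables stand in for whatever pdf_service.py and image_service.py expose.

```python
from enum import Enum
from typing import Callable, Dict, List


class ExtractMode(str, Enum):
    """Allowed values of the `mode` query parameter (hypothetical name)."""
    TEXT = "text"
    IMAGES = "images"
    BOTH = "both"


def run_extractors(
    mode: ExtractMode,
    extract_text: Callable[[], List[Dict]],
    extract_images: Callable[[], List[Dict]],
) -> List[Dict]:
    """Run only the extractors selected by `mode` and merge their chunks."""
    chunks: List[Dict] = []
    if mode in (ExtractMode.TEXT, ExtractMode.BOTH):
        chunks.extend(extract_text())
    if mode in (ExtractMode.IMAGES, ExtractMode.BOTH):
        chunks.extend(extract_images())
    return chunks
```

Constructing the enum from an unsupported string, such as ExtractMode("pdf"), raises ValueError, which is one way to reject invalid mode values early.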

Request:

  • file: PDF file to upload (as form-data)

Response:

  • JSON object with a chunks list; each chunk contains:
    • chunk_id: Unique identifier
    • type: text, table, figure, or marginalia
    • page: Page number
    • content: Extracted content (text, HTML table, or image filename)

Example using curl:

curl -X POST "http://127.0.0.1:8000/extract?mode=both" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/document.pdf"
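The same upload can be built from Python using only the standard library. The sketch below assembles the multipart/form-data body by hand and constructs the request without sending it; build_multipart and build_extract_request are hypothetical helper names, and the application/pdf part type is an assumption about what the server accepts.

```python
import uuid
import urllib.request
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Return (body, content_type) for a single-file multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def build_extract_request(
    pdf_bytes: bytes,
    filename: str,
    mode: str = "both",
    base_url: str = "http://127.0.0.1:8000",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /extract endpoint."""
    body, content_type = build_multipart("file", filename, pdf_bytes)
    return urllib.request.Request(
        f"{base_url}/extract?mode={mode}",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
```

With the server running, passing the returned request to urllib.request.urlopen and decoding the body with json.load should yield the chunks list described above.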

Code Overview

  • main.py: Defines the FastAPI app and the /extract endpoint. Delegates PDF processing to service classes.
  • models.py: Contains Pydantic models for API responses.
  • services/pdf_service.py: Handles text extraction and chunking from PDFs.
  • services/image_service.py: Handles image extraction from PDFs.
  • services/chunk_service.py: Contains logic for labeling text chunks, extracting key-value pairs, and converting tables to HTML.
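To make the responsibilities of chunk_service.py more concrete, here is a small sketch of two of the operations it is described as performing: key-value extraction and table-to-HTML conversion. These are simplified stand-ins under stated assumptions, not the repository's implementation: the function names are hypothetical, the key-value rule is a naive split on the first colon, and tables are assumed to arrive as lists of string cells.

```python
from html import escape
from typing import List, Optional, Tuple


def extract_key_value(line: str) -> Optional[Tuple[str, str]]:
    """Split 'Key: value' on the first colon; return None if either side is empty."""
    key, sep, value = line.partition(":")
    key, value = key.strip(), value.strip()
    if not sep or not key or not value:
        return None
    return key, value


def table_to_html(rows: List[List[str]]) -> str:
    """Render rows of cells as a minimal HTML table, escaping cell text."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"
```

A colon-based heuristic like this will misfire on lines such as timestamps, so a real labeler would likely need additional checks before treating a line as a key-value pair.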

Extending the Service

  • Add new chunk types or extraction logic in chunk_service.py.
  • Add new endpoints in main.py as needed.
  • Use Pydantic models in models.py for request/response validation.

Contributing

Contributions are welcome! Please submit a pull request or open an issue to discuss any changes.

License

This project is licensed under the MIT License.
