PDF Parser FastAPI Service

The PDF Parser API is a modular document processing service that converts unstructured PDF files into structured, machine-readable data. It detects and extracts text blocks, tables, key-value pairs, and images, and returns them as hierarchical JSON with page-level and spatial metadata.

Extraction is selective: depending on your use case, you can extract only text, only images, or both.

Built with FastAPI and organized around a service layer, the project keeps routing, data models, and extraction logic in separate modules, which makes it easier to maintain, extend, and integrate.

The API is particularly useful in agentic AI systems and RAG (Retrieval-Augmented Generation) pipelines, where retrieval and contextual reasoning depend on an accurate picture of document layout and semantics.

Features

- Extracts text, tables, and key-value pairs from PDF documents
- Extracts images (figures) from PDF documents
- Returns structured JSON output with chunk types and page numbers
- Modular codebase for easy extension and maintenance

Project Structure

pdfparser/
│
├── main.py                # FastAPI app and API routes
├── models.py              # Pydantic models for request/response
├── requirements.txt       # Python dependencies
└── services/
    ├── chunk_service.py   # Text chunk labeling, table detection, key-value extraction
    ├── image_service.py   # PDF image extraction logic
    └── pdf_service.py     # PDF text extraction logic

Requirements

Python 3.8+

Installation

Create a virtual environment

python -m venv .venv

Activate the virtual environment

On Windows:

.venv\Scripts\activate

On macOS/Linux:

source .venv/bin/activate

Install dependencies

pip install -r requirements.txt

Running the Service

Start the FastAPI server using Uvicorn:

uvicorn main:app --reload

The API will be available at http://127.0.0.1:8000.

API Usage

Endpoint: `/extract`

Method: POST

Description: Extracts text and/or images from an uploaded PDF file.

Query Parameters:

mode: text, images, or both (default: both)

Request:

file: PDF file to upload (as form-data)

Response:

A JSON object with a chunks list; each chunk contains:
- chunk_id: Unique identifier
- type: text, table, figure, or marginalia
- page: Page number
- content: Extracted content (text, HTML table, or image filename)
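
The response fields listed above can be consumed with nothing more than the standard library. The following is a minimal sketch that groups chunks by their type field. The sample payload is hypothetical: the chunk_id values, image filename, and text content are invented for illustration and only follow the field list, not real service output.

```python
import json
from collections import defaultdict

# Hypothetical response body shaped after the documented fields.
# The chunk_id format and all content values are invented for illustration.
SAMPLE_RESPONSE = """
{
  "chunks": [
    {"chunk_id": "c1", "type": "text", "page": 1, "content": "Quarterly report"},
    {"chunk_id": "c2", "type": "table", "page": 1, "content": "<table><tr><td>Q1</td></tr></table>"},
    {"chunk_id": "c3", "type": "figure", "page": 2, "content": "page2_img1.png"},
    {"chunk_id": "c4", "type": "marginalia", "page": 2, "content": "Page 2 of 2"}
  ]
}
"""


def group_chunks_by_type(response_text):
    """Parse an /extract response and group its chunks by their type field."""
    payload = json.loads(response_text)
    grouped = defaultdict(list)
    for chunk in payload["chunks"]:
        grouped[chunk["type"]].append(chunk)
    return dict(grouped)


grouped = group_chunks_by_type(SAMPLE_RESPONSE)
for chunk_type, chunks in grouped.items():
    pages = sorted({c["page"] for c in chunks})
    print(f"{chunk_type}: {len(chunks)} chunk(s) on page(s) {pages}")
```

Grouping by type is handy in a RAG pipeline, where tables and figures are often indexed differently from running text and marginalia is usually dropped.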

Example using curl:

curl -X POST "http://127.0.0.1:8000/extract?mode=both" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/your.pdf"
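
The same request can be issued from Python. The sketch below uses only the standard library and builds the multipart body by hand, assuming the form field is named file as described under Request. The function name build_extract_request, the example.pdf filename, and the placeholder bytes are hypothetical; the request is constructed but not sent, so the snippet runs without the service.

```python
import uuid
import urllib.request


def build_extract_request(pdf_bytes, filename, mode="both",
                          base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a multipart/form-data POST to /extract."""
    boundary = uuid.uuid4().hex
    # One form part named "file", matching the documented upload field.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/extract?mode={mode}",
        data=head + pdf_bytes + tail,
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes stand in for a real PDF read with open(path, "rb").read().
request = build_extract_request(b"%PDF-1.4 placeholder", "example.pdf", mode="text")
print(request.get_method(), request.full_url)

# To send it (requires the service to be running):
# import json
# with urllib.request.urlopen(request) as response:
#     chunks = json.load(response)["chunks"]
```

A third-party HTTP client would shorten this considerably; the hand-built body is used here only to keep the example dependency-free.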

Code Overview

- main.py: Defines the FastAPI app and the /extract endpoint. Delegates PDF processing to service classes.
- models.py: Contains Pydantic models for API responses.
- services/pdf_service.py: Handles text extraction and chunking from PDFs.
- services/image_service.py: Handles image extraction from PDFs.
- services/chunk_service.py: Contains logic for labeling text chunks, extracting key-value pairs, and converting tables to HTML.
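
To illustrate the delegation described above, here is a rough sketch of how a route handler might pick services based on the mode parameter. It is not the repository's code: ExtractMode, StubPdfService, StubImageService, their method names, and the chunk_id numbering are all assumptions, and the stubs return canned data instead of parsing a PDF.

```python
from enum import Enum


class ExtractMode(str, Enum):
    """The three documented values of the mode query parameter."""
    TEXT = "text"
    IMAGES = "images"
    BOTH = "both"


# Stand-ins for the real service classes; names and return values are hypothetical.
class StubPdfService:
    def extract_chunks(self, pdf_bytes):
        return [{"type": "text", "page": 1, "content": "Hello"}]


class StubImageService:
    def extract_images(self, pdf_bytes):
        return [{"type": "figure", "page": 1, "content": "page1_img1.png"}]


def extract(pdf_bytes, mode=ExtractMode.BOTH, pdf_service=None, image_service=None):
    """Run only the services the mode asks for and number the combined chunks."""
    mode = ExtractMode(mode)  # raises ValueError for an unsupported mode
    pdf_service = pdf_service or StubPdfService()
    image_service = image_service or StubImageService()
    chunks = []
    if mode in (ExtractMode.TEXT, ExtractMode.BOTH):
        chunks.extend(pdf_service.extract_chunks(pdf_bytes))
    if mode in (ExtractMode.IMAGES, ExtractMode.BOTH):
        chunks.extend(image_service.extract_images(pdf_bytes))
    # Assumed id scheme: sequential numbers as strings.
    return {"chunks": [{"chunk_id": str(i), **c} for i, c in enumerate(chunks, start=1)]}


print(extract(b"", mode="images"))
```

Passing the services in as arguments keeps the handler thin and lets tests swap in stubs like the ones above.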

Extending the Service

- Add new chunk types or extraction logic in chunk_service.py.
- Add new endpoints in main.py as needed.
- Use Pydantic models in models.py for request/response validation.
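
As one example of the first point, a new chunk type could be introduced with a small labeling function. The sketch below adds a hypothetical heading type. The function name label_chunk, the heading label, and the heuristic itself (a short single line that is upper case or starts with a section number) are assumptions for illustration and do not reflect the existing logic in chunk_service.py.

```python
import re

# Hypothetical rule: a line numbered like "2.1 Methods" counts as a heading.
_NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def label_chunk(text):
    """Return "heading" for heading-like text, otherwise "text"."""
    stripped = text.strip()
    # Empty, multi-line, or long chunks are treated as ordinary text.
    if not stripped or "\n" in stripped or len(stripped) > 80:
        return "text"
    if stripped.isupper() or _NUMBERED.match(stripped):
        return "heading"
    return "text"


for sample in ["INTRODUCTION", "2.1 Methods", "This sentence is ordinary body text."]:
    print(f"{label_chunk(sample)!r:10} <- {sample}")
```

A rule like this will misfire on short sentences that happen to start with a number, so a real implementation would likely also use font size or position metadata where available.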

Contributing

Contributions are welcome! Please submit a pull request or open an issue to discuss any changes.

License

This project is licensed under the MIT License.
