The PDF Parser API is an intelligent and modular document processing service that converts unstructured PDF files into structured, machine-readable data. It automatically detects and extracts text blocks, tables, key-value pairs, and images, and returns the results in a hierarchical JSON format, complete with page-level and spatial metadata.
The API supports selective extraction modes: you can extract only text, only images, or both, depending on your use case.
Built using FastAPI and organized with a service-layer architecture, the project is cleanly structured for maintainability, scalability, and ease of integration.
This API is particularly powerful when used in Agentic AI systems and RAG (Retrieval-Augmented Generation) pipelines, where accurate understanding of document layout and semantics is crucial for intelligent retrieval and contextual reasoning.
- Extracts text, tables, and key-value pairs from PDF documents
- Extracts images (figures) from PDF documents
- Returns structured JSON output with chunk types and page numbers
- Modular codebase for easy extension and maintenance
pdfparser/
│
├── main.py # FastAPI app and API routes
├── models.py # Pydantic models for request/response
├── requirements.txt # Python dependencies
└── services/
    ├── chunk_service.py # Text chunk labeling, table detection, key-value extraction
    ├── image_service.py # PDF image extraction logic
    └── pdf_service.py # PDF text extraction logic
- Python 3.8+
- Create a virtual environment
    python -m venv .venv
- Activate the virtual environment
    On Windows:
    .venv\Scripts\activate
    On macOS/Linux:
    source .venv/bin/activate
- Install dependencies
    pip install -r requirements.txt

Start the FastAPI server using Uvicorn:
    uvicorn main:app --reload
The API will be available at http://127.0.0.1:8000.
Method: POST
Description: Extracts text and/or images from an uploaded PDF file.
Query Parameters:
- mode: text, images, or both (default: both)
Request:
- file: PDF file to upload (as form-data)
Response:
- JSON object with a chunks list, each containing:
  - chunk_id: Unique identifier
  - type: text, table, figure, or marginalia
  - page: Page number
  - content: Extracted content (text, HTML table, or image filename)
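To make the response shape concrete, here is a minimal client-side sketch that parses a body of the form described above and groups chunks by type. It uses only the standard library. The `Chunk` dataclass, the helper names, and the sample payload are illustrative assumptions, not the project's actual Pydantic models or real API output. In particular, the type of `chunk_id` is not stated in this document, so it is coerced to a string here.

```python
import json
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    chunk_id: str  # assumption: treated as a string; the API may use another type
    type: str      # "text", "table", "figure", or "marginalia"
    page: int
    content: str   # text, HTML table, or image filename


def parse_chunks(response_body: str) -> List[Chunk]:
    """Parse the JSON body of an /extract response into Chunk objects."""
    payload = json.loads(response_body)
    return [
        Chunk(
            chunk_id=str(item["chunk_id"]),
            type=item["type"],
            page=int(item["page"]),
            content=item["content"],
        )
        for item in payload["chunks"]
    ]


def group_by_type(chunks: List[Chunk]) -> Dict[str, List[Chunk]]:
    """Bucket chunks by type so text, tables, and figures can be handled separately."""
    grouped: Dict[str, List[Chunk]] = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.type].append(chunk)
    return dict(grouped)


# Hypothetical payload for illustration only.
sample = json.dumps({
    "chunks": [
        {"chunk_id": "c1", "type": "text", "page": 1, "content": "Quarterly summary"},
        {"chunk_id": "c2", "type": "table", "page": 1, "content": "<table><tr><td>42</td></tr></table>"},
        {"chunk_id": "c3", "type": "figure", "page": 2, "content": "page2_img1.png"},
    ]
})

grouped = group_by_type(parse_chunks(sample))
print(sorted(grouped))
```

Grouping by type is useful in a RAG pipeline, where HTML tables and image filenames usually need different handling from plain text before indexing.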
Example using curl:
curl -X POST "http://127.0.0.1:8000/extract?mode=both" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@path/to/your.pdf"

- main.py: Defines the FastAPI app and the /extract endpoint. Delegates PDF processing to service classes.
- models.py: Contains Pydantic models for API responses.
- services/pdf_service.py: Handles text extraction and chunking from PDFs.
- services/image_service.py: Handles image extraction from PDFs.
- services/chunk_service.py: Contains logic for labeling text chunks, extracting key-value pairs, and converting tables to HTML.
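The delegation described above, where the endpoint hands work to services according to the `mode` parameter, could look roughly like the following framework-free sketch. The names `Mode`, `run_extraction`, `fake_text`, and `fake_images` are hypothetical; the real service classes and their method signatures are not shown in this document, so the extractors are passed in as plain callables.

```python
from enum import Enum
from typing import Callable, Dict, List

Chunk = Dict[str, object]
Extractor = Callable[[bytes], List[Chunk]]


class Mode(str, Enum):
    TEXT = "text"
    IMAGES = "images"
    BOTH = "both"


def run_extraction(
    pdf_bytes: bytes,
    mode: str,
    extract_text: Extractor,
    extract_images: Extractor,
) -> Dict[str, List[Chunk]]:
    """Call only the extractors that the requested mode needs."""
    selected = Mode(mode)  # raises ValueError for an unsupported mode
    chunks: List[Chunk] = []
    if selected in (Mode.TEXT, Mode.BOTH):
        chunks.extend(extract_text(pdf_bytes))
    if selected in (Mode.IMAGES, Mode.BOTH):
        chunks.extend(extract_images(pdf_bytes))
    return {"chunks": chunks}


# Stand-in extractors; the real ones would live in the services/ modules.
def fake_text(pdf_bytes: bytes) -> List[Chunk]:
    return [{"chunk_id": "t1", "type": "text", "page": 1, "content": "hello"}]


def fake_images(pdf_bytes: bytes) -> List[Chunk]:
    return [{"chunk_id": "i1", "type": "figure", "page": 1, "content": "img1.png"}]


result = run_extraction(b"%PDF-", "both", fake_text, fake_images)
print([chunk["type"] for chunk in result["chunks"]])
```

Keeping the dispatch logic independent of FastAPI in this way makes it testable without starting a server, which is one benefit of a service-layer split.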
- Add new chunk types or extraction logic in chunk_service.py.
- Add new endpoints in main.py as needed.
- Use Pydantic models in models.py for request/response validation.
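As one possible way to add a chunk type, the sketch below uses a small registry of labeling functions, so a new type becomes one extra function. This is an assumption about how such logic could be organized, not a description of the existing `chunk_service.py`. The `heading` type, the `register` decorator, and both heuristics are hypothetical.

```python
import re
from typing import Callable, List, Optional

# A labeler returns a chunk type, or None when it does not recognize the text.
Labeler = Callable[[str], Optional[str]]
LABELERS: List[Labeler] = []


def register(labeler: Labeler) -> Labeler:
    """Add a labeler to the registry; labelers run in registration order."""
    LABELERS.append(labeler)
    return labeler


@register
def label_marginalia(text: str) -> Optional[str]:
    # Heuristic: bare page markers such as "12" or "Page 3 of 10".
    pattern = r"(page\s+)?\d+(\s+of\s+\d+)?"
    if re.fullmatch(pattern, text.strip(), flags=re.IGNORECASE):
        return "marginalia"
    return None


@register
def label_heading(text: str) -> Optional[str]:
    # Hypothetical new chunk type: short lines written entirely in upper case.
    stripped = text.strip()
    if len(stripped) <= 60 and stripped.isupper():
        return "heading"
    return None


def label_chunk(text: str) -> str:
    """Return the first matching label, falling back to plain text."""
    for labeler in LABELERS:
        label = labeler(text)
        if label is not None:
            return label
    return "text"


print(label_chunk("INTRODUCTION"))
```

If a new type is added this way, the response documentation and any response model that restricts the `type` field would need the new value as well.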
Contributions are welcome! Please submit a pull request or open an issue to discuss any changes.
This project is licensed under the MIT License.