PDF Extraction API

PDF Extraction API is a simple API built on top of the Marker PDF library. It parses PDF files, converts them to Markdown, extracts their images, and returns the parsed content through either a local run or a web endpoint.

Installation

Clone the repository: git clone https://github.com/satish860/PDF-Extraction-API.git
Install the required dependencies: pip install modal marker-pdf hf_transfer opencv-python-headless pillow

Usage

Local Endpoint

To run the PDF parser and converter locally, use the following command:

modal run app.py

This will execute the main function defined in app.py, which downloads a PDF from a specified URL, processes it using Marker PDF, and saves the parsed content and images locally.

Web Endpoint

To run the PDF parser and converter as a web endpoint, use the following command:

modal serve app.py

This will start a web server and expose the /convert endpoint. You can send a POST request to this endpoint with the PDF chunk as a base64-encoded string in the request body. The endpoint will process the PDF using Marker PDF and return the parsed Markdown content, base64-encoded images, and metadata as a JSON response.
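As a rough illustration of the request described above, the following client sketch base64-encodes a PDF and posts it as JSON using only the Python standard library. The URL is a placeholder (modal serve prints the real one when it starts), and the "pdf" field name is an assumption that is not confirmed by this README; check app.py for the name the endpoint actually expects.

```python
import base64
import json
import urllib.request

# Placeholder URL: replace with the address printed by `modal serve app.py`.
CONVERT_URL = "https://example.modal.run/convert"


def build_request(pdf_bytes: bytes, url: str = CONVERT_URL) -> urllib.request.Request:
    """Wrap PDF bytes in a JSON POST request with a base64-encoded body."""
    # "pdf" is an assumed field name, not taken from app.py.
    payload = {"pdf": base64.b64encode(pdf_bytes).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(pdf_bytes: bytes, url: str = CONVERT_URL) -> dict:
    """Send the request and decode the JSON response into a dict."""
    with urllib.request.urlopen(build_request(pdf_bytes, url)) as response:
        return json.load(response)
```

With a running server, calling convert(open("sample.pdf", "rb").read(), url) should return the JSON response as a dictionary; the keys under which the Markdown, images, and metadata appear depend on app.py.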

Configuration

Before running the application, make sure to set the following environment variables:

export HF_HUB_ENABLE_HF_TRANSFER=1

export TRANSFORMERS_CACHE=/data/transformers_cache

export HF_HOME=/data/hf_home

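These environment variables configure the Hugging Face tooling that the Marker PDF library relies on to fetch its models: HF_HUB_ENABLE_HF_TRANSFER turns on faster downloads through the hf_transfer package, while TRANSFORMERS_CACHE and HF_HOME set the directories where downloaded models are cached.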
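If exporting the variables in the shell is inconvenient, the same values can be set from Python before any Hugging Face library is imported. This is a minimal sketch rather than part of app.py; the helper name apply_hf_env is hypothetical, and the values simply mirror the export commands above.

```python
import os

# Same values as the export commands in this section.
HF_ENV = {
    "HF_HUB_ENABLE_HF_TRANSFER": "1",
    "TRANSFORMERS_CACHE": "/data/transformers_cache",
    "HF_HOME": "/data/hf_home",
}


def apply_hf_env(env=os.environ):
    """Set each variable unless it is already defined, then return the effective values."""
    for name, value in HF_ENV.items():
        # setdefault keeps a value that was already exported in the shell.
        env.setdefault(name, value)
    return {name: env[name] for name in HF_ENV}


if __name__ == "__main__":
    # Call this before importing marker or transformers so the settings are picked up.
    print(apply_hf_env())
```

Using setdefault means a value already exported in the shell takes precedence over the defaults in the script.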

License

This project is licensed under the MIT License.
