Convert PDFs (including scanned ones) to structured Markdown using local LLMs via Ollama — 100% Julia, no cloud, no API keys.

gwangjinkim/PDF2Markdown.jl

PDF2Markdown.jl

Convert scanned and regular PDFs into structured Markdown using local LLMs (via Ollama): 100% Julia, no cloud, no API keys, no nonsense.

OCR, layout parsing, and Markdown generation in one clean Julia pipeline.


Features

  • Converts each page of a PDF to high-resolution PNG images using Poppler_jll
  • Sends images to a local Ollama instance (e.g., gemma3:12b)
  • 100% local — no internet required, your data stays on your machine
  • Outputs structured, clean Markdown
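
The page-rendering step can be sketched with Poppler's pdftoppm tool, which Poppler_jll ships. This is a minimal illustration, not the package's actual code: the helper name render_cmd and the 300 dpi default are assumptions made for the example.

```julia
# Hypothetical helper: build (but do not run) a pdftoppm command line.
# `exe` is the pdftoppm executable; with Poppler_jll loaded it would be `pdftoppm()`.
# The 300 dpi default is an assumed value for "high-resolution".
function render_cmd(exe, pdf_path, out_prefix; dpi = 300)
    # -png selects PNG output, -r sets the resolution in dots per inch
    return `$exe -png -r $dpi $pdf_path $out_prefix`
end

cmd = render_cmd("pdftoppm", "document.pdf", "pages/page")
println(cmd)
# run(cmd) would write one PNG per page, named from the prefix.
# The output directory ("pages" here) has to exist beforehand.
```

Building the command separately from running it makes the chosen resolution and output prefix easy to inspect before any rendering happens.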

Installation

using Pkg
Pkg.add(url="https://github.com/gwangjinkim/PDF2Markdown.jl")

Or clone the repository manually and add it to your environment with Pkg.develop.


Usage

using PDF2Markdown

text = extract_text_from_pdf("path/to/your.pdf")
println(text)

Ollama must be installed and running locally, with a model such as gemma3:12b already pulled.
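
Before converting, it can help to confirm that something is listening on Ollama's default port, 11434. The sketch below is not part of PDF2Markdown.jl. The helper name ollama_reachable is hypothetical, and it only checks that the TCP port accepts connections, not that a model has been pulled.

```julia
using Sockets  # standard library

# Hypothetical helper: true if a TCP connection to host:port succeeds.
# 11434 is Ollama's default port; adjust it if your instance is configured differently.
function ollama_reachable(host = "127.0.0.1", port = 11434)
    try
        close(connect(host, port))
        return true
    catch
        return false  # connection refused or host unreachable
    end
end

ollama_reachable() || @warn "Nothing is listening on port 11434; is Ollama running?"
```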


Requirements

  • Julia ≥ 1.9
  • Ollama running locally
  • A pulled model like gemma3:12b or gemma3:4b
  • A working Poppler installation, provided by Poppler_jll

Example

text = extract_text_from_pdf("document.pdf")
write("output.md", text)

Notes

  • Internally uses:
    • Poppler_jll to convert PDF pages to images
    • Base64 to encode images for Ollama
    • HTTP.jl and JSON3.jl to communicate with the LLM
  • If you're converting large PDFs, consider batching pages.
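
The encoding and batching steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's internals: the helper names ollama_payload and page_batches, the prompt text, and the batch size are all hypothetical. The payload shape follows Ollama's /api/generate endpoint, which accepts base64-encoded images in an images array.

```julia
using Base64  # standard library

# Hypothetical helper: build the request body for one page image.
function ollama_payload(png_bytes::Vector{UInt8};
                        model = "gemma3:12b",
                        prompt = "Convert this page to Markdown.")
    return Dict(
        "model"  => model,
        "prompt" => prompt,
        "images" => [base64encode(png_bytes)],  # images are sent as base64 strings
        "stream" => false,                      # request one complete JSON response
    )
end

# Hypothetical helper: split page image paths into fixed-size batches,
# so a large PDF can be processed a few pages at a time.
page_batches(paths; batch_size = 8) =
    collect(Iterators.partition(paths, batch_size))

# With HTTP.jl and JSON3.jl loaded, one page could be sent roughly like this:
#   resp = HTTP.post("http://localhost:11434/api/generate",
#                    ["Content-Type" => "application/json"],
#                    JSON3.write(ollama_payload(read("page-1.png"))))
#   markdown = JSON3.read(resp.body).response
```

Processing one batch at a time and appending each batch's Markdown to the output file keeps memory use bounded and lets a long conversion resume after a failure.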

Related Project

🔗 Python version: pdf2md-ollama


Credits

Inspired by this article on Medium


License

MIT
