Convert PDFs, clean Markdown, inspect chunks, and enrich metadata for reliable RAG pipelines.
If you like this project, a star ⭐️ would mean a lot :)
Chunky is a local, open-source workspace for preparing documents for Retrieval-Augmented Generation (RAG). It combines PDF-to-Markdown conversion, Markdown cleanup, chunk visualization, chunking strategy comparison, and LLM-powered enrichment in one workflow.
Most RAG failures start before retrieval: broken tables, scrambled layouts, noisy Markdown, or chunks that look fine in code but fail in context. Chunky makes those steps visible so you can inspect and fix the document before it reaches your vector store.
As NVIDIA's research shows, no chunking strategy wins universally. Chunky helps you compare strategies on the actual document instead of treating chunking as a hidden parameter.
New to RAG? Check out Agentic RAG for Dummies — a hands-on implementation of Agentic RAG.
| 📄 Document review workspace | Compare PDF, Markdown, and chunks side by side before indexing |
| ✨ Multiple conversion engines | PyMuPDF, Docling, MarkItDown, LiteParse, VLM, and Cloud API support |
| 📦 Batch processing | Convert, enrich, and chunk multiple documents from the sidebar |
| ✂️ Chunking strategy comparison | Test LangChain, Chonkie, and Docling splitters with configurable size and overlap |
| 💾 Saved chunk versions | Persist and reload chunk sets by Markdown source and splitter configuration |
| 🧠 Markdown enrichment | Clean conversion artifacts with deterministic cleanup plus LLM correction |
| ✨ Chunk enrichment | Generate context-aware titles, summaries, keywords, and retrieval questions |
| 🔌 Pluggable backend | Add converters or splitters through the registry without frontend changes |
Two ways to run Chunky: locally or with Docker.
git clone https://github.com/GiovanniPasq/chunky.git
cd chunky
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
./start_all.shgit clone https://github.com/GiovanniPasq/chunky.git
cd chunky
docker compose up --build| Service | URL |
|---|---|
| Frontend | http://localhost:5173 |
| Backend | http://localhost:8000 |
| Swagger | http://localhost:8000/docs |
No single converter wins on every document type. Chunky ships with six — switch between them in the UI and re-convert without losing your settings.
| Converter | Library | Best for |
|---|---|---|
| PyMuPDF | pymupdf4llm |
Fast conversion of standard digital PDFs with selectable text |
| Docling | docling |
Complex layouts: multi-column documents, tables, and figures |
| MarkItDown | markitdown[pdf] |
Broad-format documents, simple and deterministic output |
| LiteParse | liteparse |
Fast, lightweight parsing by LlamaIndex — good for standard documents |
| VLM | openai + any vision model |
Scanned PDFs, handwriting, diagrams — anything a human can read |
| Cloud API | httpx |
POSTs the PDF to a configurable external endpoint and returns the Markdown response body directly |
The VLM converter rasterises each page at 300 DPI and sends it to any OpenAI-compatible vision model. By default it targets a locally running Ollama instance — no API key, no internet access required.
Through the frontend, you can configure the model name, base URL, and API key directly in the UI before requesting a conversion — no code changes needed.
Note: Conversion speed with Docling or a locally running Ollama instance depends heavily on available hardware. On CPU-only machines, both can be significantly slower than on systems with a dedicated GPU.
Ollama configuration: when using a local Ollama instance, the most relevant environment variables are
OLLAMA_NUM_PARALLEL,OLLAMA_MAX_LOADED_MODELS,OLLAMA_KEEP_ALIVE, andOLLAMA_MAX_QUEUE. See the Ollama FAQ for setup instructions.
Chunky supports two splitting libraries, each exposing multiple strategies. The library and strategy are selected independently in the UI.
| Strategy | Description |
|---|---|
| Token | Splits on token boundaries via tiktoken. Ideal for LLM context-window management. |
| Recursive | Tries paragraph → sentence → word boundaries in order. |
| Character | Splits on \n\n paragraphs, falls back to chunk_size characters. |
| Markdown | Two-phase split: H1/H2/H3 headers first, then optional size cap via RecursiveCharacterTextSplitter. |
| Strategy | Description |
|---|---|
| Token | Splits on token boundaries. Fast, no external tokeniser needed. |
| Fast | SIMD-accelerated byte-based chunking at 100+ GB/s. Best for high-throughput pipelines. |
| Sentence | Splits at sentence boundaries. Preserves semantic completeness. |
| Recursive | Recursively splits using structural delimiters (paragraphs → sentences → words). Note: chunk_overlap is not supported. |
| Table | Splits large Markdown tables by row while preserving headers. Ideal for tabular data. |
| Code | Splits source code using AST-based structural analysis. Supports multiple languages. |
| Semantic | Groups content by embedding similarity. Best for preserving topical coherence. |
| Neural | Uses a fine-tuned BERT model to detect semantic shifts. Great for topic-coherent chunks. |
Note: The Semantic and Neural strategies download ML models on first use and may be slow to initialise.
| Strategy | Description |
|---|---|
| Hybrid | Hierarchical document-aware chunking with tokenization-aware refinements. Merges undersized chunks and supports header repetition across table splits. |
| Line-Based | Preserves line boundaries with optional repeated prefix per chunk (e.g. table headers). Best for tables, code, and logs where line integrity matters. |
Note: Both Docling strategies operate on
DoclingDocumentobjects and require thedoclinglibrary. The Hybrid strategy downloads a tokenizer model on first use.
Chunky includes an LLM-powered enrichment layer that operates at two levels of the pipeline.
Before chunking, you can run enrichment directly on the converted Markdown. The pipeline:
- Regex pass — automatically corrects common conversion artifacts (broken tables, stray escape characters, malformed headers)
- LLM correction — splits the document into pieces and sends each to an LLM for contextual cleanup, producing coherent, well-structured Markdown
- Summary (optional) — generates a document-level summary used as context during LLM correction
Markdown enrichment is available for both single files and bulk operations, so you can clean an entire batch of converted PDFs in one pass.
After chunking, selected chunks can be enriched via LLM calls. Sidebar bulk enrichment can also enrich saved chunk sets, or chunk first when no matching saved set exists.
Each call analyzes the selected chunk itself and, when available, also receives:
- the cached document-level summary, generated by the Markdown enrichment flow
- a small read-only window of Markdown immediately before and after the chunk
Those extra inputs help the model disambiguate names, acronyms, headings, and section intent without copying neighboring text into the enriched chunk. The pipeline populates the following fields:
| Field | Description |
|---|---|
cleaned_chunk |
Cleaned and normalized version of the original text |
title |
Short descriptive title for the chunk |
context |
One sentence describing where the chunk fits within the broader document |
summary |
One sentence summary of the chunk content |
keywords |
Array of relevant keyword strings |
questions |
Array of questions this chunk could answer |
The context field is inspired by Anthropic's Contextual Retrieval technique, which shows that prepending a short chunk-specific context can reduce retrieval failure rates by up to 49%.
The questions field addresses a complementary problem: pre-generating the questions a chunk can answer produces embeddings much closer to real user queries at retrieval time, as highlighted in the Microsoft Azure RAG enrichment guide.
The converter and chunker layers use a decorator-based registry: adding a new converter or chunker automatically exposes it through the /api/capabilities endpoint and the UI — no frontend changes needed.
Every converter inherits from PDFConverter (backend/converters/base.py):
from abc import ABC, abstractmethod
from pathlib import Path
class PDFConverter(ABC):
@abstractmethod
def convert(self, pdf_path: Path) -> str:
"""Convert a PDF to a Markdown string."""
def validate_path(self, pdf_path: Path) -> None:
if not pdf_path.exists():
raise FileNotFoundError(f"PDF file not found: {pdf_path}")1. Create a new file in backend/converters/ and decorate the class:
# backend/converters/my_converter.py
from pathlib import Path
from backend.registry import register_converter
from .base import PDFConverter
@register_converter(
name="my_converter",
label="My Converter",
description="Short description shown in the UI.",
)
class MyConverter(PDFConverter):
def __init__(self) -> None:
from my_library import MyParser
self._parser = MyParser()
def convert(self, pdf_path: Path) -> str:
self.validate_path(pdf_path)
return self._parser.to_markdown(str(pdf_path))2. Import it in capabilities_router.py:
import backend.converters.my_converter # noqa: F401 — side-effect importDone. The new converter appears automatically in /api/capabilities and the UI.
from backend.registry import register_chunker
@register_chunker(
library="my_lib",
library_label="My Library",
strategy="my_strategy",
label="My Strategy",
description="Short description shown in the UI.",
)
def _chunk_my_strategy(self, request: ChunkRequest) -> list[ChunkItem]:
splits = my_chunker.split(request.content, request.chunk_size)
return self.build_chunks(request.content, splits, request.chunk_overlap)Import the module in capabilities_router.py and add the strategy to the chunker's _DISPATCH table. The strategy appears in the UI automatically.

