This project is an AI-powered PDF Parser & Summarizer that goes beyond the basic requirement of extracting structured content from PDFs.
- It parses PDF documents into a well-structured JSON format (capturing sections, paragraphs, and tables).
- It also integrates a Hugging Face summarization model (sshleifer/distilbart-cnn-12-6, a distilled BART model) to generate concise summaries of the extracted text and tables.
This makes the tool useful for anyone who needs both structured data extraction and an AI-generated summary, which parsers that only extract raw content do not offer.
- 🚀 Live Demo (Streamlit App): pdfparsersummarizer.streamlit.app
- 🤗 Hugging Face Model Used: sshleifer/distilbart-cnn-12-6
- 📄 About the Project: README
- 📂 PDF Parsing → Extracts paragraphs, sections, and tables with page-level hierarchy.
- 📝 AI Summarization (USP) → Generates concise summaries using a Hugging Face Transformer model.
- 📊 Metadata Insights → Displays number of pages, extracted paragraphs, and word count.
- ⬇️ Export Options → Download parsed JSON and summary as files.
- 🌐 Streamlit Web App → User-friendly, interactive interface.
- ⚡ Robust Parsing → Handles multiple content formats (text + tables).
- 🎨 Clean UI → JSON viewer, summary tab, and interactive metrics.
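
The metadata figures listed above (pages, paragraphs, word count) could be derived from the parsed output roughly as follows. This is a minimal sketch: the JSON layout (a `pages` list whose entries hold `paragraphs`) and the function name `compute_metadata` are assumptions made for illustration, not the actual schema produced by parser.py.

```python
def compute_metadata(parsed: dict) -> dict:
    """Count pages, paragraphs, and words in a parsed-document dict.

    Assumes a hypothetical layout: {"pages": [{"paragraphs": [str, ...]}, ...]}.
    """
    pages = parsed.get("pages", [])
    # Flatten the per-page paragraph lists into one list.
    paragraphs = [p for page in pages for p in page.get("paragraphs", [])]
    # Whitespace-separated tokens are used as a simple word count.
    word_count = sum(len(p.split()) for p in paragraphs)
    return {
        "pages": len(pages),
        "paragraphs": len(paragraphs),
        "words": word_count,
    }


example = {
    "pages": [
        {"paragraphs": ["Quarterly revenue grew.", "Costs were flat."]},
        {"paragraphs": ["Outlook remains stable for next year."]},
    ]
}
print(compute_metadata(example))
```

In a Streamlit app, each of the three values could then be passed to `st.metric` to render a metric card.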
| Category | Technologies |
|---|---|
| Programming | Python |
| Frontend (UI) | Streamlit |
| NLP Model | Hugging Face Transformers (sshleifer/distilbart-cnn-12-6) |
| Deep Learning | PyTorch |
| PDF Parsing | PyMuPDF (fitz), pdfplumber |
| Utilities | tqdm, sentencepiece |
| Deployment | Streamlit Cloud |
- Upload PDF
  - User uploads any PDF file via the Streamlit app.
- Parsing Stage
  - parser.py uses PyMuPDF and pdfplumber to:
    - Extract text and detect sections/sub-sections.
    - Identify and extract tables.
    - Structure everything into a clean JSON format with metadata.
- Summarization Stage (USP)
  - summarizer.py loads the Hugging Face model sshleifer/distilbart-cnn-12-6.
  - Text is tokenized and either summarized directly (short docs) or chunked into parts (long docs).
  - Extracted tables are included as table snippets in the summary.
  - A meta-summary condenses chunked outputs into a final concise overview.
- Visualization & Output
  - Parsed JSON → displayed in an expandable JSON viewer.
  - AI Summary → shown in a dedicated summary tab.
  - Metadata → displayed with Streamlit metric cards.
  - Both JSON and summary → available for download.
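
The parsing stage described above might look something like the sketch below. It is not the code in parser.py: the function names, the font-size heading heuristic (`heading_size`), and the output schema are all assumptions for illustration. Each extracted line is treated as a paragraph for simplicity. The PyMuPDF and pdfplumber imports are done lazily, so the grouping logic can be tried on hand-written input without either library installed.

```python
import json


def extract_lines(pdf_path: str) -> list:
    """Return one list of (text, font_size) tuples per page, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the demo below runs without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            lines = []
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no "lines"
                    spans = line["spans"]
                    text = "".join(s["text"] for s in spans).strip()
                    if text:
                        lines.append((text, max(s["size"] for s in spans)))
            pages.append(lines)
    return pages


def extract_tables(pdf_path: str) -> list:
    """Return one list of tables per page, using pdfplumber."""
    import pdfplumber  # imported lazily for the same reason

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_tables() for page in pdf.pages]


def group_into_sections(lines: list, heading_size: float = 14.0) -> list:
    """Treat any line at or above heading_size as a section heading (a heuristic)."""
    sections = []
    current = {"section": None, "paragraphs": []}
    for text, size in lines:
        if size >= heading_size:
            # A new heading closes the section collected so far, if any.
            if current["section"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"section": text, "paragraphs": []}
        else:
            current["paragraphs"].append(text)
    if current["section"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections


def build_document(pages_lines: list, pages_tables: list) -> dict:
    """Assemble a page-level structure (hypothetical schema)."""
    return {
        "pages": [
            {"page": number, "sections": group_into_sections(lines), "tables": tables}
            for number, (lines, tables) in enumerate(zip(pages_lines, pages_tables), start=1)
        ]
    }


# Demo with hand-written input, so no PDF or third-party library is needed.
demo_lines = [[
    ("Introduction", 16.0),
    ("First paragraph.", 11.0),
    ("Results", 16.0),
    ("Second paragraph.", 11.0),
]]
demo_tables = [[[["Year", "Revenue"], ["2023", "10"]]]]
document = build_document(demo_lines, demo_tables)
print(json.dumps(document, indent=2))
```

For a real file, `build_document(extract_lines("sample.pdf"), extract_tables("sample.pdf"))` would produce the same structure. A fixed font-size threshold is a crude way to detect headings; a real parser would likely need something more adaptive.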
PDFParserSummarizer/
│── About_project.pdf # Project description document
│── README.md # Documentation
│── app.py # Streamlit frontend
│── app_preview.png # UI preview screenshot
│── json_preview.png # JSON viewer preview
│── metadata_preview.png # Metadata tab preview
│── parser.py # PDF parsing logic
│── requirements.txt # Dependencies
│── sample.pdf # Sample PDF for testing
│── summarizer.py # Hugging Face summarization logic
└── summary_preview.png # Summary tab preview
- 📚 Research Papers → Parse and summarize lengthy academic PDFs.
- 📈 Business Reports → Extract tables + text, then summarize into insights.
- 🏛️ Legal Documents → Get concise summaries of contracts or case files.
- 📰 Articles/Whitepapers → Quickly digest long documents.
- 🗄️ General Archival → Store both structured JSON and human-readable summary.
git clone https://github.com/your-username/pdf-parser-summarizer.git
cd pdf-parser-summarizer
python -m venv venv
source venv/bin/activate # On Mac/Linux
venv\Scripts\activate # On Windows
pip install -r requirements.txt
streamlit run app.py
Unlike typical PDF parsers that only extract raw content, this project integrates AI-powered summarization.
- Summarization works seamlessly with extracted text and tables.
- Handles long documents by splitting them into chunks, summarizing each chunk, and condensing the results into a meta-summary.
- Produces clear, concise insights in addition to structured JSON.
This combination of Parsing + Summarization makes the project stand out as a Document AI system, not just a parser.
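
The chunk-then-condense approach for long documents can be sketched as below. This is an illustration under stated assumptions, not the code in summarizer.py: chunks are split by word count rather than by model tokens, and the function names, the 400-word limit, and the `max_length`/`min_length` values are all hypothetical. The summarizer is passed in as a plain callable so the control flow can be checked without downloading a model. `make_hf_summarizer` assumes a Transformers version that ships the `summarization` pipeline task.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_document(text: str, summarize: Callable[[str], str], max_words: int = 400) -> str:
    """Summarize directly if short; otherwise summarize chunks, then their summaries."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])
    partial = [summarize(chunk) for chunk in chunks]
    # Meta-summary: one more pass over the concatenated chunk summaries.
    return summarize(" ".join(partial))


def make_hf_summarizer(model_name: str = "sshleifer/distilbart-cnn-12-6") -> Callable[[str], str]:
    """Wrap a Hugging Face summarization pipeline as a str -> str callable."""
    from transformers import pipeline  # lazy import: heavy dependency

    pipe = pipeline("summarization", model=model_name)
    return lambda text: pipe(text, max_length=130, min_length=30, truncation=True)[0]["summary_text"]


# Stand-in summarizer used only to exercise the control flow.
def first_five_words(text: str) -> str:
    return " ".join(text.split()[:5])


long_text = " ".join(f"word{i}" for i in range(1000))
print(summarize_document(long_text, first_five_words, max_words=400))
```

With the real model, the call would be `summarize_document(text, make_hf_summarizer())`. For very long documents the joined chunk summaries may themselves exceed the model's input limit; `truncation=True` cuts the excess in this sketch, where a fuller implementation might condense recursively instead.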
Anushka Sharma
🌐 LinkedIn • 🐱 GitHub
🎓 Learning Data Science, Analytics & Machine Learning
If you found this project helpful or inspiring:
- ⭐ Star this repository
- 🛠️ Fork it to build upon or adapt it for your own use
- 💬 Share feedback or suggestions via Issues/Discussions



