📄 PDF Parser & Summarizer | Document AI | NLP | Data Extraction

🔰 Introduction

This project is an AI-powered PDF Parser & Summarizer that goes beyond extracting structured content from PDFs.

It parses PDF documents into structured JSON that captures sections, paragraphs, and tables.
It also uses a Hugging Face summarization model (`sshleifer/distilbart-cnn-12-6`) to generate concise summaries of the extracted text and tables.

The tool is aimed at anyone who needs both structured data extraction and a readable summary, which a parser alone does not provide.

🔗 Links

🚀 Live Demo (Streamlit App): pdfparsersummarizer.streamlit.app
🤗 Hugging Face Model Used: sshleifer/distilbart-cnn-12-6
📄 About the Project: README

🖼️ Project Preview

Screenshot: app_preview.png

✨ Features

📂 PDF Parsing → Extracts paragraphs, sections, and tables with page-level hierarchy.
📝 AI Summarization (USP) → Generates concise summaries using a Hugging Face Transformer model.
📊 Metadata Insights → Displays number of pages, extracted paragraphs, and word count.
⬇️ Export Options → Download parsed JSON and summary as files.
🌐 Streamlit Web App → User-friendly, interactive interface.
⚡ Robust Parsing → Handles multiple content formats (text + tables).
🎨 Clean UI → JSON viewer, summary tab, and interactive metrics.
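
As an illustration of the Metadata Insights feature, the counts could be derived from the parsed JSON roughly as follows. This is a minimal sketch: the function name `compute_metadata` and the `pages` / `sections` / `paragraphs` layout are assumptions made for the example, not the actual schema produced by parser.py.

```python
def compute_metadata(parsed: dict) -> dict:
    """Count pages, paragraphs, and words in a parsed document.

    Assumes a hypothetical layout:
    {"pages": [{"sections": [{"heading": ..., "paragraphs": [str, ...]}]}]}
    """
    pages = parsed.get("pages", [])
    paragraphs = [
        paragraph
        for page in pages
        for section in page.get("sections", [])
        for paragraph in section.get("paragraphs", [])
    ]
    return {
        "pages": len(pages),
        "paragraphs": len(paragraphs),
        # Whitespace-separated tokens are used as a simple word count.
        "words": sum(len(paragraph.split()) for paragraph in paragraphs),
    }


sample = {
    "pages": [
        {"sections": [{"heading": "Summary",
                       "paragraphs": ["Quarterly revenue grew.", "Costs were flat."]}]},
        {"sections": [{"heading": None,
                       "paragraphs": ["Outlook remains stable."]}]},
    ]
}

print(compute_metadata(sample))  # {'pages': 2, 'paragraphs': 3, 'words': 9}
```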

🛠️ Tools & Technologies

| Category | Technologies |
| --- | --- |
| Programming | Python |
| Frontend (UI) | Streamlit |
| NLP Model | Hugging Face Transformers (`sshleifer/distilbart-cnn-12-6`) |
| Deep Learning | PyTorch |
| PDF Parsing | PyMuPDF (`fitz`), pdfplumber |
| Utilities | tqdm, sentencepiece |
| Deployment | Streamlit Cloud |

⚙️ How It Works

1. Upload PDF
- User uploads any PDF file via the Streamlit app.
2. Parsing Stage
- parser.py uses PyMuPDF and pdfplumber to:
  - Extract text and detect sections/sub-sections.
  - Identify and extract tables.
  - Structure everything into a clean JSON format with metadata.
3. Summarization Stage (USP)
- summarizer.py loads the Hugging Face model sshleifer/distilbart-cnn-12-6.
- Text is tokenized and either summarized directly (short docs) or chunked into parts (long docs).
- Extracted tables are included as table snippets in the summary.
- A meta-summary condenses chunked outputs into a final concise overview.
4. Visualization & Output
- Parsed JSON → displayed in an expandable JSON viewer.
- AI Summary → shown in a dedicated summary tab.
- Metadata → displayed with Streamlit metric cards.
- Both JSON and summary → available for download.
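
The parsing stage described above could look roughly like the sketch below. It is not the code in parser.py: the function names, the heading heuristic, and the JSON layout are all assumptions chosen for illustration. The only library calls used are PyMuPDF's `page.get_text("blocks")` and pdfplumber's `page.extract_tables()`, imported lazily so the structuring logic can be tried without either package installed.

```python
import json
import re


def extract_page_blocks(pdf_path):
    """Return, per page, the text blocks PyMuPDF finds."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        # Each block is a tuple; item 4 is its text, item 6 its type (0 = text).
        return [[b[4] for b in page.get_text("blocks") if b[6] == 0] for page in doc]
    finally:
        doc.close()


def extract_page_tables(pdf_path):
    """Return, per page, the tables pdfplumber finds (each a list of rows)."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_tables() for page in pdf.pages]


def looks_like_heading(text):
    """Hypothetical heuristic: a short block that is numbered or ALL CAPS."""
    if not text or len(text) > 80:
        return False
    return bool(re.match(r"^\d+(\.\d+)*\s+\S", text)) or text.isupper()


def build_document(page_blocks, page_tables):
    """Group each page's blocks into sections and attach that page's tables."""
    pages = []
    for number, (blocks, tables) in enumerate(zip(page_blocks, page_tables), start=1):
        sections = [{"heading": None, "paragraphs": []}]
        for block in blocks:
            text = " ".join(block.split())  # collapse line breaks inside a block
            if not text:
                continue
            if looks_like_heading(text):
                sections.append({"heading": text, "paragraphs": []})
            else:
                sections[-1]["paragraphs"].append(text)
        # Drop the leading placeholder if nothing preceded the first heading.
        sections = [s for s in sections if s["heading"] or s["paragraphs"]]
        pages.append({"page": number, "sections": sections, "tables": tables})
    return {"metadata": {"pages": len(pages)}, "pages": pages}


# In-memory demo data standing in for one extracted page.
demo_blocks = [["1 Introduction", "This tool parses\nPDF files.", "RESULTS", "It works."]]
demo_tables = [[[["Metric", "Value"], ["Pages", "2"]]]]
document = build_document(demo_blocks, demo_tables)
print(json.dumps(document, indent=2))
```

With PyMuPDF and pdfplumber installed, the same structure can be built from a real file by passing `extract_page_blocks("sample.pdf")` and `extract_page_tables("sample.pdf")` to `build_document`. A heuristic this simple will misclassify some lines (for example, a short sentence that starts with a number), so font size or layout cues would be a natural refinement.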

👀 Preview (App Tabs)

📑 JSON Preview: screenshot (json_preview.png)
📝 Summary: screenshot (summary_preview.png)
📊 Metadata: screenshot (metadata_preview.png)
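
For readers who want to reproduce a similar three-tab layout, here is a hedged sketch using standard Streamlit calls (`st.tabs`, `st.json`, `st.columns`, `metric`, `st.download_button`). The helper names, tab labels, and download file names are hypothetical and are not taken from app.py. The Streamlit module is passed in as `st` so the serialization helper can be exercised on its own.

```python
import json


def to_json_bytes(parsed):
    """Serialize the parsed document into bytes suitable for a download button."""
    return json.dumps(parsed, indent=2, ensure_ascii=False).encode("utf-8")


def render_results(st, parsed, summary, metadata):
    """Draw the three result tabs; `st` is the imported streamlit module."""
    json_tab, summary_tab, metadata_tab = st.tabs(["JSON Preview", "Summary", "Metadata"])

    with json_tab:
        st.json(parsed)  # expandable JSON viewer
        st.download_button("Download JSON", to_json_bytes(parsed),
                           file_name="parsed.json", mime="application/json")

    with summary_tab:
        st.write(summary)
        st.download_button("Download summary", summary,
                           file_name="summary.txt", mime="text/plain")

    with metadata_tab:
        if metadata:
            # One metric card per metadata entry, laid out side by side.
            columns = st.columns(len(metadata))
            for column, (label, value) in zip(columns, metadata.items()):
                column.metric(label, value)


# The serialization helper works without Streamlit installed.
print(to_json_bytes({"pages": 1}).decode("utf-8"))
```

In a Streamlit script this would be called as `render_results(st, parsed, summary, metadata)` after `import streamlit as st` and once an uploaded file has been parsed and summarized.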

📂 Folder Structure

PDFParserSummarizer/
│── About_project.pdf   # Project description document
│── README.md           # Documentation
│── app.py              # Streamlit frontend
│── app_preview.png     # UI preview screenshot
│── json_preview.png    # JSON viewer preview
│── metadata_preview.png # Metadata tab preview
│── parser.py           # PDF parsing logic
│── requirements.txt    # Dependencies
│── sample.pdf          # Sample PDF for testing
│── summarizer.py       # Hugging Face summarization logic
└── summary_preview.png # Summary tab preview

💡 Use Cases

📚 Research Papers → Parse and summarize lengthy academic PDFs.
📈 Business Reports → Extract tables + text, then summarize into insights.
🏛️ Legal Documents → Get concise summaries of contracts or case files.
📰 Articles/Whitepapers → Quickly digest long documents.
🗄️ General Archival → Store both structured JSON and human-readable summary.

⚡ Setup Instructions

1. Clone the Repository

git clone https://github.com/your-username/pdf-parser-summarizer.git
cd pdf-parser-summarizer

2. Create Virtual Environment (Recommended)

python -m venv venv
source venv/bin/activate   # On Mac/Linux
venv\Scripts\activate      # On Windows

3. Install Dependencies

pip install -r requirements.txt

4. Run Locally

streamlit run app.py

🌟 Unique Selling Point (USP)

Unlike typical PDF parsers that only extract raw content, this project integrates AI-powered summarization.

Summarization covers both the extracted text and table snippets.
Long documents are split into chunks, summarized chunk by chunk, and condensed into a meta-summary.
The output is a concise summary alongside the structured JSON.
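
The chunk-then-condense approach can be sketched as follows. This is an illustration under stated assumptions rather than the code in summarizer.py: chunks are cut by word count instead of model tokens, the 400-word limit and generation lengths are arbitrary, and the function names are made up for the example. The summarizer is passed in as a plain function so the control flow can be run with a stand-in.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_document(text, summarize, max_words=400):
    """Summarize short text directly; otherwise summarize chunks, then their summaries."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])
    partial = [summarize(chunk) for chunk in chunks]
    # Meta-summary: condense the per-chunk summaries into one overview.
    return summarize(" ".join(partial))


def load_distilbart():
    """Wrap the Hugging Face summarization pipeline as a text -> summary function."""
    from transformers import pipeline  # imported lazily; loading downloads the model

    pipe = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    def summarize(text):
        # Illustrative length settings; truncation guards against over-long input.
        result = pipe(text, max_length=130, min_length=30, do_sample=False, truncation=True)
        return result[0]["summary_text"]

    return summarize


# Stand-in summarizer for trying the flow without the model: keep the first sentence.
def first_sentence(text):
    return text.split(".")[0].strip() + "."


long_text = "Revenue grew this year. Details follow. " * 300
print(summarize_document(long_text, first_sentence))
```

To use the real model, pass `load_distilbart()` as the `summarize` argument, assuming a transformers version that still provides the `summarization` pipeline. Cutting chunks by tokens instead of words would track the model's input limit more closely.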

This combination of Parsing + Summarization makes the project stand out as a Document AI system, not just a parser.

🙋‍♀️ Author

Anushka Sharma
🌐 LinkedIn • 🐱 GitHub
🎓 Learning Data Science, Analytics & Machine Learning

⭐ Show Your Support

If you found this project helpful or inspiring:

⭐ Star this repository
🛠️ Fork it to build upon or adapt it for your own use
💬 Share feedback or suggestions via Issues/Discussions
