
Commit 55154cf

initial commit

File tree

4 files changed: +199, -0 lines

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
/env
/chat-with-pdf

README.md

Lines changed: 75 additions & 0 deletions

@@ -0,0 +1,75 @@
# Chat with Your PDFs using RAG

This project lets you upload a PDF and ask questions about its content using **Deepseek R1** via **Ollama**. The application processes the PDF, extracts its text, indexes it into a vector store, and retrieves the most relevant context to generate concise answers.

## Features

- 📂 **Upload a PDF**: Select a PDF file to process.
- 🔍 **Text Extraction & Indexing**: Extracts content and indexes it for efficient search.
- 💡 **Question-Answering**: Ask questions related to the PDF content and get relevant answers.
- 🚀 **Powered by Ollama & LangChain**: Uses `Deepseek R1` for embeddings and responses.

## Installation

### Prerequisites

- Python 3.8+
- [Ollama](https://ollama.com) installed
- Dependencies installed via pip
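
Before running the app, make sure the default model is available to Ollama, e.g. by running `ollama pull deepseek-r1:8b` (see "How to change the model" below if you want a different one).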

### Setup

1. Clone this repository:

   ```sh
   git clone https://github.com/hasan-py/chat-with-pdf-RAG.git
   cd chat-with-pdf-RAG
   ```

2. Activate your Python environment and install the dependencies:

   ```sh
   pip install -r requirements.txt
   ```

3. Run the Streamlit app:

   ```sh
   streamlit run pdf_rag.py
   ```

## How It Works

1. **Upload a PDF**: Use the UI to upload a document.
2. **Processing**: The app extracts text and chunks it for indexing.
3. **Ask Questions**: Enter a question in the chat box.
4. **Get Answers**: The system retrieves relevant text and responds concisely (see the sketch below).
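
If you prefer to see this flow as plain Python, here is a minimal sketch mirroring the calls in `pdf_rag.py`. The PDF path and question are placeholders, the prompt is abbreviated, and Ollama must be running with `deepseek-r1:8b` pulled:

```python
# Minimal sketch of the RAG flow (mirrors pdf_rag.py, minus the Streamlit UI).
# Assumptions: Ollama is running, deepseek-r1:8b is pulled, the PDF path is a placeholder.
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaEmbeddings
from langchain_ollama.llms import OllamaLLM

documents = PDFPlumberLoader("chat-with-pdf/pdfs/example.pdf").load()  # extract text
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
).split_documents(documents)  # chunk for indexing

vector_store = InMemoryVectorStore(OllamaEmbeddings(model="deepseek-r1:8b"))
vector_store.add_documents(chunks)  # embed and index the chunks

question = "What is this document about?"  # placeholder question
related = vector_store.similarity_search(question)  # retrieve relevant chunks
context = "\n\n".join(doc.page_content for doc in related)

# Abbreviated stand-in for the prompt template defined in pdf_rag.py
prompt = ChatPromptTemplate.from_template(
    "Answer concisely from the context.\nQuestion: {question}\nContext: {context}\nAnswer:"
)
answer = (prompt | OllamaLLM(model="deepseek-r1:8b")).invoke(
    {"question": question, "context": context}
)
print(answer)
```

The Streamlit app wraps exactly these steps behind the upload widget and chat box.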

## How to change the model

To change the model used for inference, modify the `LLM` variable in `pdf_rag.py`. It is initialized to `deepseek-r1:8b` by default and can be replaced with any other model supported by Ollama.
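
For example (the `llama3.2` tag here is just an illustration; use any model you have pulled locally):

```python
# In pdf_rag.py: swap the default for any locally pulled Ollama model tag
LLM = "llama3.2"  # was: LLM = "deepseek-r1:8b"
```

Since `LLM` configures both `OllamaEmbeddings` and `OllamaLLM`, the replacement model is used for embeddings as well as for answers.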

## File Structure

```
chat-with-pdf/
│── pdfs/             # Directory for uploaded PDFs
│── pdf_rag.py        # Main Streamlit app
│── requirements.txt  # Dependencies
│── README.md         # Documentation
```

## Technologies Used

- **Python**
- **Streamlit** (for UI)
- **LangChain** (for text processing)
- **Ollama** (for LLM inference)
- **PDFPlumber** (for PDF extraction)

## Contributing

Feel free to submit issues and PRs to improve the project!

## Acknowledgments

Special thanks to the creators of **LangChain**, **Ollama**, **Streamlit**, and the **community** for enabling this functionality.

pdf_rag.py

Lines changed: 117 additions & 0 deletions

@@ -0,0 +1,117 @@
import os
import streamlit as st
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_ollama import OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

LLM = "deepseek-r1:8b"

# Prompt template for answering questions
template = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""

# Directory to save uploaded PDFs
pdfs_directory = "chat-with-pdf/pdfs/"

# Ensure the directory exists
os.makedirs(pdfs_directory, exist_ok=True)

# Initialize embeddings and model (the same Ollama model serves both roles)
embeddings = OllamaEmbeddings(model=LLM)
model = OllamaLLM(model=LLM)

# Initialize vector store (rebuilt from scratch each time a PDF is indexed)
vector_store = None


def upload_pdf(file):
    """Save the uploaded PDF to the specified directory."""
    try:
        file_path = os.path.join(pdfs_directory, file.name)
        with open(file_path, "wb") as f:
            f.write(file.getbuffer())
        return file_path
    except Exception as e:
        st.error(f"Error saving file: {e}")
        return None


def load_pdf(file_path):
    """Load the content of the PDF using PDFPlumberLoader."""
    try:
        loader = PDFPlumberLoader(file_path)
        return loader.load()
    except Exception as e:
        st.error(f"Error loading PDF: {e}")
        return None


def split_text(documents):
    """Split the documents into smaller chunks for indexing."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200, add_start_index=True
    )
    return text_splitter.split_documents(documents)


def index_docs(documents):
    """Index the documents in the vector store."""
    global vector_store
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents)


def retrieve_docs(query):
    """Retrieve relevant documents based on the query."""
    return vector_store.similarity_search(query)


def answer_question(question, documents):
    """Generate an answer to the question using the retrieved documents."""
    context = "\n\n".join([doc.page_content for doc in documents])
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model
    return chain.invoke({"question": question, "context": context})


# Streamlit UI
st.title("Chat with Your PDF")
uploaded_file = st.file_uploader(
    "Upload a PDF file to get started", type="pdf", accept_multiple_files=False
)

if uploaded_file:
    # Save the uploaded PDF
    file_path = upload_pdf(uploaded_file)

    if file_path:
        st.success(f"File uploaded successfully: {uploaded_file.name}")

        # Load and process the PDF
        with st.spinner("Processing PDF..."):
            documents = load_pdf(file_path)
            if documents:
                chunked_documents = split_text(documents)
                index_docs(chunked_documents)
                st.success("PDF indexed successfully! Ask your questions below.")

        # Chat input
        question = st.chat_input("Ask a question about the uploaded PDF:")

        if question:
            st.chat_message("user").write(question)

            with st.spinner("Retrieving relevant information..."):
                related_documents = retrieve_docs(question)
                if related_documents:
                    answer = answer_question(question, related_documents)
                    st.chat_message("assistant").write(answer)
                else:
                    st.chat_message("assistant").write("No relevant information found.")

requirements.txt

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
streamlit
langchain_core
langchain_community
langchain_ollama
pdfplumber
