A Retrieval-Augmented Generation (RAG) system that turns any web document into a searchable knowledge base. Ask questions and get contextual answers grounded in the loaded document, powered by semantic search and GPT-4.
Interactive web interface for document loading and AI-powered question answering
- Universal Web Scraping - Load documents from any URL
- Smart Document Processing - Intelligent text chunking with overlap
- Semantic Search - Vector-based similarity search
- GPT-4 Integration - State-of-the-art answer generation
- Full Transparency - View source documents for every answer
- Real-time Processing - Instant document indexing and querying
- Modern UI - Clean, responsive Streamlit interface
git clone https://github.com/yourusername/rag-langchain-master.git
cd rag-langchain-master

# Create virtual environment
python -m venv venv
# Activate (Windows)
venv\Scripts\activate
# Activate (macOS/Linux)
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_api_key_here

streamlit run apps/web_rag.py

Navigate to http://localhost:8501 and start asking questions!
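If the app fails at startup, a quick preflight check on the API key can narrow things down. The sketch below is a simplified, standard-library-only reader for the `.env` format shown above. The helper names `parse_env` and `resolve_api_key` are hypothetical and not part of this project; the app itself may load the key differently (for example with python-dotenv, which is an assumption, since the source does not say).

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments (simplified .env reader)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(env_text: str = "") -> str:
    """Return the key from the process environment, falling back to the .env text."""
    key = os.environ.get("OPENAI_API_KEY") or parse_env(env_text).get("OPENAI_API_KEY", "")
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is missing or still set to the placeholder")
    return key


# Example with an obviously fake key; in practice env_text would be the contents of .env.
print(parse_env("OPENAI_API_KEY=sk-example")["OPENAI_API_KEY"])
```

Calling `resolve_api_key(open(".env").read())` before launching Streamlit would raise early if the placeholder value was never replaced.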
graph LR
A[Web URL] --> B[Document Loader]
B --> C[Text Splitter]
C --> D[Embeddings]
D --> E[Vector Store]
F[User Question] --> G[Retriever]
G --> E
E --> H[Relevant Chunks]
H --> I[GPT-4]
I --> J[Final Answer]
- Document Ingestion: WebBaseLoader extracts content from URLs
- Text Chunking: Smart splitting with configurable overlap
- Vectorization: HuggingFace embeddings create semantic representations
- Storage: In-memory vector database for lightning-fast retrieval
- Retrieval: Semantic search finds most relevant content
- Generation: GPT-4 synthesizes accurate answers with context
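The vectorization, storage, and retrieval steps above can be pictured with a toy example. This is a minimal sketch, not the project's implementation: `embed` uses plain word counts as a stand-in for the HuggingFace embedding model, and `InMemoryStore` is a hypothetical class that only mimics the idea of an in-memory vector store ranked by cosine similarity.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real sentence-embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Hypothetical minimal vector store: keeps (chunk, vector) pairs in a list."""

    def __init__(self) -> None:
        self._items = []

    def add(self, chunks: list[str]) -> None:
        # Embed each chunk once at indexing time.
        self._items.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query: str, k: int = 2) -> list[str]:
        # Rank every stored chunk by similarity to the query and keep the top k.
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


store = InMemoryStore()
store.add([
    "Streamlit renders the web interface.",
    "Embeddings map text chunks to vectors.",
    "The retriever returns the chunks most similar to the question.",
])
top = store.search("Which chunks does the retriever return?", k=1)
print(top[0])
```

In the real pipeline the retrieved chunks would then be passed to the LLM as context. Dense embeddings capture meaning beyond shared words, which word counts cannot do, so treat this only as an illustration of the ranking step.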
| Component | Technology | Purpose |
|---|---|---|
| Framework | LangChain | LLM application orchestration |
| Frontend | Streamlit | Interactive web interface |
| LLM | OpenAI GPT-4 | Answer generation |
| Embeddings | HuggingFace Transformers | Semantic text representation |
| Vector DB | In-Memory Store | Fast similarity search |
| Loader | WebBaseLoader | Document extraction |
rag-langchain-master/
├── apps/
│   └── web_rag.py          # Main Streamlit application
├── requirements.txt        # Python dependencies
├── .env                    # Environment variables
├── README.md               # This file
├── LICENSE                 # MIT license
└── screenshots/            # Demo images
    └── app-preview.png
RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Characters per chunk
    chunk_overlap=200,     # Overlap between chunks
    separators=["\n\n", "\n", " ", ""]  # Split priorities
)

HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

ChatOpenAI(model_name="gpt-4")  # Configurable model

- Research Assistant: Query academic papers and documentation
- News Analysis: Extract insights from news articles
- Policy Documents: Navigate complex legal/policy texts
- Corporate Knowledge: Build internal knowledge bases
- Educational Content: Interactive learning from web resources
Extend the loader to support PDFs, Word docs, and more:

from langchain_community.document_loaders import PyPDFLoader
# Implementation details...

Upgrade to persistent vector databases:

from langchain_community.vectorstores import Chroma
# Implementation details...

Switch between different LLMs:

from langchain_community.llms import Ollama
# Implementation details...

We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add docstrings to all functions
- Include unit tests for new features
- Update documentation as needed
This project is licensed under the MIT License - see the LICENSE file for details.
ImportError: No module named 'streamlit'
pip install -r requirements.txt

OpenAI API Key Error

# Ensure .env file exists with valid API key
echo "OPENAI_API_KEY=your_key_here" > .env

Performance Issues
- Use smaller chunk sizes for faster processing
- Consider using lighter embedding models
- Implement caching for frequently accessed documents
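To reason about the chunk-size advice above, it helps to see how chunk size and overlap change the number of chunks that must be embedded. The sketch below is a simplified fixed-window splitter, not LangChain's `RecursiveCharacterTextSplitter` (which also prefers to break on the configured separators). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Compare chunk counts for a 5,000-character document under different settings.
document = "x" * 5000
for size, overlap in [(1000, 200), (500, 100), (250, 50)]:
    print(size, overlap, len(chunk_text(document, size, overlap)))
```

Smaller chunks give the LLM shorter context passages, but they also mean more vectors to embed and store, so it is worth measuring both indexing and query time before settling on a value.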
- LangChain - Powerful LLM framework
- OpenAI - GPT-4 API access
- HuggingFace - Open-source transformers
- Streamlit - Rapid web app development
- Open Source Community - Continuous inspiration

