ZenML RAG System

A Retrieval-Augmented Generation (RAG) system built with ZenML pipelines for document question-answering. This is an exploration project to understand how RAG systems work under the hood.

Overview

This RAG system allows you to:

  1. Process documents (PDFs, DOCX, TXT, HTML, etc.) into a vector database
  2. Query the processed documents using natural language
  3. Retrieve the most relevant document chunks for your queries

Features

  • Multi-format Document Support: Process PDFs, Word documents, text files, HTML, and more
  • Smart Text Chunking: Split documents intelligently with customizable chunk sizes
  • Efficient Embedding: Generate embeddings using SentenceTransformers models
  • Fast Vector Search: Use FAISS for efficient similarity search
  • Hybrid Search: Combine semantic search with keyword matching for better results
  • ZenML Integration: Leverage ZenML for pipeline orchestration and reproducibility
  • CLI Interface: Simple command-line interface for document processing and querying

Project Structure

rag_system/
├── src/
│   ├── utils/
│   │   ├── documents_processor.py   # Document loading and processing
│   │   ├── text_splitter.py         # Text chunking
│   │   └── vector_utils.py          # Vector operations utilities
│   ├── models/
│   │   └── embeddings.py            # Embedding models
│   ├── data/
│   │   └── vector_store.py          # Vector storage and retrieval
│   └── pipelines/
│       ├── document_pipeline.py     # Document processing pipeline
│       └── query_pipeline.py        # Query pipeline
├── rag_system.py                    # Main RAG system interface
├── main.py                          # Command-line entry point
└── README.md                        # Documentation
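
The two pipeline files are where ZenML does its orchestration work. The following is a minimal sketch of that pattern only, assuming ZenML's @step/@pipeline decorators (ZenML 0.40+); the step names and bodies here are illustrative stand-ins, not the repository's actual code:

from zenml import pipeline, step

@step
def load_documents(document_path: str) -> list:
    # Stand-in loader; the real one (src/utils/documents_processor.py)
    # handles PDFs, DOCX, HTML, and more.
    with open(document_path, encoding="utf-8") as f:
        return [f.read()]

@step
def split_documents(documents: list, chunk_size: int) -> list:
    # Stand-in splitter; the real chunking logic lives in src/utils/text_splitter.py.
    return [d[i:i + chunk_size] for d in documents for i in range(0, len(d), chunk_size)]

@step
def embed_and_store(chunks: list, storage_path: str) -> int:
    # Stand-in for embedding (src/models/embeddings.py) and FAISS
    # persistence (src/data/vector_store.py).
    return len(chunks)

@pipeline
def document_pipeline(document_path: str, storage_path: str):
    documents = load_documents(document_path)
    chunks = split_documents(documents, chunk_size=1000)
    embed_and_store(chunks, storage_path)

Calling document_pipeline(...) hands execution to whatever orchestrator the active ZenML stack defines, which is what gives the runs their reproducibility.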

Installation

  1. Clone the repository:
git clone https://github.com/MuskanPaliwal/rag-tool-zenml.git
cd rag-tool-zenml
  2. Install the required dependencies:
pip install zenml langchain sentence-transformers faiss-cpu pypdf
  3. For additional document format support:
pip install unstructured

Usage

Command Line Interface

The main.py script provides a simple command-line interface with three modes of operation:

# Process documents
python main.py process --document-path path/to/documents/ --storage-path ./vector_db

# Query documents
python main.py query --storage-path ./vector_db --query "What is the main topic of these documents?"

# Interactive mode (ask multiple questions)
python main.py interactive --storage-path ./vector_db

Options

  • --document-path, -d: Path to document or directory to process
  • --storage-path, -s: Path to store the vector database (default: temporary directory)
  • --query, -q: Query string for searching documents
  • --chunk-size: Size of document chunks (default: 1000)
  • --chunk-overlap: Overlap between chunks (default: 200)
  • --top-k, -k: Number of results to return for queries (default: 3)
  • --embedding-model, -m: Name of embedding model to use (default: "all-MiniLM-L6-v2")
  • --hybrid-search: Use hybrid search combining semantic and keyword matching
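
For example, to process a directory with larger chunks and then query it with hybrid search (these commands only combine the flags documented above):

python main.py process -d path/to/documents/ -s ./vector_db --chunk-size 1500 --chunk-overlap 300
python main.py query -s ./vector_db -q "What is the main topic?" -k 5 --hybrid-search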

Programmatic Usage

You can also use the RAG system programmatically in your Python code:

from rag_system import RAGSystem

# Initialize RAG system
rag = RAGSystem(storage_path="./vector_db")

# Process a document or directory
result = rag.process_documents(
    document_path="path/to/documents/",
    chunk_size=1000,
    chunk_overlap=200
)
print(f"Processed {result['num_chunks']} document chunks")

# Query the processed documents
answer = rag.query(
    query="What is the main topic discussed in these documents?",
    top_k=3,
    hybrid_search=True
)

# Print results
for result in answer['results']:
    print(f"Rank {result['rank']} (Score: {result['score']:.4f})")
    print(f"Content: {result['content']}")
    print(f"Source: {result['source']}")

Customization

Embedding Models

You can use different SentenceTransformers models by changing the embedding_model parameter:

  • all-MiniLM-L6-v2 (default): Fast and balanced
  • all-mpnet-base-v2: Higher quality but slower
  • paraphrase-multilingual-MiniLM-L12-v2: For multilingual support
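
For example, via the documented -m flag:

python main.py process -d path/to/documents/ -s ./vector_db -m all-mpnet-base-v2

Or programmatically (passing embedding_model to the RAGSystem constructor is an assumption based on the parameter name; check rag_system.py for the exact signature):

rag = RAGSystem(storage_path="./vector_db", embedding_model="all-mpnet-base-v2")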

Vector Search

  • Change index_type to "IP" (Inner Product) for cosine similarity instead of L2 distance; note that inner product matches cosine similarity only when embeddings are L2-normalized (see the sketch below)
  • Use hybrid_search=True to combine semantic search with keyword matching
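
As background for the "IP" option, here is a generic FAISS sketch (not this repository's vector_store.py code) of searching with an inner-product index over L2-normalized vectors, which makes the returned scores cosine similarities:

import faiss
import numpy as np

dim = 384  # output dimension of all-MiniLM-L6-v2
vectors = np.random.rand(100, dim).astype("float32")  # stand-in for real embeddings

faiss.normalize_L2(vectors)           # normalize in place so inner product == cosine
index = faiss.IndexFlatIP(dim)        # inner-product ("IP") index
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)  # top-3 neighbors by cosine similarity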

Document Chunking

  • Modify chunk_size and chunk_overlap to optimize for your specific documents (a sketch of how the two interact follows this list)
  • For longer documents, increase chunk size
  • For technical documents, decrease chunk size and increase overlap
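
To see how the two parameters interact, here is a character-level sliding window; the repository's text_splitter.py likely splits more intelligently (e.g., on sentence boundaries), so treat this only as a sketch of the size/overlap arithmetic:

def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    stride = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

# With the defaults, a 2,000-character document produces chunks
# starting at offsets 0, 800, and 1600.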

Integration with LLMs

To create a complete RAG system, integrate with an LLM:

from rag_system import RAGSystem
from openai import OpenAI  # or any other LLM API
import os

# Initialize RAG system
rag = RAGSystem(storage_path="./vector_db")

# Process documents if needed
if not os.path.exists("./vector_db"):
    rag.process_documents("path/to/documents/")

# OpenAI client (reads OPENAI_API_KEY from the environment)
client = OpenAI()

# Query function with LLM integration
def answer_question(query, top_k=3):
    # Get relevant context from RAG system
    results = rag.query(query, top_k=top_k)

    # Prepare context for the LLM
    context = "\n\n".join([r["content"] for r in results["results"]])

    # Create prompt with context
    prompt = f"Answer the question based on the following context:\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"

    # Call LLM API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [r["source"] for r in results["results"]]
    }

# Example usage
result = answer_question("What are the key benefits described in the document?")
print(result["answer"])
print(f"Sources: {result['sources']}")

Troubleshooting

Common Issues

  1. FileNotFoundError: Ensure the document path is correct and accessible.
  2. Memory Issues: For large documents, reduce batch size or chunk size.
  3. CUDA Errors: Set device to 'cpu' in the embeddings module if you encounter GPU-related errors (see the sketch after this list).
  4. Unsupported File Types: Ensure you have the necessary dependencies for all file types (e.g., unstructured for Word documents).
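
For issue 3, the device argument is part of the SentenceTransformers API; the only assumption is where the repository constructs its model (presumably src/models/embeddings.py):

from sentence_transformers import SentenceTransformer

# Force CPU inference to sidestep CUDA/GPU errors.
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")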

License

This project is licensed under the MIT License.
