A Retrieval-Augmented Generation (RAG) system built with ZenML pipelines for document question-answering.
This RAG system allows you to:
- Process documents (PDF, DOCX, TXT, HTML, etc.) into a vector database
- Query the processed documents using natural language
- Retrieve the most relevant document chunks for your queries
Key features:

- Multi-format Document Support: Process PDFs, Word documents, text files, HTML, and more
- Smart Text Chunking: Split documents intelligently with customizable chunk sizes
- Efficient Embedding: Generate embeddings using SentenceTransformers models
- Fast Vector Search: Use FAISS for efficient similarity search
- Hybrid Search: Combine semantic search with keyword matching for better results (see the sketch after this list)
- ZenML Integration: Leverage ZenML for pipeline orchestration and reproducibility
- CLI Interface: Simple command-line interface for document processing and querying
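Hybrid search here means blending a semantic similarity score with a keyword-overlap score. A minimal sketch of the idea (the scoring function and the `alpha` weight are illustrative, not the project's actual implementation):

```python
import numpy as np

def hybrid_scores(semantic_scores, documents, query_terms, alpha=0.7):
    """Blend semantic (cosine) scores with keyword-overlap scores.

    alpha=1.0 is pure semantic search; alpha=0.0 is pure keyword matching.
    """
    keyword_scores = np.array([
        sum(term.lower() in doc.lower() for term in query_terms) / len(query_terms)
        for doc in documents
    ])
    return alpha * np.asarray(semantic_scores) + (1 - alpha) * keyword_scores

# Toy example: the second document matches both query terms exactly.
docs = ["FAISS builds vector indexes", "keyword matching boosts recall"]
print(hybrid_scores([0.82, 0.41], docs, ["keyword", "matching"]))
```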
Project structure:

```
rag_system/
├── src/
│   ├── utils/
│   │   ├── documents_processor.py   # Document loading and processing
│   │   ├── text_splitter.py         # Text chunking
│   │   └── vector_utils.py          # Vector operations utilities
│   ├── models/
│   │   └── embeddings.py            # Embedding models
│   ├── data/
│   │   └── vector_store.py          # Vector storage and retrieval
│   └── pipelines/
│       ├── document_pipeline.py     # Document processing pipeline
│       └── query_pipeline.py        # Query pipeline
├── rag_system.py                    # Main RAG system interface
├── main.py                          # Command-line entry point
└── README.md                        # Documentation
```
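The two pipeline modules follow ZenML's step/pipeline pattern. A minimal sketch of the shape `document_pipeline.py` might take (step names, signatures, and bodies are illustrative, not the project's actual code; assumes ZenML >= 0.40 and an initialized ZenML setup):

```python
from zenml import pipeline, step

@step
def load_documents(document_path: str) -> list:
    # Real loading lives in src/utils/documents_processor.py; this toy
    # step just wraps the path so the pipeline shape is visible.
    return [f"contents of {document_path}"]

@step
def split_documents(documents: list, chunk_size: int, chunk_overlap: int) -> list:
    # Real chunking lives in src/utils/text_splitter.py; naive stand-in here.
    stride = chunk_size - chunk_overlap
    return [doc[i:i + chunk_size]
            for doc in documents
            for i in range(0, len(doc), stride)]

@step
def embed_and_store(chunks: list, storage_path: str) -> int:
    # Real code embeds with SentenceTransformers and persists to FAISS.
    return len(chunks)

@pipeline
def document_pipeline(document_path: str, storage_path: str,
                      chunk_size: int = 1000, chunk_overlap: int = 200):
    docs = load_documents(document_path)
    chunks = split_documents(docs, chunk_size, chunk_overlap)
    embed_and_store(chunks, storage_path)

# Running the pipeline triggers a tracked, reproducible ZenML run:
# document_pipeline(document_path="docs/", storage_path="./vector_db")
```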
To get started:

- Clone the repository:

```bash
git clone https://github.com/yourusername/rag-system.git
cd rag-system
```

- Install the required dependencies:

```bash
pip install zenml langchain sentence-transformers faiss-cpu pypdf
```

- For additional document format support:

```bash
pip install unstructured
```

The `main.py` script provides a simple command-line interface with three modes of operation:
```bash
# Process documents
python main.py process --document-path path/to/documents/ --storage-path ./vector_db

# Query documents
python main.py query --storage-path ./vector_db --query "What is the main topic of these documents?"

# Interactive mode (ask multiple questions)
python main.py interactive --storage-path ./vector_db
```

Options:

- `--document-path`, `-d`: Path to a document or directory to process
- `--storage-path`, `-s`: Path to store the vector database (default: temporary directory)
- `--query`, `-q`: Query string for searching documents
- `--chunk-size`: Size of document chunks (default: 1000)
- `--chunk-overlap`: Overlap between chunks (default: 200)
- `--top-k`, `-k`: Number of results to return for queries (default: 3)
- `--embedding-model`, `-m`: Name of the embedding model to use (default: `all-MiniLM-L6-v2`)
- `--hybrid-search`: Use hybrid search combining semantic and keyword matching
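For example, to index technical documentation with smaller, more overlapping chunks and a higher-quality embedding model, then query it with hybrid search (the flag values here are illustrative):

```bash
python main.py process -d path/to/documents/ -s ./vector_db \
    --chunk-size 500 --chunk-overlap 150 -m all-mpnet-base-v2

python main.py query -s ./vector_db -q "How is authentication configured?" \
    -k 5 --hybrid-search
```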
You can also use the RAG system programmatically in your Python code:
```python
from rag_system import RAGSystem

# Initialize RAG system
rag = RAGSystem(storage_path="./vector_db")

# Process a document or directory
result = rag.process_documents(
    document_path="path/to/documents/",
    chunk_size=1000,
    chunk_overlap=200,
)
print(f"Processed {result['num_chunks']} document chunks")

# Query the processed documents
answer = rag.query(
    query="What is the main topic discussed in these documents?",
    top_k=3,
    hybrid_search=True,
)

# Print results
for result in answer['results']:
    print(f"Rank {result['rank']} (Score: {result['score']:.4f})")
    print(f"Content: {result['content']}")
    print(f"Source: {result['source']}")
```

You can use different SentenceTransformers models by changing the `embedding_model` parameter:
- `all-MiniLM-L6-v2` (default): Fast and balanced
- `all-mpnet-base-v2`: Higher quality but slower
- `paraphrase-multilingual-MiniLM-L12-v2`: For multilingual support
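For example, a multilingual setup might look like this (assuming `embedding_model` is accepted by the `RAGSystem` constructor, mirroring the CLI's `--embedding-model` flag):

```python
from rag_system import RAGSystem

# Constructor argument assumed here, not verified against rag_system.py.
rag = RAGSystem(
    storage_path="./vector_db",
    embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
)
```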
To tune search behavior:

- Change `index_type` to `"IP"` (Inner Product) for cosine similarity instead of L2 distance
- Use `hybrid_search=True` to combine semantic search with keyword matching
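The reason `"IP"` yields cosine similarity: the inner product of L2-normalized vectors equals their cosine. A self-contained FAISS sketch of this (independent of the project's `vector_store.py`):

```python
import faiss
import numpy as np

# Toy embeddings: 4 vectors of dimension 8.
vecs = np.random.rand(4, 8).astype("float32")

# L2-normalize in place so that inner product == cosine similarity.
faiss.normalize_L2(vecs)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner-product ("IP") index
index.add(vecs)

query = np.random.rand(1, 8).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 3)  # cosine scores, highest first
print(ids[0], scores[0])
```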
To tune chunking:

- Modify `chunk_size` and `chunk_overlap` to optimize for your specific documents
- For longer documents, increase chunk size
- For technical documents, decrease chunk size and increase overlap
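To see how the two parameters interact, here is a naive character-window chunker (the project's `text_splitter.py` splits more intelligently, but the arithmetic is the same): each chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    # Each window advances by chunk_size - chunk_overlap characters,
    # so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900, 100]
```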
To create a complete RAG system, integrate with an LLM:
```python
import os

from openai import OpenAI  # requires openai>=1.0; any other LLM API works too
from rag_system import RAGSystem

# Initialize RAG system
rag = RAGSystem(storage_path="./vector_db")

# Process documents if needed
if not os.path.exists("./vector_db"):
    rag.process_documents("path/to/documents/")

client = OpenAI()

# Query function with LLM integration
def answer_question(query, top_k=3):
    # Get relevant context from the RAG system
    results = rag.query(query, top_k=top_k)

    # Prepare context for the LLM
    context = "\n\n".join([r["content"] for r in results["results"]])

    # Create a prompt with the retrieved context
    prompt = (
        "Answer the question based on the following context:\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    )

    # Call the LLM API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [r["source"] for r in results["results"]],
    }

# Example usage
result = answer_question("What are the key benefits described in the document?")
print(result["answer"])
print(f"Sources: {result['sources']}")
```

Troubleshooting:

- FileNotFoundError: Ensure the document path is correct and accessible.
- Memory Issues: For large documents, reduce the batch size or chunk size.
- CUDA Errors: Set the device to 'cpu' in the embeddings module if you encounter GPU-related errors (see the sketch below).
- Unsupported File Types: Ensure you have the necessary dependencies for all file types (e.g., `unstructured` for Word documents).
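For the CUDA case, SentenceTransformers accepts a `device` argument; in this project the model is constructed inside `src/models/embeddings.py`, so the change goes there. A standalone example:

```python
from sentence_transformers import SentenceTransformer

# Force CPU inference to sidestep GPU/CUDA issues.
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
embeddings = model.encode(["a quick smoke test"], convert_to_numpy=True)
print(embeddings.shape)  # (1, 384) for all-MiniLM-L6-v2
```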
This project is licensed under the MIT License.