husaynirfan1/simple-rag


Simple RAG 📌

A not-so-lightweight Retrieval-Augmented Generation (RAG) system utilizing Milvus (Zilliz Cloud) as a vector database.

This project uses scraped Al-Manar News as sample data.

🚀 Features

  • Coreference Resolution: Uses the LingMess coreference model (lingmesscoref) to resolve pronouns and references in chat history.
  • Multi-turn Conversation: Configurable conversation window (the number of past turns kept as context).
  • Web Scraping: Includes a scraping script for Al-Manar English.
  • Scalability: Utilizes Milvus and Zilliz Cloud, making scaling easy.
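The multi-turn window from the features above can be kept with a bounded deque that drops the oldest turns automatically. A minimal sketch — the turn format, helper names, and window size are assumptions for illustration, not the project's actual implementation:

```python
from collections import deque


def make_history(window_turns: int = 3) -> deque:
    """Bounded chat history: each entry is one (user, assistant) turn."""
    return deque(maxlen=window_turns)


def add_turn(history: deque, user_msg: str, bot_msg: str) -> None:
    # Appending past maxlen silently evicts the oldest turn
    history.append((user_msg, bot_msg))


def render_context(history: deque) -> str:
    """Flatten the kept turns into a prompt-ready string."""
    return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history)


history = make_history(window_turns=2)
add_turn(history, "Who spoke at the rally?", "According to the article, ...")
add_turn(history, "When did it happen?", "...")
add_turn(history, "What was said?", "...")
# Only the 2 most recent turns remain in `history`
```

Coreference resolution runs over this rendered window, so earlier pronouns resolve against the turns that are still kept.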

🛠 Installation

1️⃣ Set Up Milvus (Zilliz Cloud Recommended)

You can either host your own Milvus instance or use Zilliz Cloud for convenience. The schema used in insertDataChunks.py should look like this (for reference, it also appears commented out in the script):

fields = [
    FieldSchema(
        name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100
    ),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),  # Store original text
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),  # Sparse vector field
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),  # Dense vector field (dim = dense_dim, e.g. 768)
]
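The sparse_vector and dense_vector fields support hybrid retrieval: a sparse (keyword-style) search and a dense (semantic) search each return their own ranking, and the two rankings are then fused. A minimal sketch of one common fusion rule, reciprocal rank fusion — the document IDs and the constant k are illustrative, and this is not necessarily the exact fusion Milvus applies internally:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(doc) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


sparse_hits = ["doc3", "doc1", "doc7"]  # e.g. keyword/sparse ranking
dense_hits = ["doc1", "doc5", "doc3"]   # e.g. dense-embedding ranking
fused = rrf_fuse([sparse_hits, dense_hits])
# Documents found by both searches (doc1, doc3) rise to the top
```

Documents that appear in both rankings accumulate score from each list, which is why storing both vector types per chunk pays off at query time.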

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Install & Configure Ollama

This project uses Llama 3.1 8B for contextual chunking and Qwen2.5 14B for user interaction. You can change the model names in insertDataChunks.py and in the Streamlit app (streamlit_app.py).
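A hedged sketch of what a contextual-chunking call to a local Ollama server can look like, using Ollama's /api/generate endpoint via the standard library. The prompt wording and helper names here are illustrative, not the script's actual code:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(chunk: str, document: str, model: str = "llama3.1:8b") -> dict:
    """Ask the model to situate a chunk within its source document
    (the 'contextual chunking' step performed before embedding)."""
    prompt = (
        f"<document>\n{document}\n</document>\n"
        f"Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
        "Write one short sentence situating this chunk within the document."
    )
    return {"model": model, "prompt": prompt, "stream": False}


def generate_context(chunk: str, document: str, model: str = "llama3.1:8b") -> str:
    # Requires a running Ollama server with the model pulled
    # (e.g. `ollama pull llama3.1:8b`)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(chunk, document, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Swapping Llama 3.1 for Qwen2.5 (or any other pulled model) is just a change of the model string.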

4️⃣ Install NLTK

This project uses NLTK for text chunking. After chunking, document-level context is injected into each chunk, so the hybrid (sparse and dense) vectorized data that gets uploaded carries both the original text and its contextualized chunk.

pip install nltk
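The chunk-then-inject-context step can be sketched as sentence splitting followed by grouping. A simple regex splitter stands in here for NLTK's sent_tokenize so the snippet runs without a punkt download; the chunk size and context prefix are illustrative:

```python
import re


def split_sentences(text: str) -> list[str]:
    # Stdlib stand-in for nltk.tokenize.sent_tokenize
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks


def inject_context(chunks: list[str], doc_title: str) -> list[str]:
    # Prepend document-level context so each chunk embeds with its provenance
    return [f"[From article: {doc_title}] {c}" for c in chunks]


sents = split_sentences("First sentence. Second one! A third?")
contextual_chunks = inject_context(chunk_sentences(sents, max_chars=40), "Sample")
```

In the real pipeline the injected context comes from the LLM step above rather than a fixed title prefix, but the flow — split, group, prepend, then embed — is the same.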

5️⃣ Process and Insert Data

Run insertDataChunks.py. It will prompt you for the path of a data folder; for testing, use the sample data folder included in this repo.

python insertDataChunks.py
Path: /folder

6️⃣ Run Streamlit App

streamlit run streamlit_app.py

🔄 Flowchart

The project initially utilized DeepSeek R1 for its reasoning capabilities. However, for general use, it is recommended to use a non-reasoning model.

Flowchart

💡 Notes

  • You will see handle_message twice in the Streamlit app. This is because the project was initially designed for a Telegram bot as the interface.
  • The repository also includes a pre-written Telegram bot script if you want to use it.

📌 License

This project is open-source under the MIT License.

📧 Contact

