Add support for OpenAI embeddings. Bump all dependencies
DL committed Dec 8, 2024
1 parent 0f162ba commit 402bd46
Showing 10 changed files with 162 additions and 77 deletions.
27 changes: 13 additions & 14 deletions README.md
Expand Up @@ -8,30 +8,36 @@ The purpose of this package is to offer a convenient question-answering (RAG) sy

## Features

* Supported formats
* Supported document formats
* Built-in parsers:
* `.md` - Divides files based on logical components such as headings, subheadings, and code blocks. Supports additional features like cleaning image links, adding custom metadata, and more.
* `.pdf` - MuPDF-based parser.
* `.docx` - custom parser, supports nested tables.
* Other common formats are supported by `Unstructured` pre-processor:
* List of formats see [here](https://unstructured-io.github.io/unstructured/core/partition.html).

* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.

* Optional support for image parsing using Gemini API.

* Supports multiple collections of documents, and filtering the results by collection.
* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
* OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
* HuggingFace models.
* Llama cpp supported models - for full list see [here](https://github.com/ggerganov/llama.cpp#description).

* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))

* Generates dense embeddings from a folder of documents and stores them in a vector database ([ChromaDB](https://github.com/chroma-core/chroma)).
* The following embedding models are supported:
* Hugging Face embeddings.
* Sentence-transformers-based models, e.g., `multilingual-e5-base`.
* Instructor-based models, e.g., `instructor-large`.
* OpenAI embeddings.

* Generates sparse embeddings using SPLADE (https://github.com/naver/splade) to enable hybrid search (sparse + dense).

* An ability to update the embeddings incrementally, without a need to re-index the entire document base.

* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.

* Optional support for image parsing using Gemini API.

* Supports the "Retrieve and Re-rank" strategy for semantic search, see [here](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
* Besides the original `ms-marco-MiniLM` cross-encoder, the more modern `bge-reranker` is supported.

Expand All @@ -44,13 +50,6 @@ The purpose of this package is to offer a convenient question-answering (RAG) sy

* Supports optional chat history with question contextualization

* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
* OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
* HuggingFace models.
* Llama cpp supported models - for full list see [here](https://github.com/ggerganov/llama.cpp#description).
* AutoGPTQ models (temporarily disabled due to broken dependencies).

* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))

* Other features
* Simple CLI and web interfaces.
Expand Down
45 changes: 22 additions & 23 deletions docs/index.rst
Expand Up @@ -11,59 +11,58 @@ The purpose of this package is to offer a convenient question-answering system w
Features
--------

* Supported formats
* Supported document formats
* Built-in parsers:
* `.md` - Divides files based on logical components such as headings, subheadings, and code blocks. Supports additional features like cleaning image links, adding custom metadata, and more.
* `.pdf` - MuPDF-based parser.
* `.docx` - custom parser, supports nested tables.
* Other common formats are supported by `Unstructured` pre-processor:
* List of formats https://unstructured-io.github.io/unstructured/core/partition.html
* List of formats see [here](https://unstructured-io.github.io/unstructured/core/partition.html).

* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.

* Optional support for image parsing using Gemini API.

* Supports multiple collections of documents, and filtering the results by collection.

* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
* OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
* HuggingFace models.
* Llama cpp supported models - for full list see [here](https://github.com/ggerganov/llama.cpp#description).

* Generates dense embeddings from a folder of documents and stores them in a vector database (ChromaDB).
* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))

* Generates dense embeddings from a folder of documents and stores them in a vector database ([ChromaDB](https://github.com/chroma-core/chroma)).
* The following embedding models are supported:

* Huggingface embeddings.
* Hugging Face embeddings.
* Sentence-transformers-based models, e.g., `multilingual-e5-base`.
* Instructor-based models, e.g., `instructor-large`.
* OpenAI embeddings.

* Generates sparse embeddings using SPLADE (https://github.com/naver/splade) to enable hybrid search (sparse + dense).

* Supports the "Retrieve and Re-rank" strategy for semantic search, see - https://www.sbert.net/examples/applications/retrieve_rerank/README.html.
* An ability to update the embeddings incrementally, without a need to re-index the entire document base.

* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.

* Optional support for image parsing using Gemini API.

* Supports the "Retrieve and Re-rank" strategy for semantic search, see [here](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
* Besides the original `ms-marco-MiniLM` cross-encoder, the more modern `bge-reranker` is supported.

* Supports HyDE (Hypothetical Document Embeddings) - https://arxiv.org/pdf/2212.10496.pdf
* Supports HyDE (Hypothetical Document Embeddings) - see [here](https://arxiv.org/pdf/2212.10496.pdf).
* WARNING: Enabling HyDE (via config OR webapp) can significantly alter the quality of the results. Please make sure to read the paper before enabling.
* Based on empirical observations, enabling HyDE significantly boosts the quality of the output on topics where the user can't formulate the question using the domain-specific language of the topic, e.g. when learning new topics.
* From my own experiments, enabling HyDE significantly boosts the quality of the output on topics where the user can't formulate the question using the domain-specific language of the topic, e.g. when learning new topics.

* Support for multi-querying, inspired by `RAG Fusion` - https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1
* When multi-querying is turned on (either config or webapp), the original query will be replaced by 3 variants of the same query, allowing to bridge the gap in the terminology and "offer different angles or perspectives" according to the article.

* Supports optional chat history with question contextualization

* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
* OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
* HuggingFace models.
* Llama cpp supported models - for full list see https://github.com/ggerganov/llama.cpp#description
* AutoGPTQ models (temporarily disabled due to broken dependencies).

* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))

* Other features
* Simple web interface.
* Simple CLI and web interfaces.
* Deep linking into document sections - jump to an individual PDF page or a header in a markdown file.
* Ability to save responses to an offline database for future analysis.
* Experimental API




Installation
============

Expand Down
27 changes: 14 additions & 13 deletions requirements.txt
@@ -1,30 +1,31 @@
llama-cpp-python==0.2.76
chromadb~=0.5.5
langchain~=0.2.14
langchain-community~=0.2.12
langchain-openai~=0.1.22
langchain-huggingface~=0.0.3
langchain>=0.3,<0.4
langchain-community>=0.3,<0.4
langchain-openai>=0.2,<0.3
langchain-huggingface>=0.1,<0.2
langchain-chroma>=0.1.4,<0.2
pydantic~=2.7
transformers~=4.41
sentence-transformers==3.0.1
transformers~=4.47
sentence-transformers==3.3.1
pypdf2~=3.0.1
ebooklib==0.18
# sentencepiece==0.20
setuptools==67.7.2
loguru
python-dotenv
accelerate~=0.33
accelerate~=1.2.0
protobuf==3.20.2
termcolor
openai~=1.41
openai~=1.57
einops # required for Mosaic models
click
bitsandbytes==0.43.1
# auto-gptq==0.2.0
InstructorEmbedding==1.0.1
unstructured~=0.14.5
pymupdf==1.24.9
streamlit~=1.28
unstructured~=0.16.9
pymupdf==1.25.0
streamlit~=1.40
python-docx~=1.1
six==1.16.0 ; python_version >= "3.10" and python_version < "4.0"
sniffio==1.3.0 ; python_version >= "3.10" and python_version < "4.0"
Expand All @@ -34,8 +35,8 @@ sympy==1.11.1 ; python_version >= "3.10" and python_version < "4.0"
tenacity==8.2.3 ; python_version >= "3.10" and python_version < "4.0"
threadpoolctl==3.1.0 ; python_version >= "3.10" and python_version < "4.0"
tiktoken==0.7.0 ; python_version >= "3.10" and python_version < "4.0"
tokenizers==0.19.1; python_version >= "3.10" and python_version < "4.0"
tokenizers>=0.21,<0.22; python_version >= "3.10" and python_version < "4.0"
tqdm==4.65.0 ; python_version >= "3.10" and python_version < "4.0"
# transformers==4.29.2 ; python_version >= "3.10" and python_version < "4.0"
gmft==0.2.1
google-generativeai~=0.7
google-generativeai~=0.8.3
1 change: 1 addition & 0 deletions sample_templates/generic/config_template.yaml
Expand Up @@ -5,6 +5,7 @@ embeddings:
  embeddings_path: /path/to/embedding/folder ## specify a folder where embeddings will be saved.

  embedding_model: # Optional embedding model specification, default is e5-large-v2. Swap to a smaller model if out of CUDA memory
    # Supported types: "huggingface", "instruct", "openai"
    type: sentence_transformer # other supported types - "huggingface" and "instruct"
    model_name: "intfloat/e5-large-v2"

Expand Down
40 changes: 40 additions & 0 deletions sample_templates/openai_embeddings.yaml
@@ -0,0 +1,40 @@
cache_folder: /storage/llm/cache

embeddings:
  embeddings_path: /storage/llm/embeddings_md2

  embedding_model:
    type: openai
    model_name: "text-embedding-3-large"
    additional_kwargs:
      dimensions: 1024

  splade_config:
    n_batch: 5

  chunk_sizes:
    - 1024

  document_settings:
    - doc_path: /storage/llm/md_docs2
      scan_extensions:
        - md
        - pdf
      passage_prefix: "passage: "
      label: "md"


semantic_search:
  search_type: similarity
  replace_output_path:
    - substring_search: "/storage"
      substring_replace: "okular:///storage"

  append_suffix:
    append_template: "#page={page}"

  max_char_size: 8192
  max_k: 15
  query_prefix: "query: "
  hyde:
    enabled: False
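
Note on the template above: the `embedding_model` section presumably ends up as keyword arguments for an OpenAI embeddings client. The following is a minimal sketch of that mapping, not the code from this commit. The helper names `openai_embedding_kwargs` and `make_openai_embeddings` are hypothetical, and it assumes that the entries of `additional_kwargs` are passed through unchanged to `langchain_openai.OpenAIEmbeddings`, which accepts `model` and `dimensions`.

```python
from typing import Any, Dict

# Mirrors the `embedding_model` section of the sample template above.
embedding_model_cfg: Dict[str, Any] = {
    "type": "openai",
    "model_name": "text-embedding-3-large",
    "additional_kwargs": {"dimensions": 1024},
}


def openai_embedding_kwargs(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Translate the config section into constructor kwargs (hypothetical helper)."""
    if cfg.get("type") != "openai":
        raise ValueError(f"Expected type 'openai', got {cfg.get('type')!r}")
    kwargs: Dict[str, Any] = {"model": cfg["model_name"]}
    # Assumption: additional_kwargs are forwarded as-is to the embeddings class.
    kwargs.update(cfg.get("additional_kwargs") or {})
    return kwargs


def make_openai_embeddings(cfg: Dict[str, Any]):
    """Build the embeddings object; needs langchain-openai and OPENAI_API_KEY."""
    # Imported lazily so the kwargs helper works without the package installed.
    from langchain_openai import OpenAIEmbeddings

    return OpenAIEmbeddings(**openai_embedding_kwargs(cfg))


print(openai_embedding_kwargs(embedding_model_cfg))
```

Requesting `dimensions: 1024` asks the `text-embedding-3-large` model for shortened vectors, which keeps the vector store smaller than the model's default output size would.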
10 changes: 5 additions & 5 deletions src/llmsearch/chroma.py
Expand Up @@ -4,7 +4,7 @@
from typing import List, Optional, Tuple

import tqdm
from langchain_community.vectorstores import Chroma
from langchain_chroma import Chroma
from loguru import logger

from llmsearch.config import Config
Expand Down Expand Up @@ -77,8 +77,8 @@ def create_index_from_documents(
metadatas=[doc.metadata for doc in group],
)
logger.info("Generated embeddings. Persisting...")
if vectordb is not None:
vectordb.persist()
# if vectordb is not None:
# vectordb.persist()
vectordb = None

def _load_retriever(self, **kwargs):
Expand All @@ -105,13 +105,13 @@ def add_documents(self, docs: List[Document]):
metadatas=[doc.metadata for doc in group],
)
logger.info("Generated embeddings. Persisting...")
self.vectordb.persist()
# self.vectordb.persist()

def delete_by_id(self, ids: List[str]):
logger.warning(f"Deleting {len(ids)} chunks.")
# vectordb = Chroma(persist_directory=self._persist_folder, embedding_function=self._embeddings)
self.vectordb.delete(ids=ids)
self.vectordb.persist()
# self.vectordb.persist()

def get_documents_by_id(self, document_ids: List[str]) -> List[Document]:
"""Retrieves documents by ids
Expand Down
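
Note on the commented-out `persist()` calls: as far as I know, Chroma 0.4 and later writes to `persist_directory` automatically, and the `Chroma` class in `langchain_chroma` no longer exposes a `persist()` method, which would explain why the calls are disabled here. An alternative to commenting them out is a small guard that calls `persist()` only when the object still provides it. This is a sketch under that assumption; `persist_if_supported` and the two stand-in classes are hypothetical and not part of this commit.

```python
from typing import Any


def persist_if_supported(vectordb: Any) -> bool:
    """Call vectordb.persist() only if the method exists; report whether it ran."""
    persist = getattr(vectordb, "persist", None)
    if callable(persist):
        persist()
        return True
    return False


class LegacyStore:
    """Stand-in for an older vector store wrapper that still has persist()."""

    def __init__(self) -> None:
        self.persisted = False

    def persist(self) -> None:
        self.persisted = True


class AutoPersistStore:
    """Stand-in for a wrapper that persists automatically and has no persist()."""


legacy = LegacyStore()
print(persist_if_supported(legacy))              # persist() was called
print(persist_if_supported(AutoPersistStore()))  # nothing to call
```

The guard keeps the call sites unchanged across both wrapper versions, at the cost of hiding which version is actually in use.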
1 change: 1 addition & 0 deletions src/llmsearch/config.py
Expand Up @@ -70,6 +70,7 @@ class EmbeddingModelType(str, Enum):
    huggingface = "huggingface"
    instruct = "instruct"
    sentence_transformer = "sentence_transformer"
    openai = "openai"


class EmbeddingModel(BaseModel):
Expand Down
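
Note on the new enum member: because `EmbeddingModelType` subclasses both `str` and `Enum`, the value `openai` read from a YAML config can be validated by constructing the enum directly. The sketch below repeats the enum from the hunk above and adds a hypothetical `parse_embedding_type` helper to illustrate the lookup and the error for unsupported values; the project itself presumably relies on pydantic for this validation.

```python
from enum import Enum


class EmbeddingModelType(str, Enum):
    huggingface = "huggingface"
    instruct = "instruct"
    sentence_transformer = "sentence_transformer"
    openai = "openai"


def parse_embedding_type(raw: str) -> EmbeddingModelType:
    """Map a config string to the enum, listing valid options on failure (hypothetical helper)."""
    try:
        return EmbeddingModelType(raw)
    except ValueError:
        supported = ", ".join(member.value for member in EmbeddingModelType)
        raise ValueError(
            f"Unsupported embedding type {raw!r}; expected one of: {supported}"
        ) from None


print(parse_embedding_type("openai") is EmbeddingModelType.openai)
```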