
Commit b3459b1

Adding Azure CosmosDB Mongo vCore as a datastore. (#379)
* Adding mongo vCore as a datastore
* updating the readme file
1 parent 13a0b03 commit b3459b1

8 files changed: +828 -1 lines changed

README.md

Lines changed: 10 additions & 0 deletions
@@ -44,6 +44,7 @@ This README provides detailed information on how to set up, develop, and deploy
 - [Llama Index](#llamaindex)
 - [Chroma](#chroma)
 - [Azure Cognitive Search](#azure-cognitive-search)
+- [Azure CosmosDB Mongo vCore](#azure-cosmosdb-mongo-vcore)
 - [Supabase](#supabase)
 - [Postgres](#postgres)
 - [AnalyticDB](#analyticdb)
@@ -154,6 +155,12 @@ Follow these steps to quickly set up and run the ChatGPT Retrieval Plugin:
 export AZURESEARCH_SERVICE=<your_search_service_name>
 export AZURESEARCH_INDEX=<your_search_index_name>
 export AZURESEARCH_API_KEY=<your_api_key> (optional, uses key-free managed identity if not set)
+
+# Azure CosmosDB Mongo vCore
+export AZCOSMOS_API=<your azure cosmos db api, currently only mongo-vcore is supported>
+export AZCOSMOS_CONNSTR=<your azure cosmos db mongo vcore connection string>
+export AZCOSMOS_DATABASE_NAME=<your mongo database name>
+export AZCOSMOS_CONTAINER_NAME=<your mongo container name>
 
 # Supabase
 export SUPABASE_URL=<supabase_project_url>
@@ -351,6 +358,9 @@ For detailed setup instructions, refer to [`/docs/providers/llama/setup.md`](/do
 
 [Azure Cognitive Search](https://azure.microsoft.com/products/search/) is a complete retrieval cloud service that supports vector search, text search, and hybrid (vectors + text combined to yield the best of the two approaches). It also offers an [optional L2 re-ranking step](https://learn.microsoft.com/azure/search/semantic-search-overview) to further improve results quality. For detailed setup instructions, refer to [`/docs/providers/azuresearch/setup.md`](/docs/providers/azuresearch/setup.md)
 
+#### Azure CosmosDB Mongo vCore
+[Azure CosmosDB Mongo vCore](https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/) supports vector search on embeddings and can be used to integrate your AI-based applications with the data you store in Azure Cosmos DB. For detailed setup instructions, refer to [`/docs/providers/azurecosmosdb/setup.md`](/docs/providers/azurecosmosdb/setup.md).
+
 #### Supabase
 
 [Supabase](https://supabase.com/blog/openai-embeddings-postgres-vector) offers an easy and efficient way to store vectors via [pgvector](https://github.com/pgvector/pgvector) extension for Postgres Database. [You can use Supabase CLI](https://github.com/supabase/cli) to set up a whole Supabase stack locally or in the cloud or you can also use docker-compose, k8s and other options available. For a hosted/managed solution, try [Supabase.com](https://supabase.com/) and unlock the full power of Postgres with built-in authentication, storage, auto APIs, and Realtime features. For detailed setup instructions, refer to [`/docs/providers/supabase/setup.md`](/docs/providers/supabase/setup.md).
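
As a rough illustration of how the `AZCOSMOS_*` exports in this README change are consumed, the sketch below checks that the settings are present before the plugin starts. The variable names come from this commit; the helper `missing_cosmos_settings` is hypothetical and is not part of the repository.

```python
import os
from typing import List, Mapping

# Variables the new datastore asserts on at import time (names taken from this commit).
# AZCOSMOS_API is left out because the datastore falls back to "mongo-vcore".
REQUIRED_VARS = (
    "AZCOSMOS_CONNSTR",
    "AZCOSMOS_DATABASE_NAME",
    "AZCOSMOS_CONTAINER_NAME",
)


def missing_cosmos_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty.

    Hypothetical helper, for illustration only.
    """
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_cosmos_settings()
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All Cosmos DB settings are present.")
```

Running the check before the server starts turns a bare `AssertionError` at import time into a message that names the missing variable.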

datastore/factory.py

Lines changed: 4 additions & 0 deletions
@@ -36,6 +36,10 @@ async def get_datastore() -> DataStore:
             from datastore.providers.redis_datastore import RedisDataStore
 
             return await RedisDataStore.init()
+        case "azurecosmosdb":
+            from datastore.providers.azurecosmosdb_datastore import AzureCosmosDBDataStore
+
+            return await AzureCosmosDBDataStore.create()
         case "qdrant":
             from datastore.providers.qdrant_datastore import QdrantDataStore

datastore/providers/azurecosmosdb_datastore.py

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
+import logging
+import os
+
+import certifi
+import numpy as np
+import pymongo
+
+from pymongo.mongo_client import MongoClient
+from abc import ABC, abstractmethod
+
+from typing import Dict, List, Optional
+from datetime import datetime
+from datastore.datastore import DataStore
+from models.models import (
+    DocumentChunk,
+    DocumentMetadataFilter,
+    DocumentChunkWithScore,
+    QueryResult,
+    QueryWithEmbedding,
+)
+from services.date import to_unix_timestamp
+
+
+# Read environment variables for CosmosDB Mongo vCore
+AZCOSMOS_API = os.environ.get("AZCOSMOS_API", "mongo-vcore")
+AZCOSMOS_CONNSTR = os.environ.get("AZCOSMOS_CONNSTR")
+AZCOSMOS_DATABASE_NAME = os.environ.get("AZCOSMOS_DATABASE_NAME")
+AZCOSMOS_CONTAINER_NAME = os.environ.get("AZCOSMOS_CONTAINER_NAME")
+assert AZCOSMOS_API is not None
+assert AZCOSMOS_CONNSTR is not None
+assert AZCOSMOS_DATABASE_NAME is not None
+assert AZCOSMOS_CONTAINER_NAME is not None
+
+# OpenAI Ada Embeddings Dimension
+VECTOR_DIMENSION = 1536
+
+
+# Abstract class similar to the original data store that allows API level abstraction
+class AzureCosmosDBStoreApi(ABC):
+    @abstractmethod
+    async def ensure(self, num_lists, similarity):
+        raise NotImplementedError
+
+    @abstractmethod
+    async def upsert_core(self, docId: str, chunks: List[DocumentChunk]) -> List[str]:
+        raise NotImplementedError
+
+    @abstractmethod
+    async def query_core(self, query: QueryWithEmbedding) -> List[DocumentChunkWithScore]:
+        raise NotImplementedError
+
+    @abstractmethod
+    async def drop_container(self):
+        raise NotImplementedError
+
+    @abstractmethod
+    async def delete_filter(self, filter: DocumentMetadataFilter):
+        raise NotImplementedError
+
+    @abstractmethod
+    async def delete_ids(self, ids: List[str]):
+        raise NotImplementedError
+
+    @abstractmethod
+    async def delete_document_ids(self, documentIds: List[str]):
+        raise NotImplementedError
+
+
+class MongoStoreApi(AzureCosmosDBStoreApi):
+    def __init__(self, mongoClient: MongoClient):
+        self.mongoClient = mongoClient
+
+    @staticmethod
+    def _get_metadata_filter(filter: DocumentMetadataFilter) -> dict:
+        returnedFilter: dict = {}
+        if filter.document_id is not None:
+            returnedFilter["document_id"] = filter.document_id
+        if filter.author is not None:
+            returnedFilter["metadata.author"] = filter.author
+        if filter.start_date is not None:
+            returnedFilter.setdefault("metadata.created_at", {})["$gt"] = datetime.fromisoformat(filter.start_date)
+        if filter.end_date is not None:
+            returnedFilter.setdefault("metadata.created_at", {})["$lt"] = datetime.fromisoformat(filter.end_date)
+        if filter.source is not None:
+            returnedFilter["metadata.source"] = filter.source
+        if filter.source_id is not None:
+            returnedFilter["metadata.source_id"] = filter.source_id
+        return returnedFilter
+
+    async def ensure(self, num_lists, similarity):
+        assert self.mongoClient.is_mongos
+        self.collection = self.mongoClient[AZCOSMOS_DATABASE_NAME][AZCOSMOS_CONTAINER_NAME]
+
+        indexes = self.collection.index_information()
+        if indexes.get("embedding_cosmosSearch") is None:
+            # Ensure the vector index exists.
+            indexDefs: List[dict] = [
+                {
+                    "name": "embedding_cosmosSearch",
+                    "key": {"embedding": "cosmosSearch"},
+                    "cosmosSearchOptions": {
+                        "kind": "vector-ivf",
+                        "numLists": num_lists,
+                        "similarity": similarity,
+                        "dimensions": VECTOR_DIMENSION,
+                    },
+                }
+            ]
+            self.mongoClient[AZCOSMOS_DATABASE_NAME].command("createIndexes", AZCOSMOS_CONTAINER_NAME,
+                                                             indexes=indexDefs)
+
+    async def upsert_core(self, docId: str, chunks: List[DocumentChunk]) -> List[str]:
+        # Until nested doc embedding support is done, treat each chunk as a separate doc.
+        doc_ids: List[str] = []
+        for chunk in chunks:
+            finalDocChunk: dict = {
+                "_id": f"doc:{docId}:chunk:{chunk.id}",
+                "document_id": docId,
+                "embedding": chunk.embedding,
+                "text": chunk.text,
+                "metadata": dict(chunk.metadata.__dict__),  # copy, so the chunk itself is not mutated below
+            }
+
+            if chunk.metadata.created_at is not None:
+                finalDocChunk["metadata"]["created_at"] = datetime.fromisoformat(chunk.metadata.created_at)
+            self.collection.insert_one(finalDocChunk)
+            doc_ids.append(finalDocChunk["_id"])
+        return doc_ids
+
+    async def query_core(self, query: QueryWithEmbedding) -> List[DocumentChunkWithScore]:
+        pipeline = [
+            {
+                "$search": {
+                    "cosmosSearch": {
+                        "vector": query.embedding,
+                        "path": "embedding",
+                        "k": query.top_k},
+                    "returnStoredSource": True}
+            },
+            {
+                "$project": {
+                    "similarityScore": {
+                        "$meta": "searchScore"
+                    },
+                    "document": "$$ROOT"
+                }
+            }
+        ]
+
+        # TODO: Add in match filter (once it can be satisfied).
+        # Perform vector search
+        query_results: List[DocumentChunkWithScore] = []
+        for aggResult in self.collection.aggregate(pipeline):
+            finalMetadata = aggResult["document"]["metadata"]
+            if finalMetadata.get("created_at") is not None:
+                finalMetadata["created_at"] = datetime.isoformat(finalMetadata["created_at"])
+            result = DocumentChunkWithScore(
+                id=aggResult["_id"],
+                score=aggResult["similarityScore"],
+                text=aggResult["document"]["text"],
+                metadata=finalMetadata
+            )
+            query_results.append(result)
+        return query_results
+
+    async def drop_container(self):
+        self.collection.drop()
+
+    async def delete_filter(self, filter: DocumentMetadataFilter):
+        delete_filter = self._get_metadata_filter(filter)
+        self.collection.delete_many(delete_filter)
+
+    async def delete_ids(self, ids: List[str]):
+        self.collection.delete_many({"_id": {"$in": ids}})
+
+    async def delete_document_ids(self, documentIds: List[str]):
+        self.collection.delete_many({"document_id": {"$in": documentIds}})
+
+
+# Datastore implementation.
+class AzureCosmosDBDataStore(DataStore):
+    """
+    A datastore backed by Azure Cosmos DB. Currently only the Mongo vCore API is supported.
+    """
+    def __init__(self, cosmosStore: AzureCosmosDBStoreApi):
+        self.cosmosStore = cosmosStore
+
+    @staticmethod
+    async def create(num_lists: int = 1, similarity: str = "COS") -> DataStore:
+        """
+        Creates a new datastore based on the Cosmos API provided in the environment variables.
+        Only Mongo vCore is supported for now.
+
+        Args:
+            num_lists (int) : The number of clusters that the inverted file (IVF) index
+                              uses to group the vector data. We recommend that num_lists is set to
+                              documentCount/1000 for up to 1 million documents and to sqrt(documentCount)
+                              for more than 1 million documents. Using a num_lists value of 1 is akin to
+                              performing brute-force search, which has limited performance.
+            similarity (str) : Similarity metric to use with the IVF index. Possible options are COS (cosine distance),
+                               L2 (Euclidean distance), and IP (inner product).
+
+        """
+
+        # Create underlying data store based on the API definition.
+        # Right now this only supports Mongo, but set up to support more.
+        # The defaults above are an assumption so that the no-argument call in datastore/factory.py works.
+        apiStore: AzureCosmosDBStoreApi = None
+        if AZCOSMOS_API == "mongo-vcore":
+            mongoClient = MongoClient(AZCOSMOS_CONNSTR)
+            apiStore = MongoStoreApi(mongoClient)
+        else:
+            raise NotImplementedError
+
+        await apiStore.ensure(num_lists, similarity)
+        store = AzureCosmosDBDataStore(apiStore)
+        return store
+
+    async def _upsert(self, chunks: Dict[str, List[DocumentChunk]]) -> List[str]:
+        """
+        Takes in a dict mapping document ids to lists of document chunks and inserts them into the database.
+        Returns a list of the inserted chunk ids.
+        """
+        # Initialize a list of ids to return
+        doc_ids: List[str] = []
+        for doc_id, chunk_list in chunks.items():
+            returnedIds = await self.cosmosStore.upsert_core(doc_id, chunk_list)
+            for returnedId in returnedIds:
+                doc_ids.append(returnedId)
+        return doc_ids
+
+    async def _query(
+        self,
+        queries: List[QueryWithEmbedding],
+    ) -> List[QueryResult]:
+        """
+        Takes in a list of queries with embeddings and filters and
+        returns a list of query results with matching document chunks and scores.
+        """
+        # Prepare query responses and results object
+        results: List[QueryResult] = []
+
+        # Gather query results in a pipeline
+        logging.info(f"Gathering {len(queries)} query results")
+        for query in queries:
+            logging.info(f"Query: {query.query}")
+            query_results = await self.cosmosStore.query_core(query)
+
+            # Add to overall results
+            results.append(QueryResult(query=query.query, results=query_results))
+        return results
+
+    async def delete(
+        self,
+        ids: Optional[List[str]] = None,
+        filter: Optional[DocumentMetadataFilter] = None,
+        delete_all: Optional[bool] = None,
+    ) -> bool:
+        """
+        Removes vectors by ids, filter, or everything in the datastore.
+        Returns whether the operation was successful.
+        """
+        if delete_all:
+            # fast path - truncate/delete all items.
+            await self.cosmosStore.drop_container()
+            return True
+
+        if filter:
+            if filter.document_id is not None:
+                await self.cosmosStore.delete_document_ids([filter.document_id])
+            else:
+                await self.cosmosStore.delete_filter(filter)
+
+        if ids:
+            await self.cosmosStore.delete_ids(ids)
+
+        return True
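
Because `start_date` and `end_date` both target the `metadata.created_at` field, a date range needs both bounds inside a single operator document. Otherwise one bound replaces the other. The following is a minimal sketch, assuming ISO 8601 strings as input. `build_date_range_filter` is a hypothetical helper and is not part of this commit.

```python
from datetime import datetime
from typing import Optional


def build_date_range_filter(start_date: Optional[str], end_date: Optional[str]) -> dict:
    """Build a MongoDB-style filter on metadata.created_at (hypothetical helper)."""
    bounds: dict = {}
    if start_date is not None:
        # Exclusive lower bound, mirroring the "$gt" used by the datastore.
        bounds["$gt"] = datetime.fromisoformat(start_date)
    if end_date is not None:
        # Exclusive upper bound, mirroring the "$lt" used by the datastore.
        bounds["$lt"] = datetime.fromisoformat(end_date)
    # Only emit the key when at least one bound is set.
    return {"metadata.created_at": bounds} if bounds else {}
```

The resulting dict can be passed to a call such as `delete_many`, and it stays empty when no date filter was requested.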

docs/providers/azurecosmosdb/setup.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Azure Cosmos DB
+
+[Azure Cosmos DB](https://azure.microsoft.com/en-us/products/cosmos-db/) is a fully managed NoSQL and relational database for modern app development. Using Azure Cosmos DB for MongoDB vCore, you can store vector embeddings in your documents and perform [vector similarity search](https://learn.microsoft.com/azure/cosmos-db/mongodb/vcore/vector-search) on a fully managed MongoDB-compatible database service.
+
+Learn more about Azure Cosmos DB for MongoDB vCore [here](https://learn.microsoft.com/azure/cosmos-db/mongodb/vcore/). If you don't have an Azure account, you can start setting one up [here](https://azure.microsoft.com/).
+
+## Environment variables
+
+| Name                      | Required | Description                                                                  | Default       |
+| ------------------------- | -------- | ---------------------------------------------------------------------------- | ------------- |
+| `DATASTORE`               | Yes      | Datastore name, set to `azurecosmosdb`                                       |               |
+| `BEARER_TOKEN`            | Yes      | Secret token                                                                 |               |
+| `OPENAI_API_KEY`          | Yes      | OpenAI API key                                                               |               |
+| `AZCOSMOS_API`            | No       | Name of the API you're connecting to. Currently only `mongo-vcore` is supported | `mongo-vcore` |
+| `AZCOSMOS_CONNSTR`        | Yes      | The connection string to your account                                        |               |
+| `AZCOSMOS_DATABASE_NAME`  | Yes      | The database where the data is stored/queried                                |               |
+| `AZCOSMOS_CONTAINER_NAME` | Yes      | The container where the data is stored/queried                               |               |
+
+## Indexing
+When the datastore is initialized, it creates the collection and, if necessary, a vector index on the `embedding` field. Hybrid search is not yet supported.
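
The index on the `embedding` field can be sketched as plain data. The field names below mirror the `createIndexes` command in this commit's datastore. `suggest_num_lists` and `vector_index_definition` are hypothetical helpers that encode the sizing guidance from the `create` docstring (documentCount/1000 for up to one million documents, sqrt(documentCount) beyond that). The integer rounding and the floor of 1 are assumptions.

```python
import math


def suggest_num_lists(document_count: int) -> int:
    """Suggest an IVF numLists value from the document count (hypothetical helper)."""
    if document_count <= 1_000_000:
        # documentCount / 1000, rounded down, but never below 1.
        return max(1, document_count // 1000)
    # Integer square root for collections above one million documents.
    return max(1, math.isqrt(document_count))


def vector_index_definition(num_lists: int, similarity: str = "COS", dimensions: int = 1536) -> dict:
    """Build the index document in the shape used by this commit's createIndexes command."""
    return {
        "name": "embedding_cosmosSearch",
        "key": {"embedding": "cosmosSearch"},
        "cosmosSearchOptions": {
            "kind": "vector-ivf",
            "numLists": num_lists,
            "similarity": similarity,  # COS, L2, or IP
            "dimensions": dimensions,  # 1536 matches OpenAI Ada embeddings
        },
    }


if __name__ == "__main__":
    print(vector_index_definition(suggest_num_lists(250_000)))
```

Keeping the definition as a plain dict makes it easy to inspect or log before it is sent to the database.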
