A lightweight, pure-Python vector database built from scratch.
This project explores the mechanics of vector similarity search by implementing a custom indexer based on the Vamana graph algorithm (DiskANN). It is designed for educational purposes and lightweight use cases, including semantic search and Retrieval-Augmented Generation (RAG).
Note: This project is a work in progress. APIs and features are subject to change.
- Vamana Graph Indexing: Utilizes the algorithm behind DiskANN.
- Index Auto-Tuning: Adaptively tunes the parameter alpha with a custom PI controller to stabilize the average graph degree, adapting to different dataset structures and improving recall without sacrificing latency.
- Built-in Reranking: Supports MMR (Maximal Marginal Relevance) reranking out of the box, returning more varied, contextually rich results for RAG applications.
- C-Level Speed: By leveraging Numba JIT compilation, TrovaDB achieves indexing and search performance comparable to C while maintaining a readable, hackable Python codebase.
- Persistence: The full database is stored in a single SQLite file, ensuring portability and crash safety.
- Data Science Ready SDK: A lightweight Python client with native NumPy support and a simple interface.
- Familiar Stack: Powered by FastAPI, SQLAlchemy and Alembic.
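To make the role of alpha more concrete, here is a minimal sketch of the RobustPrune step described in the DiskANN paper, which decides which candidates a node keeps as out-neighbors. It is an illustration only and not TrovaDB's actual code: the function name `robust_prune`, its signature, and the use of Euclidean distance are assumptions made for this example.

```python
import numpy as np


def robust_prune(p, candidates, alpha, max_degree):
    """Pick up to max_degree out-neighbors for point p (RobustPrune-style sketch).

    p:          (d,) array, the point whose neighbor list is being built
    candidates: (n, d) array of candidate neighbor vectors
    alpha:      pruning factor (>= 1); larger values prune less aggressively
    Returns indices into `candidates`, closest neighbor first.
    """
    p = np.asarray(p, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Assumption: Euclidean distance; the real index may use another metric.
    dist_to_p = np.linalg.norm(candidates - p, axis=1)

    # Candidates ordered from closest to farthest from p.
    remaining = [int(i) for i in np.argsort(dist_to_p, kind="stable")]
    neighbors = []

    while remaining and len(neighbors) < max_degree:
        closest = remaining.pop(0)
        neighbors.append(closest)
        # Drop every candidate v that the new neighbor already covers,
        # i.e. alpha * d(closest, v) <= d(p, v).
        remaining = [
            v for v in remaining
            if alpha * np.linalg.norm(candidates[closest] - candidates[v]) > dist_to_p[v]
        ]
    return neighbors


# Toy 1-D data: p sits at 0, candidates at 1, 2 and 4.
origin = np.array([0.0])
points = np.array([[1.0], [2.0], [4.0]])
print(robust_prune(origin, points, alpha=1.0, max_degree=3))  # [0]
print(robust_prune(origin, points, alpha=1.5, max_degree=3))  # [0, 2]
```

In this toy case, raising alpha from 1.0 to 1.5 keeps the long-range edge to the point at 4, so the node's degree grows from 1 to 2. This dependence of degree on alpha is the quantity an auto-tuning controller can act on.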
You can install TrovaDB directly from GitHub using pip:
pip install git+https://github.com/AlexHaborets/trovadb.git
If you want to run the database server locally, install it with the [server] extra:
pip install "trovadb[server] @ git+https://github.com/AlexHaborets/trovadb.git"
Once installed with the [server] extra, you can start the database server:
trovadb-server
(Runs on localhost:8000 by default.)
If you prefer not to install dependencies locally, you can clone the repository and run it instantly via Docker:
docker compose up --build
Here is a quick example of how to connect to the server, upsert vectors, and perform a search:
from trovadb.client import Client

with Client() as client:
    # Create a collection
    collection = client.get_or_create_collection("demo", dimension=3, metric="cosine")

    # Upsert vectors (combines insert & update operations in one)
    collection.upsert(
        ids=["1", "2", "3", "4", "5"],
        vectors=[
            [0.1, 0.2, 0.3],
            [0.9, 0.8, 0.7],
            [0.2, 0.4, 0.4],
            [0.1, 0.8, 0.2],
            [0.5, 0.3, 0.6],
        ],
    )

    q = [0.1, 0.2, 0.3]
    # Search for three nearest neighbors of q
    results = collection.search(query=q, k=3)
    print(results)

    # Delete specified vectors
    collection.delete(ids=["1", "3"])

    # Delete entire collection
    client.delete_collection("demo")

Check out the examples folder in the root of the repository for detailed usage:
- Tutorial Notebook: Interactive guide using Pandas and HuggingFace models.
- Large Dataset Benchmark: A stress test loading 50,000+ DBpedia articles for RAG.
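As a sanity check for the quick example above, the exact nearest neighbors can be computed by brute force with NumPy, with no server involved. This is only a sketch: the helper `exact_cosine_top_k` is a hypothetical name and not part of the TrovaDB client. Because the index performs approximate search and the format of the value returned by `collection.search` is not specified here, treat this as a reference ranking rather than the expected output.

```python
import numpy as np

# Same ids, vectors and query as in the quick example.
ids = ["1", "2", "3", "4", "5"]
vectors = np.array([
    [0.1, 0.2, 0.3],
    [0.9, 0.8, 0.7],
    [0.2, 0.4, 0.4],
    [0.1, 0.8, 0.2],
    [0.5, 0.3, 0.6],
])
q = np.array([0.1, 0.2, 0.3])


def exact_cosine_top_k(query, vectors, ids, k):
    """Return the k (id, cosine similarity) pairs most similar to query."""
    query = query / np.linalg.norm(query)
    normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarities = normalized @ query
    # Sort by descending similarity and keep the first k entries.
    order = np.argsort(-similarities, kind="stable")[:k]
    return [(ids[i], float(similarities[i])) for i in order]


top3 = exact_cosine_top_k(q, vectors, ids, k=3)
print([vector_id for vector_id, _ in top3])  # ['1', '3', '5']
```

On a collection this small, an approximate index would normally be expected to return the same three ids, so a mismatch is a useful signal when experimenting with index parameters.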
Completed Milestones
- Core vector store functionality
- Persistence
- Docker support
- Benchmarking test suite
Must-Haves
- Comprehensive test suite and unit tests
- In-memory client for local prototyping
Future Features
- Metadata filtering
- Hybrid search support
The name is inspired by the Italian phrase "Cerca Trova" ("Seek and you shall find"), a cryptic clue left by the artist Giorgio Vasari that is believed to indicate that a lost Da Vinci mural is hidden beneath his fresco in Florence. It felt like a fitting name for a fast and simple search tool.
- DiskANN: Subramanya, S. J., et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. Advances in Neural Information Processing Systems (NeurIPS).
- FreshDiskANN: Singh, A., et al. (2021). FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search. arXiv preprint arXiv:2105.09613.
- MMR: Carbonell, J., & Goldstein, J. (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR '98.
- Vamana Visualization: sushrut141/vamana - A helpful repo demonstrating the core algorithm.
This project is licensed under the MIT License. See the LICENSE file for details.