
ThemisDB Index Module

Module Purpose

The Index module provides high-performance indexing infrastructure for ThemisDB, supporting multiple index types across all five data models (relational, document, graph, vector, timeseries). It implements state-of-the-art algorithms including HNSW for vector similarity search, R-tree for spatial queries, and adaptive index recommendations based on query patterns.

Relevant Interfaces

Interface / File          Role
hnsw_index.cpp            HNSW vector similarity index with GPU acceleration support
rtree_index.cpp           R-tree spatial index for geospatial queries
adaptive_index.cpp        Automatic index recommendation based on query patterns
graph_index_manager.cpp   Graph adjacency list indexing and traversal support
secondary_index.cpp       B-tree, range, sparse, and composite secondary indexes

Scope

In Scope:

  • Vector indexing (HNSW, GPU-accelerated similarity search)
  • Secondary indexes (B-tree, range, sparse, composite)
  • Graph indexing (adjacency lists, BFS/DFS traversal)
  • Spatial indexing (R-tree for geospatial queries, Z-order curves)
  • Adaptive indexing (automatic index recommendations)
  • GPU acceleration (Vulkan, CUDA, HIP support)
  • Quantization for memory efficiency (Product Quantization, Binary Quantization, Residual Quantization)
  • Multi-distance metrics (L2, Cosine, Dot Product)
  • Advanced features (GNN embeddings, temporal graphs, rotary embeddings)

Out of Scope:

  • Query parsing and execution (handled by query module)
  • Data persistence (handled by storage module)
  • Network protocols and APIs (handled by server module)
  • Full-text search (planned for future release)

Key Components

IndexManager

Location: index_manager.cpp, ../include/index/index_manager.h

Unified index manager coordinating all index types with dependency injection support.

// Create index manager with dependency injection
auto evaluator = std::make_shared<ExpressionEvaluator>();
auto storage = std::make_shared<RocksDBWrapper>(config);
auto index_manager = std::make_shared<IndexManager>(evaluator, storage);

// Or use the default factory instead
auto default_manager = IndexManager::createDefault();

// Access specialized managers
auto vector_mgr = index_manager->getVectorIndexManager();
auto secondary_mgr = index_manager->getSecondaryIndexManager();
auto graph_mgr = index_manager->getGraphIndexManager();

Features:

  • Breaks circular dependencies (Index ↔ Query ↔ Storage)
  • Enables isolated unit testing with mock implementations
  • Supports filter expressions via injected evaluator
  • Maintains backward compatibility

VectorIndexManager

Location: vector_index.cpp, ../include/index/vector_index.h

High-performance vector similarity search with HNSW algorithm and optional GPU acceleration.

Algorithm: HNSW (Hierarchical Navigable Small World graphs) - Malkov & Yashunin (2020), IEEE TPAMI

Features:

  • HNSW Index: Approximate nearest neighbor (ANN) search
  • GPU Acceleration: Vulkan/CUDA/HIP support via GPUVectorIndex
  • Multiple Metrics: L2 (Euclidean), Cosine, Dot Product
  • Quantization: Product Quantization (PQ), Binary Quantization, Residual Quantization
  • Advanced Features: IVF+PQ, FAISS integration, multi-vector search
  • Production Defaults: Optimized parameters via HnswProductionDefaults
  • Workload Auto-Tuner: Runtime construction-parameter selection via HnswParameterTuner and WorkloadClassifier
  • RocksDB Persistence: Atomic updates via WriteBatch
  • Audit Logging: Track vector operations for compliance

Basic Usage:

RocksDBWrapper db(config);
VectorIndexManager vim(db);

// Initialize with dimension and metric
vim.init("embeddings", 1536, VectorIndexManager::Metric::COSINE);

// Add vectors (persisted to RocksDB)
std::vector<float> embedding = model.encode(text);
vim.addVector("embeddings", "doc123", embedding);

// Search for similar vectors
auto results = vim.search("embeddings", query_vector, /*k=*/10);
for (const auto& result : results) {
    std::cout << "ID: " << result.pk 
              << " Distance: " << result.distance << std::endl;
}

Advanced Configuration:

// Enable FAISS IVF+PQ for large-scale datasets
VectorIndexManager::AdvancedIndexConfig adv_config;
adv_config.enabled = true;
adv_config.index_type = VectorIndexManager::AdvancedIndexConfig::Type::IVF_PQ;
adv_config.nlist = 1024;      // Number of IVF clusters
adv_config.nprobe = 64;       // Clusters to search
adv_config.use_pq = true;     // Enable Product Quantization
adv_config.pq_m = 8;          // 8 sub-quantizers
adv_config.use_gpu = true;    // GPU acceleration
adv_config.gpu_device = 0;

vim.setAdvancedIndexConfig(adv_config);

// Production-optimized HNSW parameters
HnswProductionDefaults defaults;
auto hnsw_config = defaults.getConfig(
    HnswProductionDefaults::UseCaseProfile::RAG,  // Use case
    1000000,                                       // Dataset size
    HnswProductionDefaults::LatencyRequirement::P99_SUB_10MS
);
vim.setHnswConfig("embeddings", hnsw_config);

Workload-Adaptive Construction Parameters (Auto-Tuner):

// Option A: Classify workload automatically from observed traffic
WorkloadClassifier classifier;

// Record events as they occur (thread-safe)
classifier.recordInsert(1);        // single-vector inserts → OLTP signal
classifier.recordQuery(/*k=*/10);  // medium-k queries → RAG signal

// Detect workload type from accumulated stats
auto workload = classifier.detectWorkload();
// → OLTP / ANALYTICS / RAG / BATCH_INSERT / MIXED

// Option B: Ask the tuner to pick M and ef_construction automatically
HnswParameterTuner::Config tuner_cfg;
tuner_cfg.adaptive = true;
tuner_cfg.target_recall = 0.95;
HnswParameterTuner tuner(tuner_cfg);

// Record a few warm-up queries, then ask for construction params
tuner.recordQueryResult(/*k=*/10, /*ef=*/64, /*latency_ms=*/3.0, /*recall=*/0.96);
auto cp = tuner.getAutoTunedConstructionParams(/*dataset_size=*/500000);
// cp.M, cp.ef_construction, cp.detected_workload are all set

// Option C: Point-in-time target-based tuning (no warm-up needed)
auto params = HnswProductionDefaults::autoTuneParameters(
    500000,   // dataset size
    768,      // vector dimension
    100,      // sample_size (currently informational)
    10.0,     // target latency ms
    0.95      // target recall
);
// params.M, params.ef_construction, params.ef_search selected for the targets

Quantization for Memory Efficiency:

// Product Quantization: heavy compression (this config: 1536D float32 → 8 B codes, 768x)
ProductQuantizer::Config pq_config;
pq_config.num_subquantizers = 8;     // Divide 1536D into 8 subspaces
pq_config.num_centroids = 256;       // 8-bit codes
pq_config.prefer_faiss = true;       // Use FAISS K-means acceleration

auto pq = std::make_unique<ProductQuantizer>(pq_config);
pq->train(training_vectors);

// Encode vectors to 8-bit codes (1536D float32 -> 8 bytes)
std::vector<uint8_t> codes = pq->encode(vector);

// Asymmetric distance computation (ADC)
float distance = pq->compute_distance(codes, query_vector);

GPU Acceleration:

// Cross-platform GPU acceleration via Vulkan
GPUVectorIndex::Config gpu_config;
gpu_config.backend = GPUVectorIndex::Backend::VULKAN;  // Cross-platform
gpu_config.metric = GPUVectorIndex::DistanceMetric::COSINE;
gpu_config.M = 16;                    // HNSW connections per layer
gpu_config.efConstruction = 200;      // Construction accuracy
gpu_config.efSearch = 64;             // Search accuracy
gpu_config.batchSize = 512;           // Parallel batch size

auto gpu_index = std::make_unique<GPUVectorIndex>(gpu_config);
gpu_index->addVectors(ids, vectors);

// Fast GPU search (200K+ queries/sec)
auto results = gpu_index->search(query_vector, /*k=*/10);

Performance Characteristics:

  • Point queries: 0.1-1ms (HNSW)
  • Batch search: 200K+ queries/sec (GPU)
  • Memory: 1.5-2x of raw vectors (HNSW), 0.01-0.1x with PQ
  • Recall@10: 95-99% (configurable via efSearch)

SecondaryIndexManager

Location: secondary_index.cpp, ../include/index/secondary_index.h

B-tree-based secondary indexes for equality, range, and composite queries.

Key Schema:

Single-column:  idx:table:column:value:PK
Composite:      idx:table:col1+col2:val1:val2:PK
Range:          range:table:column:value:PK (lexicographic ordering)
Sparse:         sparse:table:column:value:PK (skips NULL values)
Geo:            geo:table:column:morton_code:PK (Z-order curve)
TTL:            ttl:table:column:expire_ts:PK (time-to-live)

Basic Usage:

SecondaryIndexManager sim(db);

// Create single-column index
sim.createIndex("users", "email", /*unique=*/true);

// Create composite index
sim.createCompositeIndex("orders", {"customer_id", "order_date"});

// Atomic updates via WriteBatch
auto batch = db.createWriteBatch();
BaseEntity user;
user.set("id", "user123");
user.set("email", "john@example.com");
user.set("name", "John Doe");

sim.updateIndex("users", user, batch);
batch.commit();

// Lookup by index
auto pks = sim.lookupByIndex("users", "email", "john@example.com");

Range Queries:

// Create range index for ordered scans
sim.createRangeIndex("events", "timestamp");

// Range query [start, end]
auto results = sim.rangeQuery(
    "events", "timestamp",
    "2024-01-01", "2024-12-31",
    /*limit=*/100
);

Sparse Indexes:

// Sparse index: skip NULL/missing values (reduces index size)
sim.createSparseIndex("products", "discount_code");

// Only indexes products with discount codes
// Products without discount_code are not in the index

Geo Indexes (Z-order curves):

// Geospatial index with bounding box queries
sim.createGeoIndex("locations", "coordinates");

// Bounding box search
geo::MBR bbox{-122.5, 37.7, -122.4, 37.8};  // San Francisco area (min_lon, min_lat, max_lon, max_lat)
auto results = sim.geoQuery("locations", "coordinates", bbox);

// Radius search (within 5km of point)
auto nearby = sim.radiusQuery(
    "locations", "coordinates",
    /*lat=*/37.7749, /*lon=*/-122.4194, /*radius_km=*/5.0
);

TTL Indexes:

// Automatic expiration after TTL
sim.createTTLIndex("sessions", "expires_at", /*ttl_seconds=*/3600);

// Cleanup expired entries (call periodically)
sim.cleanupExpired("sessions", "expires_at");

Performance Characteristics:

  • Point lookups: 10-50μs (with RocksDB cache)
  • Range scans: 100K-500K keys/sec
  • Unique constraint checks: O(1) via bloom filters
  • Space overhead: 20-40% of data size

GraphIndexManager

Location: graph_index.cpp, ../include/index/graph_index.h

Adjacency list indexes for graph traversal with temporal and property graph support.

Key Schema:

Outbound edges:  graph:out:<from_pk>:<edge_id>  -> <to_pk>
Inbound edges:   graph:in:<to_pk>:<edge_id>     -> <from_pk>
Edge properties: edge:<edge_id>                 -> properties

Basic Usage:

GraphIndexManager gim(db);

// Add edge (bidirectional index)
BaseEntity edge;
edge.set("id", "edge123");
edge.set("_from", "user:alice");
edge.set("_to", "user:bob");
edge.set("type", "follows");
edge.set("since", "2024-01-01");

gim.addEdge(edge);

// Query neighbors
auto [status, out_neighbors] = gim.outNeighbors("user:alice");
for (const auto& neighbor_id : out_neighbors) {
    std::cout << "Alice follows: " << neighbor_id << std::endl;
}

auto [status2, in_neighbors] = gim.inNeighbors("user:bob");
// In-neighbors: users who follow Bob

Graph Traversal:

// BFS traversal (breadth-first)
std::vector<std::string> visited;
gim.bfs("user:alice", [&visited](const std::string& node) {
    visited.push_back(node);
    return true;  // Continue traversal
}, /*max_depth=*/3);

// DFS traversal (depth-first)
gim.dfs("user:alice", [](const std::string& node) {
    std::cout << "Visiting: " << node << std::endl;
    return true;
}, /*max_depth=*/5);

// Shortest path (BFS-based)
auto path = gim.shortestPath("user:alice", "user:charlie");

Multi-Graph Support:

// Multiple graphs in single database
edge.set("graphId", "social_network");
gim.addEdge(edge);

edge.set("graphId", "org_chart");
gim.addEdge(edge);

// Query specific graph
auto neighbors = gim.outNeighbors("user:alice", "social_network");

Temporal Graphs:

// Temporal edges with time validity
TemporalGraph temporal_graph(db);
temporal_graph.addEdge(
    "user:alice", "user:bob", "follows",
    /*valid_from=*/"2024-01-01", /*valid_until=*/"2024-12-31"
);

// Time-travel queries (graph state at specific time)
auto neighbors_at = temporal_graph.getNeighborsAt(
    "user:alice", /*timestamp=*/"2024-06-01"
);

Property Graphs:

// Rich edge properties
PropertyGraph prop_graph(db);
std::unordered_map<std::string, std::string> properties = {
    {"weight", "0.8"},
    {"label", "friend"},
    {"created", "2024-01-01"}
};
prop_graph.addEdge("user:alice", "user:bob", properties);

// Query by edge properties
auto edges = prop_graph.findEdgesByProperty("weight", "0.8");

Graph Analytics:

// PageRank
GraphAnalytics analytics(gim);
auto scores = analytics.pageRank(/*iterations=*/10, /*damping=*/0.85);

// Community detection (Louvain algorithm)
auto communities = analytics.detectCommunities();

// Centrality measures
auto betweenness = analytics.betweennessCentrality("user:alice");
auto closeness = analytics.closenessCentrality("user:alice");

Performance Characteristics:

  • Neighbor queries: 10-50μs (in-memory cache)
  • BFS/DFS traversal: 100K-500K nodes/sec
  • Shortest path: O(|V| + |E|) with BFS
  • Space overhead: 2x edges (bidirectional index)

SpatialIndexManager

Location: spatial_index.cpp, ../include/index/spatial_index.h

R-tree implementation for efficient geospatial queries with Morton code (Z-order curve) optimization.

Features:

  • R-tree: Hierarchical spatial index for 2D/3D data
  • Z-order curves: Morton codes for spatial locality
  • Bounding boxes: MBR (Minimum Bounding Rectangle) queries
  • Distance queries: Radius search, nearest neighbors
  • GeoJSON support: Point, LineString, Polygon
  • EWKB encoding: Efficient binary geometry format

Usage:

SpatialIndexManager::Config config;
config.max_entries_per_node = 16;
config.min_entries_per_node = 4;
config.total_bounds = geo::MBR{-180, -90, 180, 90};  // World bounds

SpatialIndexManager spatial(db, config);

// Insert geometry (GeoJSON Point)
spatial.insert(
    "locations", "poi123",
    /*lat=*/37.7749, /*lon=*/-122.4194,
    /*z=*/std::nullopt  // Optional Z-coordinate
);

// Bounding box search
geo::MBR query_bbox{-122.5, 37.7, -122.4, 37.8};  // (min_lon, min_lat, max_lon, max_lat)
auto results = spatial.searchIntersects("locations", query_bbox);

// Radius search (within 5km)
auto nearby = spatial.searchRadius(
    "locations",
    /*center_lat=*/37.7749, /*center_lon=*/-122.4194,
    /*radius_km=*/5.0,
    /*limit=*/10
);

// Nearest neighbors
auto nearest = spatial.kNearestNeighbors(
    "locations",
    /*lat=*/37.7749, /*lon=*/-122.4194,
    /*k=*/5
);

Z-order Curves (Morton Encoding):

// Convert 2D coordinates to 1D Morton code
geo::MBR world_bounds{-180, -90, 180, 90};
uint64_t morton = MortonEncoder::encode2D(
    /*lon=*/-122.4194, /*lat=*/37.7749, world_bounds
);

// Range queries on Morton codes (efficient for nearby points)
auto ranges = MortonEncoder::getRanges(query_bbox, world_bounds);
for (const auto& [start, end] : ranges) {
    // Scan RocksDB range [start, end]
}

Performance Characteristics:

  • Point queries: 10-100μs
  • Bounding box queries: 1-10ms (depends on result size)
  • Radius search: 1-20ms (depends on radius)
  • R-tree height: O(log_M N) where M=16
  • Space overhead: 30-50% of data size

AdaptiveIndexManager

Location: adaptive_index.cpp, ../include/index/adaptive_index.h

Automatic index recommendation based on query pattern analysis.

Features:

  • Query Pattern Tracking: Field access frequency, filter predicates, join conditions
  • Cache-Aware Metrics: L3 cache misses and miss penalties
  • Automatic Recommendations: Suggests indexes based on cost-benefit analysis
  • Index Lifecycle: Create/drop indexes based on usage patterns
  • Anomaly Detection: Identifies slow queries requiring indexes

Usage:

AdaptiveIndexManager adaptive(db);

// Track query patterns
adaptive.recordQuery(
    "users", "email",
    QueryPatternTracker::Operation::EQUALITY,
    /*execution_time_ms=*/45.2,
    /*cache_miss=*/true,
    /*cache_miss_penalty_ms=*/12.3
);

// Get index recommendations
auto recommendations = adaptive.getRecommendations(/*min_benefit=*/100.0);
for (const auto& rec : recommendations) {
    std::cout << "Create index on " << rec.collection 
              << "." << rec.field
              << " (estimated benefit: " << rec.benefit_score << "ms/query)"
              << std::endl;
}

// Auto-create recommended indexes
adaptive.applyRecommendations(recommendations, secondary_index_manager);

// Periodic maintenance
adaptive.cleanupUnusedIndexes(/*unused_days=*/30);

Index Recommendation Algorithm:

// Cost-benefit analysis
struct IndexRecommendation {
    std::string collection;
    std::string field;
    IndexType type;
    double benefit_score;      // Estimated query speedup (ms)
    double creation_cost;      // Index build time (ms)
    double space_cost_mb;      // Storage overhead
    double cache_improvement;  // Reduction in cache misses
};

// Ranking factors:
// 1. Query frequency (high = better)
// 2. Execution time improvement (high = better)
// 3. Cache miss reduction (high = better)
// 4. Selectivity (low cardinality = better for B-tree)
// 5. Space cost (low = better)

Performance Characteristics:

  • Pattern tracking: < 1μs overhead per query
  • Recommendation generation: 10-100ms
  • Auto-index creation: 1-60s (depends on data size)

Architecture

Dependency Injection

The Index module uses dependency injection to break circular dependencies:

┌─────────────────┐
│  IndexManager   │
│  (Coordinator)  │
└────────┬────────┘
         │
         ├─► IExpressionEvaluator (injected from Query module)
         ├─► IStorageEngine (injected from Storage module)
         │
         ├─────────────────────────────────┐
         │                                 │
         ▼                                 ▼
┌────────────────────┐          ┌────────────────────┐
│ VectorIndexManager │          │SecondaryIndexMgr   │
└────────────────────┘          └────────────────────┘
         │                                 │
         ├─► GPUVectorIndex               ├─► SpatialIndexManager
         ├─► ProductQuantizer             ├─► RocksDBWrapper
         ├─► HnswLayerOptimizer           │
         └─► RotaryEmbedding              │
                                          ▼
                                 ┌────────────────────┐
                                 │ GraphIndexManager  │
                                 └────────────────────┘
                                          │
                                          ├─► TemporalGraph
                                          ├─► PropertyGraph
                                          └─► GraphAnalytics

GPU Acceleration Pipeline

┌──────────────┐
│ Query Vector │
└──────┬───────┘
       │
       ▼
┌──────────────────────────┐
│  GPUVectorIndex          │
│  (Backend Selection)     │
└──────┬───────────────────┘
       │
       ├─► Vulkan Backend (Cross-platform: NVIDIA, AMD, Intel, Apple)
       ├─► CUDA Backend (NVIDIA GPUs)
       ├─► HIP Backend (AMD GPUs)
       └─► CPU Fallback (SIMD acceleration)
       │
       ▼
┌──────────────────────────┐
│  Compute Kernel          │
│  (Distance Computation)  │
└──────┬───────────────────┘
       │
       ▼
┌──────────────────────────┐
│  Top-K Selection         │
│  (Heap-based)            │
└──────┬───────────────────┘
       │
       ▼
┌──────────────┐
│   Results    │
└──────────────┘

Quantization Pipeline

┌──────────────────┐
│  Float32 Vectors │
│  (1536D × 4B)    │
│  = 6KB/vector    │
└────────┬─────────┘
         │
         ▼
┌──────────────────────────┐
│  Product Quantization    │
│  (8 subquantizers)       │
└────────┬─────────────────┘
         │
         ▼
┌──────────────────┐
│  8-bit Codes     │
│  (8 × 1B)        │
│  = 8B/vector     │
│  768x compression│
└────────┬─────────┘
         │
         ▼
┌──────────────────────────┐
│  Asymmetric Distance     │
│  Computation (ADC)       │
│  - Query: float32        │
│  - Database: uint8       │
│  - Precompute tables     │
└────────┬─────────────────┘
         │
         ▼
┌──────────────────┐
│  Search Results  │
│  (95-99% recall) │
└──────────────────┘

Integration Points

With Storage Module

// RocksDB persistence for all index types
RocksDBWrapper db(config);

// Vector index persistence
VectorIndexManager vim(db);
vim.init("embeddings", 1536, VectorIndexManager::Metric::COSINE);
vim.addVector("embeddings", "doc123", vector);  // Atomic WriteBatch

// Secondary index atomicity
SecondaryIndexManager sim(db);
auto batch = db.createWriteBatch();
sim.updateIndex("users", entity, batch);  // Index update in same batch
batch.commit();  // Atomic: data + indexes

With Query Module

// Expression evaluation for filter pushdown
auto evaluator = std::make_shared<ExpressionEvaluator>();
vim.setExpressionEvaluator(evaluator);

// Filter during vector search
auto results = vim.searchWithFilter(
    "embeddings", query_vector, /*k=*/10,
    /*filter=*/"metadata.category = 'science' AND year > 2020"
);

With Server Module

// API handler integration
void handleVectorSearch(const Request& req, Response& resp) {
    auto query_vector = req.getVector("vector");
    auto k = req.getInt("k", 10);
    
    auto results = vim.search("embeddings", query_vector, k);
    
    resp.setJson({
        {"results", results},
        {"count", results.size()}
    });
}

API/Usage Examples

Complete RAG Pipeline

// Setup
RocksDBWrapper db(config);
VectorIndexManager vim(db);
SecondaryIndexManager sim(db);

// Initialize indexes
vim.init("documents", 1536, VectorIndexManager::Metric::COSINE);
sim.createIndex("documents", "category");
sim.createIndex("documents", "year");

// Ingest documents
for (const auto& doc : documents) {
    // Generate embedding
    auto embedding = model.encode(doc.text);
    
    // Store in vector index
    vim.addVector("documents", doc.id, embedding);
    
    // Store metadata
    BaseEntity entity;
    entity.set("id", doc.id);
    entity.set("category", doc.category);
    entity.set("year", doc.year);
    entity.set("text", doc.text);
    
    auto batch = db.createWriteBatch();
    sim.updateIndex("documents", entity, batch);
    batch.commit();
}

// Query with filters
auto query_embedding = model.encode(query_text);
auto results = vim.searchWithFilter(
    "documents", query_embedding, /*k=*/10,
    /*filter=*/"category = 'science' AND year > 2020"
);

// Retrieve full documents
for (const auto& result : results) {
    auto doc = db.get("documents:" + result.pk);
    std::cout << "Score: " << result.distance << std::endl;
    std::cout << "Text: " << doc["text"] << std::endl;
}

Graph Knowledge Base

// Build knowledge graph
GraphIndexManager gim(db);
VectorIndexManager vim(db);

// Add entities with embeddings
vim.init("entities", 768, VectorIndexManager::Metric::COSINE);
vim.addVector("entities", "entity:einstein", einstein_embedding);
vim.addVector("entities", "entity:relativity", relativity_embedding);

// Add relationships
BaseEntity edge;
edge.set("id", "edge123");
edge.set("_from", "entity:einstein");
edge.set("_to", "entity:relativity");
edge.set("type", "DISCOVERED");
edge.set("year", "1905");
gim.addEdge(edge);

// Hybrid query: vector similarity + graph traversal
auto similar = vim.search("entities", query_embedding, /*k=*/5);
for (const auto& entity : similar) {
    auto neighbors = gim.outNeighbors(entity.pk);
    // Expand context via graph
}

Geospatial Search

// Index locations
SpatialIndexManager spatial(db, config);
SecondaryIndexManager sim(db);

sim.createGeoIndex("restaurants", "location");

// Add restaurants
BaseEntity restaurant;
restaurant.set("id", "rest123");
restaurant.set("name", "Luigi's Pizza");
restaurant.set("lat", 37.7749);
restaurant.set("lon", -122.4194);
restaurant.set("category", "Italian");

auto batch = db.createWriteBatch();
sim.updateIndex("restaurants", restaurant, batch);
spatial.insert("restaurants", "rest123", 37.7749, -122.4194);
batch.commit();

// Find nearby restaurants
auto nearby = spatial.searchRadius(
    "restaurants",
    /*lat=*/37.7749, /*lon=*/-122.4194,
    /*radius_km=*/2.0,
    /*limit=*/10
);

// Filter by category
std::vector<SpatialResult> italian_restaurants;
for (const auto& result : nearby) {
    auto doc = db.get("restaurants:" + result.primary_key);
    if (doc["category"] == "Italian") {
        italian_restaurants.push_back(result);
    }
}

Dependencies

Required

  • RocksDB: Key-value storage (via storage/rocksdb_wrapper.h)
  • nlohmann/json: JSON serialization
  • Eigen3: Linear algebra for vector operations (optional, fallback available)

Optional

  • FAISS: Facebook AI Similarity Search (advanced vector indexes, IVF+PQ)
    • Enables: IVF clustering, Product Quantization, GPU acceleration
    • License: MIT
  • HNSWlib: HNSW implementation (if not using FAISS)
    • License: Apache 2.0
  • Vulkan: Cross-platform GPU compute (via acceleration/compute_backend.h)
    • Linux: vulkan-devel, vulkan-loader
    • Windows: Vulkan SDK
    • macOS: MoltenVK
  • CUDA: NVIDIA GPU acceleration (optional)
    • Requires: CUDA Toolkit 11.0+
  • HIP: AMD GPU acceleration (optional, planned)
    • Requires: ROCm 5.0+

Build Configuration

# Enable GPU acceleration
option(THEMIS_GPU_ENABLED "Enable GPU acceleration" ON)
option(THEMIS_HAS_FAISS "Enable FAISS integration" ON)
option(THEMIS_CUDA_ENABLED "Enable CUDA backend" ON)
option(THEMIS_VULKAN_ENABLED "Enable Vulkan backend" ON)

# Find packages
find_package(FAISS QUIET)
find_package(Vulkan QUIET)
find_package(CUDAToolkit QUIET)

Performance Characteristics

Vector Index (HNSW)

  • Construction: O(N log N) where N = number of vectors
  • Memory: 1.5-2x of raw vectors (HNSW graph), 0.01-0.1x with PQ
  • Search Latency:
    • CPU (brute-force): 10-100ms per query (1M vectors)
    • CPU (HNSW): 0.1-1ms per query
    • GPU (Vulkan): 0.01-0.1ms per query (batch)
  • Throughput:
    • CPU: 1K-10K queries/sec
    • GPU: 100K-500K queries/sec (batch size 512)
  • Recall@10: 95-99% (configurable via efSearch)

Secondary Index (B-tree)

  • Lookup: O(log N) via RocksDB LSM-tree
  • Point query: 10-50μs (with cache)
  • Range scan: 100K-500K keys/sec
  • Space overhead: 20-40% of data size
  • Unique constraint: O(1) check via bloom filters

Graph Index (Adjacency List)

  • Add edge: O(1) with in-memory cache
  • Neighbor query: O(degree) = 10-50μs typically
  • BFS/DFS: O(|V| + |E|) = 100K-500K nodes/sec
  • Shortest path: O(|V| + |E|) with BFS
  • Space overhead: 2x edges (bidirectional index)

Spatial Index (R-tree)

  • Construction: O(N log N)
  • Point query: O(log N) = 10-100μs
  • Bounding box: O(log N + k) where k = result count
  • Radius search: O(log N + k)
  • Tree height: O(log_M N) where M=16 (fanout)
  • Space overhead: 30-50% of data size

Quantization Performance

  • Product Quantization (PQ):
    • Compression: configurable via code size (e.g., 1536D float32 → 8 B codes, 768x)
    • Search speedup: 2-5x (via ADC tables)
    • Recall: 95-99% (configurable)
    • Training: 1-10 minutes (100K vectors)
  • Binary Quantization:
    • Compression: 32x (float32 → 1 bit per dimension)
    • Search speedup: 10-50x (bitwise ops)
    • Recall: 90-95%

Known Limitations

Vector Index

  1. HNSW Construction: Cannot efficiently remove vectors (requires rebuild)
    • Workaround: Mark as deleted, rebuild periodically
  2. Memory Usage: HNSW requires all vectors in memory
    • Workaround: Use IVF+PQ for disk-based indexes
  3. High-Dimensional Data: Performance degrades beyond 2048D
    • Workaround: Use dimensionality reduction (PCA, UMAP)
  4. GPU Memory: Limited by VRAM (typically 8-24GB)
    • Workaround: Batch processing, multi-GPU sharding

Secondary Index

  1. No Full-Text Search: Only equality and range queries
    • Planned: Full-text index in v1.7.0
  2. Composite Index Ordering: Fixed column order for prefix scans
    • Workaround: Create multiple indexes for different orderings
  3. Unique Constraint Race: Possible with concurrent writes
    • Workaround: Use pessimistic transactions

Graph Index

  1. No Distributed Traversal: Single-node only
    • Planned: Distributed graph queries in v1.7.0
  2. Limited Cycle Detection: Expensive on large graphs
    • Workaround: Set max_depth parameter
  3. Memory Usage: In-memory cache for topology
    • Workaround: Disable cache for large graphs (> 10M edges)

Spatial Index

  1. 2D/3D Only: No support for higher dimensions
    • Workaround: Use vector index for high-dimensional spatial data
  2. Global Bounds: Must know bounds in advance
    • Workaround: Use conservative bounds, update periodically
  3. LineString/Polygon: Limited to MBR approximation
    • Planned: Exact geometry intersection in v1.7.0

Status

Current Version: v1.5.0+

Maturity:

  • Vector Index: ✅ Production Ready (v1.3.0+)
    • HNSW: Stable, well-tested
    • GPU Acceleration: Beta (Vulkan v2.2)
    • FAISS Integration: Stable (v1.5.0+)
  • Secondary Index: ✅ Production Ready (v1.0.0+)
  • Graph Index: ✅ Production Ready (v1.2.0+)
  • Spatial Index: ⚠️ Beta (v1.4.0+)
    • R-tree: Stable
    • GeoJSON: Stable
    • Full geometry operations: In progress
  • Adaptive Index: ⚠️ Beta (v1.5.0+)
    • Pattern tracking: Stable
    • Recommendations: Beta

Recent Changes:

  • v1.5.0: FAISS integration, IVF+PQ, ADC tables, Product Quantization
  • v1.4.2: Residual Quantization, Learned Quantization
  • v1.4.0: Spatial index (R-tree), Z-order curves
  • v1.3.0: GPU acceleration (Vulkan), rotary embeddings
  • v1.2.0: Graph analytics, temporal graphs, property graphs
  • v1.1.0: Composite indexes, sparse indexes, TTL indexes

Stability:

  • Test Coverage: 85%+ (comprehensive unit + integration tests)
  • Memory Leaks: Clean (Valgrind verified)
  • Thread Safety: All managers are thread-safe
  • Production Usage: Running in multiple deployments

Related Documentation

Module Documentation

Technical Documentation

API References

Examples

External Resources

Scientific References

  1. Malkov, Y. A., & Yashunin, D. A. (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824–836. https://doi.org/10.1109/TPAMI.2018.2889473

  2. Guttman, A. (1984). R-Trees: A Dynamic Index Structure for Spatial Searching. Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, 47–57. https://doi.org/10.1145/602259.602266

  3. Bayer, R., & McCreight, E. (1972). Organization and Maintenance of Large Ordered Indexes. Acta Informatica, 1(3), 173–189. https://doi.org/10.1007/BF00288683

  4. Beckmann, N., Kriegel, H.-P., Schneider, R., & Seeger, B. (1990). The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, 322–331. https://doi.org/10.1145/93597.98741

  5. Johnson, J., Douze, M., & Jégou, H. (2019). Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data, 7(3), 535–547. https://doi.org/10.1109/TBDATA.2019.2921572