- BM25 hybrid search with Reciprocal Rank Fusion (RRF) blending lexical and semantic results; configurable k1/b/RRF-k constants
- HNSW vector indexing supporting Cosine, Dot Product, and L2 distance metrics with 85%+ recall@10
- LLM-based query rewriting to improve recall for underspecified or noisy queries
- Faceted search with per-field value-count facets and numeric range buckets with drill-down filtering
- Learning-to-rank (LTR) re-ranker using a 6-dimensional feature vector with online pairwise gradient-descent training
- Multi-field boosting (title > body > tags) with configurable per-field weight vectors
- Neural sparse retrieval (SPLADE) for vocabulary expansion and out-of-vocabulary term handling
- Cross-lingual semantic search via multilingual embedding models with automatic language detection
- Hybrid search (BM25 + HNSW RRF) must return top-10 results in ≤ 20 ms at p99 for indices up to 10 M documents
- BM25 default parameters (k1=1.2, b=0.75) must remain unchanged; parameter changes must be explicit
- `HybridSearch` API (RRF weights, mode selection) must remain stable from v1.2.0 onwards; no breaking changes
- LLM query rewriting must degrade gracefully to the original query when the LLM is unavailable or exceeds 200 ms
- LTR model updates must be applied online (no index rebuild) and must not affect in-flight queries
- SPLADE sparse vectors must be stored in a compressed sparse row (CSR) format to cap memory at ≤ 4 GB per 10 M docs
- Cross-lingual search must support at least 50 languages via a single multilingual model with no per-language index
- Facet counting must complete in ≤ 5 ms for up to 1,000 distinct facet values on a result set of 100,000 documents
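The weighted RRF blending that underpins the hybrid-mode requirement above can be sketched in a few lines. This is an illustrative standalone function, not the module's actual `HybridSearch` implementation; the function name and signature are assumptions, though the formula (sum of `1 / (rrf_k + rank)` contributions per list) is the standard one:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Reciprocal Rank Fusion over any number of ranked result lists.
// rrf_k is the damping constant (60 is the common default); a document's
// fused score is the sum of 1/(rrf_k + rank) over every list it appears in.
std::vector<std::pair<std::string, double>>
rrfFuse(const std::vector<std::vector<std::string>>& ranked_lists,
        double rrf_k = 60.0) {
    std::unordered_map<std::string, double> scores;
    for (const auto& list : ranked_lists) {
        for (size_t rank = 0; rank < list.size(); ++rank) {
            // 1-based rank; a lower rank yields a larger contribution
            scores[list[rank]] += 1.0 / (rrf_k + static_cast<double>(rank + 1));
        }
    }
    std::vector<std::pair<std::string, double>> fused(scores.begin(), scores.end());
    std::sort(fused.begin(), fused.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return fused;
}
```

A document ranked first in both the lexical and the semantic list dominates the fused ranking even though neither raw score is comparable across retrievers, which is why RRF needs no score calibration.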
| Interface | Consumer | Notes |
|---|---|---|
| `HybridSearch::search(query, top_k, mode)` | HTTP API, SDK | Modes: TEXT / VECTOR / HYBRID; returns `SearchResult` list |
| `LlmQueryRewriter::rewrite(query)` | HybridSearch pre-processor | Returns rewritten query string; 200 ms timeout with original fallback |
| `FacetedSearch::facet(results, fields)` | API response builder | Returns `FacetMap`; supports value-count and numeric range buckets |
| `LearningToRank::rerank(query, results)` | Result post-processor | Online pairwise training; model update via `train(feedback)` |
| `MultiFieldBoostedSearch::search(query, weights)` | API handler | Per-field boost weights map; falls back to equal weights if empty |
| `NeuralSparseRetriever::retrieve(query, top_k)` | Hybrid retrieval pipeline | SPLADE sparse vectors; CSR storage format |
| `CrossLingualSearch::search(query, lang_hint)` | Multilingual API | Auto-detects language; uses multilingual embedding model |
| `SearchAnalytics::record(query, latency_ms)` | Request middleware | Thread-safe; exports avg/p95/p99 latency and zero-result rate |
The following production-readiness improvements were delivered in v1.4.0:
- Configurable vector metric (COSINE / DOT / L2) — was hardcoded to COSINE
- Strict config validation with `std::invalid_argument` on bad parameters
- Hard resource limits (`max_k`, `max_candidates`) to bound memory / latency
- Score normalization edge-case fixes (range == 0 / single-result)
- Linear-combination pre-normalization (scores always in [0,1] before weighting)
- `SearchStats` struct for graceful degradation and partial-result detection
- Thread-safety and exception-safety documentation
- `normalizeScores` promoted to `public static` for testability
- Comprehensive test suite: `test_hybrid_search.cpp`, `test_rrf_fusion.cpp`, `test_score_normalization.cpp`, `test_hybrid_search_integration.cpp`
- `benchmark_hybrid_search.cpp` for algorithmic performance measurement
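A minimal sketch of the min-max normalization with the edge cases mentioned above. The policy of mapping a zero-range input (all scores equal, including a single result) to 1.0 is an assumption for illustration; the document only states that the edge cases were fixed, not which value they resolve to:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Min-max normalization to [0,1]. Edge cases: an empty input is a no-op;
// a zero range (all scores equal, or a single result) maps every score
// to 1.0 so a lone perfect match is not flattened to 0 (assumed policy).
void normalizeScores(std::vector<double>& scores) {
    if (scores.empty()) return;
    auto [mn_it, mx_it] = std::minmax_element(scores.begin(), scores.end());
    const double mn = *mn_it;
    const double range = *mx_it - mn;
    for (double& s : scores) {
        s = (range == 0.0) ? 1.0 : (s - mn) / range;
    }
}
```

Pre-normalizing both score lists this way is what makes the linear-combination mode's weighting meaningful: BM25 and vector distances live on incompatible scales until both are mapped into [0,1].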
The following high-priority features were delivered in v1.5.0:
- Synonym expansion with configurable maximum expansion count
- Per-token spelling correction using Levenshtein edit distance against a user-supplied vocabulary
- Alternative query generation by substituting tokens with their synonyms
- Zero-result fallback via `relaxQuery()` (drops the last token)
- Tests: `tests/test_query_expander.cpp`
- Levenshtein edit-distance search wrapping `SecondaryIndexManager::scanFulltextFuzzy`
- Public static utilities: `levenshtein()`, `soundex()`, `metaphone()`, `ngramSimilarity()`
- Unified [0,1] similarity score across all algorithms
- Bigram/N-gram overlap (Dice coefficient)
- Tests: `tests/test_fuzzy_matcher.cpp`
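Two of the utilities above can be sketched as follows. These are illustrative reimplementations of the standard algorithms, not the module's code; the real `ngramSimilarity()` may differ in how it handles very short strings:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein edit distance with a
// two-row rolling buffer (O(|a| * |b|) time, O(|b|) space).
size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= b.size(); ++j) {
            size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Dice coefficient over character bigrams, in [0,1]:
// 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|).
double ngramSimilarity(const std::string& a, const std::string& b) {
    if (a.size() < 2 || b.size() < 2) return a == b ? 1.0 : 0.0;
    std::multiset<std::string> ba, bb, shared;
    for (size_t i = 0; i + 1 < a.size(); ++i) ba.insert(a.substr(i, 2));
    for (size_t i = 0; i + 1 < b.size(); ++i) bb.insert(b.substr(i, 2));
    std::set_intersection(ba.begin(), ba.end(), bb.begin(), bb.end(),
                          std::inserter(shared, shared.begin()));
    return 2.0 * shared.size() / (ba.size() + bb.size());
}
```

Both already produce the unified [0,1] scale after normalization: Dice is naturally in [0,1], and edit distance can be mapped via `1 - distance / max(len_a, len_b)`.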
- `computeFacet()` — value-count facets for a single column
- `computeFacets()` — multi-column facets in one call
- `computeRangeFacet()` — numeric range buckets
- `applyFacetFilters()` — intersect candidate PK sets with active field=value constraints
- Tests: `tests/test_faceted_search.cpp`
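The core of value-count faceting is a single counting pass over the result set. A minimal sketch over in-memory documents follows; the real `computeFacet()` works against secondary indexes rather than materialized maps, so treat the types here as assumptions:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// facet value -> number of matching documents
using FacetMap = std::map<std::string, size_t>;

// Value-count facet: counts the distinct values of one field across a
// result set; documents missing the field are simply skipped.
FacetMap computeFacet(
    const std::vector<std::unordered_map<std::string, std::string>>& docs,
    const std::string& field) {
    FacetMap counts;
    for (const auto& doc : docs) {
        auto it = doc.find(field);
        if (it != doc.end()) ++counts[it->second];
    }
    return counts;
}
```

Drill-down filtering then intersects the candidate set with one `field=value` constraint per active facet before re-counting the remaining facets.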
- Thread-safe query event logging with configurable max-events cap (circular eviction)
- `computeMetrics()` — average, p95, p99 latency; zero-result rate; top-20 queries
- `getZeroResultQueries()` — retrieve most-recent zero-result events for alerting
- `getRecentEvents()` — retrieve recent event log
- Tests: `tests/test_search_analytics.cpp`
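The p95/p99 figures above reduce to a percentile over the buffered latencies. A nearest-rank sketch is below; whether `computeMetrics()` uses nearest-rank or an interpolated percentile is not stated in this document, so this variant is an assumption:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Nearest-rank percentile (p in [0,100]) over recorded latencies:
// sort, then take the value at rank ceil(p/100 * n).
double percentile(std::vector<double> latencies, double p) {
    if (latencies.empty()) return 0.0;
    std::sort(latencies.begin(), latencies.end());
    size_t rank = static_cast<size_t>(
        std::ceil(p / 100.0 * static_cast<double>(latencies.size())));
    if (rank == 0) rank = 1;  // p == 0 still selects the minimum
    return latencies[rank - 1];
}
```

Taking the vector by value keeps the caller's event log unsorted, which matters when the same buffer also feeds `getRecentEvents()` in insertion order.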
- `suggestByPrefix()` — secondary-index prefix scan returning matching field values
- `suggestPopular()` — popular past queries starting with the prefix (backed by SearchAnalytics)
- `suggest()` — combined, deduplicated, score-ranked suggestions from both sources
- Tests: `tests/test_autocomplete.cpp`
- `rerank()` — dot-product linear re-ranker over 6-dimensional `RankingFeatures`
- `recordClick()` / `train()` — pairwise gradient-descent training from click-through events
- `registerVariant()` / `selectVariant()` — deterministic A/B traffic splitting
- `rerankWithVariant()` — apply a named scorer; falls back to default if variant not found
- Tests: `tests/test_learning_to_rank.cpp`
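One pairwise gradient-descent step can be sketched as follows. The document says "pairwise gradient-descent" without naming the loss, so the RankNet-style logistic pairwise loss used here is an assumption; the 6-feature linear scorer matches the description above:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kNumFeatures = 6;
using Features = std::array<double, kNumFeatures>;

// Linear scorer: dot product of the weight vector with the feature vector.
double score(const Features& weights, const Features& f) {
    double s = 0.0;
    for (std::size_t i = 0; i < kNumFeatures; ++i) s += weights[i] * f[i];
    return s;
}

// One SGD step on the logistic pairwise loss log(1 + exp(-margin)),
// where margin = score(preferred) - score(other). The update nudges the
// weights so the clicked (preferred) document scores higher.
void pairwiseUpdate(Features& weights, const Features& preferred,
                    const Features& other, double learning_rate = 0.1) {
    double margin = 0.0;
    for (std::size_t i = 0; i < kNumFeatures; ++i)
        margin += weights[i] * (preferred[i] - other[i]);
    // dLoss/dmargin = -1 / (1 + exp(margin)); chain rule gives the
    // per-weight gradient coeff * (preferred[i] - other[i]).
    const double coeff = -1.0 / (1.0 + std::exp(margin));
    for (std::size_t i = 0; i < kNumFeatures; ++i)
        weights[i] -= learning_rate * coeff * (preferred[i] - other[i]);
}
```

Because each update touches only a 6-element weight vector, applying it online (no index rebuild, no effect on in-flight queries reading a copied weight snapshot) is cheap, which is what the constraints section requires.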
- Accepts `ModalQuery` components (TEXT, IMAGE, AUDIO, CUSTOM), each with a weight
- TEXT modality: delegates to `SecondaryIndexManager::scanFulltextWithScores`
- Embedding modalities: delegate to `VectorIndexManager::searchKnn`
- All result lists fused via weighted RRF; top-k returned
- `searchTextAndImage()` convenience method for the most common bi-modal case
- Tests: `tests/test_multi_modal_search.cpp`
- LLM-based query rewriting for improved recall via injected `LlmBackend` callback
- Builds structured prompts instructing the LLM to produce `n` numbered alternative phrasings
- Parses numbered lines (supporting `.`, `)`, and `:` separators) from the LLM response
- Deduplication: case-insensitive comparison removes duplicate and original-equivalent rewrites
- Length guard: rewrites exceeding `max_rewrite_length` are silently dropped
- Graceful fallback: returns original query when backend is absent, throws, or produces no output
- `setBackend()`: runtime backend replacement (e.g. after model load)
- Tests: `tests/test_llm_query_rewriter.cpp`
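The numbered-line parsing step can be sketched as below. This is an illustrative parser for the three separators named above (`.`, `)`, `:`); the function name and the exact whitespace handling are assumptions:

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Extract "1. foo" / "2) bar" / "3: baz" style rewrites from an LLM
// response, one per line; lines without a leading number and a supported
// separator are ignored (chatter, blank lines, preambles).
std::vector<std::string> parseNumberedLines(const std::string& response) {
    std::vector<std::string> out;
    std::istringstream stream(response);
    std::string line;
    while (std::getline(stream, line)) {
        size_t i = 0;
        while (i < line.size() && std::isspace(static_cast<unsigned char>(line[i]))) ++i;
        size_t digits = i;
        while (digits < line.size() && std::isdigit(static_cast<unsigned char>(line[digits]))) ++digits;
        if (digits == i || digits >= line.size()) continue;  // no leading number
        const char sep = line[digits];
        if (sep != '.' && sep != ')' && sep != ':') continue; // unsupported separator
        size_t start = digits + 1;
        while (start < line.size() && std::isspace(static_cast<unsigned char>(line[start]))) ++start;
        if (start < line.size()) out.push_back(line.substr(start));
    }
    return out;
}
```

Skipping non-matching lines rather than failing on them is what makes the graceful-fallback contract easy to keep: a malformed response simply yields zero rewrites and the original query is used.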
- `SpellingCorrection` struct: `suggestion`, `edit_distance`, `confidence` (normalized [0,1])
- `QueryExpander::suggestSpellingCorrections(word, max_suggestions)` — returns up to `max_suggestions` ranked candidates from the registered vocabulary within `max_edit_distance`, sorted by ascending edit distance (alphabetical tiebreak); confidence decays linearly with distance
- `QueryExpander::suggestQueryCorrections(query, max_suggestions)` — tokenizes the query, builds per-token substitution variants and one all-corrected variant, deduplicates, and sorts by total edit distance ascending
- Tests: `tests/test_query_expander.cpp` (14 new test cases)
- Configurable re-ranking with LLM feedback loop, closing Phase 3 of the Search Roadmap
- Injected `LlmBackend` callable (same pattern as `LlmQueryRewriter`) — no hard LLM library dependency
- `LlmRerankCandidate` and `LlmRerankResult` value types; `rerank()` accepts any result source
- Batched prompting: splits candidates into configurable `batch_size` groups for cost control
- Score parsing: per-document 0–10 integer scores from the LLM response, normalised to [0, 1]
- Out-of-range score clamping (values outside [0, 10] clamped before normalisation)
- Configurable blending: `final = llm_weight * llm_score + (1 - llm_weight) * initial_score`
- `min_score_threshold` filter: removes low-confidence results from the output
- Graceful fallback: returns original order when backend is absent, throws, or produces no output
- `setBackend()`: runtime backend replacement (e.g. after model load)
- `toClickEvents()` static method: converts LLM relevance judgments to `ClickEvent` objects for direct consumption by `LearningToRank::recordClick()`, closing the feedback loop
- Tests: `tests/test_llm_reranker.cpp`
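The clamp-normalise-blend step above is small enough to show in full. The formula is the one stated in the feature list; the standalone function name is an assumption for illustration:

```cpp
#include <algorithm>
#include <cassert>

// Blend an LLM relevance judgment (0-10 integer) with the initial
// retrieval score. Out-of-range LLM scores are clamped into [0, 10]
// before normalisation to [0, 1], then linearly mixed:
//   final = llm_weight * llm_score + (1 - llm_weight) * initial_score
double blendScores(int llm_raw, double initial_score, double llm_weight) {
    const int clamped = std::clamp(llm_raw, 0, 10);
    const double llm_score = clamped / 10.0;
    return llm_weight * llm_score + (1.0 - llm_weight) * initial_score;
}
```

Setting `llm_weight` to 0 recovers the original ranking exactly, which gives a cheap kill switch when the LLM's judgments regress.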
- `HybridSearch::setReranker()` attaches an `LlmReranker` to the search pipeline
- Applied as a post-fusion step after RRF (or linear combination): top-N results are re-scored by the LLM and returned in LLM-determined order with updated `hybrid_score`
- Null backend disables re-ranking; `search()` returns RRF order unchanged
- `HybridSearch::Result::content` field forwarded to `LlmRerankCandidate::content`
- Tests: `tests/test_hybrid_search.cpp` (`HybridSearchReranker` test group)
- Multi-field boosted full-text search with per-field BM25 scoring and configurable boost weights
- Implements the "title > body > tags" priority: title boost=3.0, body boost=1.0, tags boost=0.5
- `FieldConfig` value type: `{table, column, boost}` — fully configurable, any field combination
- `search(query, fields)` — queries each field independently, normalises BM25 scores to [0, 1] per field, multiplies by the field boost, and accumulates into a combined per-document score
- Score combination formula: `score(doc) = Σ_f( boost_f × normalised_bm25_score_f(doc) )`
- Documents appearing in only a subset of fields are still ranked (missing fields contribute 0)
- `defaultFields(table)` — convenience factory: title/body/tags with the canonical boost hierarchy
- `Result` struct: `document_id`, combined `score`, and per-field `field_scores` breakdown
- Graceful handling: negative-boost fields are skipped with a warning; empty query/fields return immediately; null index returns empty without throwing; all exceptions caught inside `search()`
- Exception safety: `search()` never throws; errors logged via `THEMIS_WARN` / `THEMIS_ERROR`
- Tests: `tests/test_multi_field_search.cpp`
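The score-combination formula above can be sketched over pre-computed per-field BM25 maps. The types here are assumptions (the real class queries the index per field rather than receiving score maps), but the normalise-boost-accumulate logic matches the description, including skipping negative-boost fields and letting missing fields contribute 0:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using FieldScores = std::unordered_map<std::string, double>;  // doc_id -> raw BM25

// Each entry pairs a field boost with that field's raw BM25 scores.
// Per field: min-max normalise to [0,1], multiply by the boost, and
// accumulate per document; a zero range (single result) normalises to 1.
std::unordered_map<std::string, double> combineFieldScores(
    const std::vector<std::pair<double, FieldScores>>& boosted_fields) {
    std::unordered_map<std::string, double> combined;
    for (const auto& [boost, scores] : boosted_fields) {
        if (boost < 0.0 || scores.empty()) continue;  // skip negative boosts
        double mn = scores.begin()->second, mx = mn;
        for (const auto& [id, s] : scores) {
            mn = std::min(mn, s);
            mx = std::max(mx, s);
        }
        const double range = mx - mn;
        for (const auto& [id, s] : scores) {
            const double norm = (range == 0.0) ? 1.0 : (s - mn) / range;
            combined[id] += boost * norm;  // missing fields simply add nothing
        }
    }
    return combined;
}
```

Normalising per field before boosting is the key design choice: raw BM25 magnitudes differ wildly between a short title column and a long body column, so without normalisation the boosts would not mean what they say.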
- Per-user interaction history tracking for personalized search result re-ranking
- `InteractionType` enum: VIEW (0.2), CLICK (0.5), BOOKMARK (1.0), LIKE (1.0), DISLIKE (-0.5)
- `UserInteraction` struct: `user_id`, `document_id`, `type`, `timestamp`
- Time-decayed scoring: `score += type_weight * exp(-decay_rate * age_days)` — more recent interactions contribute more
- Configurable decay rate (default 0.05 ≈ half-weight after ~14 days), boost weight, and per-user buffer size
- Per-user history bounded by `Config::max_interactions_per_user` with oldest-first eviction
- `computeScore(user_id, doc_id)` — returns personalization score in [-1, 1] for any (user, document) pair
- `applyPersonalization(user_id, candidates)` — adjusts `final_score` of ranked candidates and re-sorts
- `getUserInteractions(user_id)` — inspect stored interactions (most-recent first)
- `clearUser(user_id)` / `clear()` — GDPR-compatible user-data removal
- Thread-safe: all methods protected by a shared `std::mutex`
- Tests: `tests/test_personalized_ranker.cpp` (28 test cases)
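The time-decay term is worth seeing in isolation, since it explains the "half-weight after ~14 days" default: exp(-0.05 × 14) = exp(-0.7) ≈ 0.50. The formula is the one stated above; the free-function form is an assumption:

```cpp
#include <cassert>
#include <cmath>

// Time-decayed contribution of a single interaction:
//   type_weight * exp(-decay_rate * age_days)
// With decay_rate = 0.05, a two-week-old CLICK contributes roughly half
// of a fresh one, and a two-month-old one roughly 5%.
double interactionContribution(double type_weight, double age_days,
                               double decay_rate = 0.05) {
    return type_weight * std::exp(-decay_rate * age_days);
}
```

Exponential decay keeps every stored interaction relevant without a hard cutoff; the bounded per-user buffer, not the decay, is what caps memory.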
- Cross-lingual semantic search via multilingual embeddings (Phase 4 of the Search Roadmap)
- Model-agnostic design: callers supply pre-computed float vectors (e.g. from `paraphrase-multilingual-mpnet-base-v2` or LaBSE); no hard embedding-library dependency
- `search(embedding, hints)` — single-embedding kNN query with distance-to-similarity conversion (`1 / (1 + distance)`), optional per-language boost, score threshold filter, k-cap
- `searchMultiEmbedding(queries, hints)` — issues independent kNN queries for each `EmbeddingQuery` (embedding + weight), fuses ranked lists via weighted Reciprocal Rank Fusion (RRF), then applies language boosts and threshold filtering
- `setLanguageMap(map)` — supplies a `doc_id → language_code` (ISO 639-1) mapping for result annotation (`Result::language`) and `LanguageHint` boost application
- `LanguageHint` struct: `{language_code, boost}` — multiplies the final score of results in the specified language; zero or negative boost hints are silently ignored
- `EmbeddingQuery` struct: `{embedding, weight}` — weight drives the RRF contribution term
- Resource safety: `k` and `candidates` are clamped to `max_k` / `max_candidates` at construction time; `search()` and `searchMultiEmbedding()` never throw
- Config validation: throws `std::invalid_argument` on k=0, candidates=0, rrf_k ≤ 0
- Tests: `tests/test_cross_lingual_search.cpp`
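The distance-to-similarity conversion and language boost compose as a single multiplication. The formula `1 / (1 + distance)` is the one stated above; combining the boost in the same helper is an illustrative assumption:

```cpp
#include <cassert>

// Convert a kNN distance into a similarity in (0, 1]: distance 0 maps
// to 1 and larger distances decay toward 0. An optional LanguageHint
// boost then multiplies the final score for matching-language results.
double toSimilarity(double distance, double language_boost = 1.0) {
    return (1.0 / (1.0 + distance)) * language_boost;
}
```

Because the conversion is monotonic, it preserves kNN order within a single query; the boost is what can reorder results across languages.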
- SPLADE / BERT-based neural sparse retrieval engine, closing a Phase 4 item of the Search Roadmap
- `SparseVector` type alias: `unordered_map<string, float>` — non-zero learned term weights
- `SparseEncoderBackend` callable type — same injected-backend pattern as `LlmReranker::LlmBackend`
- `addDocument(doc_id, sparse_vec)` — indexes a pre-computed sparse vector; negative weights clamped
- `addDocumentText(doc_id, text)` — encodes text via the attached backend, then indexes it
- `removeDocument(doc_id)` — clean removal from both forward and inverted indexes
- `search(query_vec, k)` — inverted-index dot-product accumulation; O(|q| × postings)
- `searchText(query_text, k)` — encodes the query, then calls `search()`; never throws
- Scoring: `score(q, d) = Σ_t( q[t] * d[t] )` — standard inner product over shared terms
- `Config::max_terms_per_doc` — soft cap; top-weighted terms kept, remainder discarded
- `Config::score_threshold` — pre-normalization minimum score filter
- `Config::normalize_scores` — optional linear rescaling to [0, 1] (same edge-case handling as `HybridSearch::normalizeScores`)
- `normalizeScores()` promoted to `public static` for direct unit testing
- Result struct: `document_id`, `score` (normalized), `raw_score` (inner product before normalization)
- Graceful fallback: no encoder → text methods are no-ops; encoder throws → empty result, no propagation
- Tests: `tests/test_neural_sparse_retrieval.cpp`
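The inverted-index dot-product accumulation can be sketched as a small index structure. This is an illustrative in-memory version (the production index stores postings in CSR form to meet the memory budget); types mirror the aliases above but the struct itself is an assumption:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using SparseVector = std::unordered_map<std::string, float>;

// Term-at-a-time scoring over an inverted index: only postings for the
// query's non-zero terms are touched, so cost is O(|q| * postings) rather
// than O(collection size).
struct SparseIndex {
    // term -> list of (doc_id, learned weight)
    std::unordered_map<std::string,
                       std::vector<std::pair<std::string, float>>> postings;

    void addDocument(const std::string& doc_id, const SparseVector& vec) {
        for (const auto& [term, w] : vec)
            if (w > 0.0f) postings[term].push_back({doc_id, w});  // clamp negatives
    }

    // score(q, d) = sum over shared terms t of q[t] * d[t]
    std::vector<std::pair<std::string, float>>
    search(const SparseVector& query, size_t k) const {
        std::unordered_map<std::string, float> acc;
        for (const auto& [term, qw] : query) {
            auto it = postings.find(term);
            if (it == postings.end()) continue;
            for (const auto& [doc, dw] : it->second) acc[doc] += qw * dw;
        }
        std::vector<std::pair<std::string, float>> out(acc.begin(), acc.end());
        std::sort(out.begin(), out.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
        if (out.size() > k) out.resize(k);
        return out;
    }
};
```

Because SPLADE expands documents into weighted vocabulary terms, this same accumulation naturally matches queries against expansion terms the document text never contained.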
Priority: High
Status: ✅ Delivered in v1.5.0 — see include/search/query_expander.h
Automatically expand and rewrite queries for better results.
Features:
- Synonym expansion using thesaurus
- Query relaxation for zero-result queries
- Spelling correction and suggestion
- Query intent detection
- Automatic phrase detection
Implementation:
```cpp
class QueryExpander {
public:
    struct ExpansionConfig {
        bool use_synonyms = true;
        bool correct_spelling = true;
        bool detect_phrases = true;
        double synonym_weight = 0.8;
        size_t max_expansions = 5;
    };

    Result<ExpandedQuery> expand(
        const std::string& query,
        const ExpansionConfig& config
    );

    Result<std::string> correctSpelling(
        const std::string& query
    );

    Result<std::vector<std::string>> suggestAlternatives(
        const std::string& query
    );
};
```

Priority: Medium
Status: ✅ Delivered in v1.5.0 — see include/search/fuzzy_matcher.h
Enhanced fuzzy search with phonetic algorithms.
Features:
- Levenshtein distance for edit distance
- Soundex and Metaphone for phonetic matching
- N-gram matching
- Configurable fuzziness levels
- Performance optimization with BK-trees
Expected Performance:
- Fuzzy search overhead: <50% vs exact search
- Support 1-2 edit distance efficiently
Priority: Medium
Status: ✅ Delivered in v1.5.0 — see include/search/multi_modal_search.h
Search across text, images, and other modalities.
Features:
- Image search using CLIP embeddings
- Audio search for voice queries
- Cross-modal retrieval (text→image, image→text)
- Unified embedding space
Priority: Medium
Status: ✅ Delivered in v1.5.0 — see include/search/learning_to_rank.h
Machine learning-based result ranking.
Features:
- Training from click-through data
- Multiple ranking features
- Online learning and adaptation
- A/B testing framework
Priority: High
Status: ✅ Delivered in v1.5.0 — see include/search/search_analytics.h
Track and analyze search performance.
Features:
- Query log analysis
- Click-through rate tracking
- Zero-result query detection
- Search performance metrics
- User behavior analysis
Priority: High
Status: ✅ Delivered in v1.5.0 — see include/search/faceted_search.h
Multi-dimensional filtering and navigation.
Features:
- Dynamic facet generation
- Range facets (price, date)
- Hierarchical facets (categories)
- Facet count computation
- Drill-down navigation
Priority: Medium
Status: ✅ Delivered in v1.5.0 — see include/search/autocomplete.h
Real-time query suggestions.
Features:
- Prefix-based suggestions
- Popular query suggestions
- Personalized suggestions
- Context-aware completion
- Instant search results
- Query expansion overhead: <20% (in-memory synonym+spelling correction)
- Fuzzy search: Within 2x of exact search (Levenshtein wrapped over BM25 index)
- Faceted search: <50ms for 10 facets (secondary index range scans)
- Autocomplete latency: <10ms (prefix range scan + analytics top-queries)
- LTR inference: <1ms per query (linear dot product, 6 features)
- Multi-modal search: sub-millisecond RRF fusion (embeddings pre-computed by caller)
- Personalized autocomplete (per-user click history)
- Neural LTR (LambdaMART or small MLP) with offline batch training
- Multi-namespace VectorIndexManager (one instance per modality)
- Streaming result delivery for large k values
- Unit test coverage ≥ 80% for `HybridSearch`, `BM25Ranker`, `LlmQueryRewriter`, `FacetedSearch`, `LearningToRank`, `NeuralSparseRetriever`, and `CrossLingualSearch`
- BM25 correctness tests: compute BM25 scores for 10 hand-crafted query-document pairs and assert results match the reference formula within ± 0.001
- Hybrid recall@10 integration test: evaluate against a 1 M-document corpus; assert recall@10 ≥ 85% in HYBRID mode
- LLM query rewriter fallback test: simulate LLM timeout (> 200 ms); assert original query is used and no exception is thrown
- LTR online training test: feed 1,000 pairwise preference events; assert MRR@10 improves by ≥ 5% relative to the untrained model after 500 updates
- Faceted search tests: assert value-count facets are correct for 1,000 distinct values and numeric range buckets respect configured boundaries
- SPLADE neural sparse retrieval test: assert retrieved document IDs overlap ≥ 70% with BM25 top-20 on the BEIR TREC-COVID split
- Cross-lingual search tests: submit queries in 5 languages (EN, DE, FR, ES, ZH); assert top-1 result is in the correct language-agnostic answer set
- Hybrid search (BM25 + HNSW RRF) latency: ≤ 20 ms at p99 for top-10 results on an index of up to 10 M documents
- HNSW index build throughput: ≥ 10,000 vectors/s for 768-dimensional embeddings on a single CPU node
- BM25 ranking throughput: ≥ 50,000 queries/s for simple term queries against a 1 M-document index on a 16-core host
- LLM query rewriter overhead: ≤ 200 ms added latency at p99; 0 ms when LLM is unavailable (passthrough)
- Facet counting latency: ≤ 5 ms for up to 1,000 distinct facet values on a result set of 100,000 documents
- LTR re-ranking latency: ≤ 2 ms for re-ranking top-100 candidates with the 6-dimensional linear model
- SPLADE index memory: ≤ 4 GB for a 10 M-document corpus stored in CSR format
- Autocomplete suggestion latency: ≤ 5 ms at p99 for prefix queries against a 1 M-term dictionary
- Query injection prevention: all user-supplied query strings must be length-limited (≤ 4,096 bytes) and stripped of control characters before being passed to the BM25 tokenizer or LLM rewriter
- Resource exhaustion: `max_k` and `max_candidates` hard limits must be enforced before HNSW traversal begins; requests exceeding these limits return HTTP 400, not a partial result
- SPLADE vocabulary: the sparse vocabulary index must be loaded from a read-only memory-mapped file; writes to the vocabulary at runtime are rejected
- Cross-lingual model: the multilingual embedding model binary must be verified against a SHA-256 hash at startup; a hash mismatch aborts loading and falls back to the BM25-only mode
- `SearchAnalytics` query log must not store raw query strings containing PII; queries are hashed (SHA-256) before storage
- Distributed shard search must authenticate inter-shard requests with mTLS; unauthenticated shard calls are rejected with a 401 error
- README.md - Current module documentation
- Header Documentation - Public API
- Index Module - Index improvements
Last Updated: February 2026
Module Version: v1.7.0
Next Review: v1.8.0 Release