Implements LLM interaction storage and chain-of-thought features for ThemisDB. The module provides two inference engines: AsyncInferenceEngine, a lightweight async wrapper around a single LLM plugin, and InferenceEngineEnhanced, an enterprise multi-model engine with context caching, batch processing, and load balancing.
In scope: Async inference request management, priority queue, context caching (KV-cache reuse), dynamic batching, multi-model load balancing, InferenceHandle for request tracking.
Out of scope: LLM model weights and serving (external), prompt template management (handled by prompt_engineering module), RAG pipeline orchestration (handled by rag module).
- `src/llm/async_inference_engine.cpp` — lightweight async inference wrapper
- `src/llm/inference_engine_enhanced.cpp` — enterprise multi-model engine
- `include/llm/inference_handle.h` — async request handle
- `src/server/llm_api_handler.cpp` — API integration point
Maturity: 🟢 Production-ready (v1.16.0) — Both inference engines operational; streaming SSE output, OpenAI-compatible adapter, speculative decoding, LoRA hot-loading, model quantization pipeline, and request deduplication cache are all complete.
ThemisDB provides two distinct inference engines serving different purposes:
**AsyncInferenceEngine**
- Purpose: Lightweight async wrapper for a single LLM plugin
- Use Case: Simple API endpoints, background inference tasks
- Features:
- Non-blocking request submission
- Priority queue management
- Worker thread pool
- Backpressure handling
- Location: `src/llm/async_inference_engine.cpp`
- Usage: Server API handlers (`src/server/llm_api_handler.cpp`)
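The submission path above can be sketched as a bounded priority queue drained by a worker pool. This is an illustrative sketch only, assuming hypothetical names and a stand-in for the real LLM plugin call; it is not ThemisDB's actual class.

```cpp
#include <condition_variable>
#include <future>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

struct Request {
    int priority;                        // higher values run first
    std::string prompt;
    std::promise<std::string> result;
};

struct ByPriority {
    bool operator()(const Request* a, const Request* b) const {
        return a->priority < b->priority;
    }
};

class AsyncInferenceEngine {
public:
    AsyncInferenceEngine(size_t workers, size_t max_queue)
        : max_queue_(max_queue) {
        for (size_t i = 0; i < workers; ++i)
            pool_.emplace_back([this] { run(); });
    }
    ~AsyncInferenceEngine() {
        { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : pool_) t.join();
    }
    // Non-blocking submission; throws when the queue is full (backpressure).
    std::future<std::string> submit(std::string prompt, int priority) {
        auto* req = new Request{priority, std::move(prompt), {}};
        auto fut = req->result.get_future();
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (queue_.size() >= max_queue_) {
                delete req;
                throw std::runtime_error("queue full");
            }
            queue_.push(req);
        }
        cv_.notify_one();
        return fut;
    }

private:
    void run() {
        for (;;) {
            Request* req;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty()) return;
                req = queue_.top();
                queue_.pop();
            }
            // Stand-in for the actual LLM plugin invocation.
            req->result.set_value("echo: " + req->prompt);
            delete req;
        }
    }
    size_t max_queue_;
    bool stop_ = false;
    std::mutex mu_;
    std::condition_variable cv_;
    std::priority_queue<Request*, std::vector<Request*>, ByPriority> queue_;
    std::vector<std::thread> pool_;
};
```

Callers hold the returned `std::future`, so the server thread never blocks on inference itself; the bounded queue is what turns overload into an immediate, observable error instead of unbounded memory growth.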
**InferenceEngineEnhanced**
- Purpose: Advanced multi-model engine with optimization features
- Use Case: RAG systems, production deployments, high-throughput scenarios
- Features:
- Context Caching: KV-cache reuse for faster inference
- Batch Processing: Dynamic batching for improved throughput
- Load Balancing: Multi-model request distribution
- Request Queuing: Priority scheduling with timeouts
- Location: `src/llm/inference_engine_enhanced.cpp`
- Usage: RAG integration (`src/rag/llm_integration.cpp`)
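Of the features above, context caching is the least self-explanatory: one common approach is to remember which prompt prefixes have already been prefilled into the KV-cache and skip those tokens on the next matching request. The sketch below illustrates only that bookkeeping; the `ContextCache` and `CachedContext` names are hypothetical, and the actual KV tensors are elided.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>

struct CachedContext {
    size_t tokens_prefilled;  // KV tensors for these tokens would live here
};

class ContextCache {
public:
    // Returns how many leading tokens can be skipped for `prompt` because an
    // earlier request already prefilled an identical prefix.
    size_t reusable_prefix(const std::string& prompt) const {
        size_t best = 0;
        for (const auto& [prefix, ctx] : cache_)
            if (prompt.rfind(prefix, 0) == 0)  // prompt starts with prefix
                best = std::max(best, ctx.tokens_prefilled);
        return best;
    }
    void store(const std::string& prompt, size_t tokens) {
        cache_[prompt] = CachedContext{tokens};
    }

private:
    std::unordered_map<std::string, CachedContext> cache_;
};
```

This pays off most in RAG workloads, where many requests share a long system prompt and only the retrieved context and question differ.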
**InferenceHandle**
- Purpose: Common handle for tracking async inference requests
- Location: `include/llm/inference_handle.h`
- Features:
  - Blocking wait for results (`get()`)
  - Non-blocking status check (`ready()`)
  - Best-effort cancellation (`cancel()`)
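A minimal shape for such a handle, assuming it wraps a `std::shared_future` plus a shared cancellation flag; the real header may differ in both layout and types.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <memory>
#include <string>

class InferenceHandle {
public:
    InferenceHandle(std::shared_future<std::string> f,
                    std::shared_ptr<std::atomic<bool>> cancel_flag)
        : fut_(std::move(f)), cancel_(std::move(cancel_flag)) {}

    // Blocking wait for the final result.
    std::string get() { return fut_.get(); }

    // Non-blocking readiness check.
    bool ready() const {
        return fut_.wait_for(std::chrono::seconds(0)) ==
               std::future_status::ready;
    }

    // Best-effort cancellation: the worker polls this flag between tokens,
    // so an in-flight generation may still finish its current step.
    void cancel() { cancel_->store(true); }

private:
    std::shared_future<std::string> fut_;
    std::shared_ptr<std::atomic<bool>> cancel_;
};
```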
Initially, this appeared to be code duplication. Investigation revealed:
- Different Abstraction Levels:
- AsyncInferenceEngine: Simple async wrapper (single model)
- InferenceEngineEnhanced: Enterprise orchestrator (multi-model)
- Different Use Cases:
- Simple API calls → AsyncInferenceEngine
- Complex RAG pipelines → InferenceEngineEnhanced
- Minimal Overlap:
- Both implement worker threads (necessary for each)
- Different queue strategies (priority vs. batch)
- Different statistics tracking (basic vs. advanced)
Problem: InferenceEngineEnhanced included `async_inference_engine.h` but used only InferenceHandle from it.
Solution: Extracted InferenceHandle to separate header
- Created: `include/llm/inference_handle.h`
- Created: `src/llm/inference_handle.cpp`
- Removed the unnecessary cross-dependency
- Both engines now depend only on the shared handle
This clarifies that both engines are independent implementations serving different needs.
- LLM interaction storage
- Prompt and response tracking
- Chain-of-thought storage
- Conversation history management
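A plausible record layout for the storage capabilities above might look like the following. Every type and field name here is illustrative, not ThemisDB's actual schema; the embedding field reflects the vector-search integration mentioned later.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct ReasoningStep {            // one intermediate chain-of-thought step
    uint32_t index;
    std::string thought;
};

struct LlmInteraction {
    std::string conversation_id;  // groups turns of a multi-turn conversation
    uint32_t    turn;             // position within the conversation
    std::string prompt;           // prompt tracking
    std::string response;         // response tracking
    std::vector<ReasoningStep> chain_of_thought;
    std::vector<float> embedding; // enables semantic retrieval via vector search
};
```

Storing reasoning steps as an ordered list alongside the final response keeps the chain inspectable after the fact without reparsing model output.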
Status: Fully implemented with runtime API detection
The LLM module includes complete implementation for grammar-constrained generation (EBNF/GBNF format), which guarantees valid structured outputs. This feature uses runtime API detection similar to LoRA adapters.
How It Works:
- On first use, the system detects if llama.cpp has grammar APIs available
- If available: Full grammar-constrained generation is enabled
- If not available: System falls back gracefully to unconstrained generation
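The detection step described above can be sketched with `dlsym`, which is the standard mechanism for probing whether a symbol exists in the loaded process image. The symbol name below comes from the "Required APIs" list later in this document; the lookup itself is an illustrative assumption, not ThemisDB's actual adapter code.

```cpp
#include <dlfcn.h>

// Returns true when the llama.cpp grammar API is present in this process.
// Hypothetical sketch: the real adapter lives in
// src/llm/llama_grammar_adapter.cpp and may probe differently.
bool grammar_api_available() {
    // Search the already-loaded process image (llama.cpp is linked in).
    void* self = dlopen(nullptr, RTLD_LAZY);
    if (!self) return false;
    void* sym = dlsym(self, "llama_grammar_init");
    dlclose(self);
    return sym != nullptr;
}
```

Because the probe runs at startup rather than compile time, updating the linked llama.cpp enables grammar support without rebuilding ThemisDB, which matches the "no rebuild needed" behavior described below.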
Implementation Details:
- Fallback log message: "Grammar support is unavailable (llama grammar API not present)"
- Location: `Grammar::compile()` in `src/llm/grammar.cpp`
- Dynamic API loading: `src/llm/llama_grammar_adapter.cpp`
Required APIs from llama.cpp:
- `llama_grammar_init()` — compile EBNF to grammar
- `llama_grammar_free()` — free grammar resources
- `llama_grammar_sample()` — filter tokens by grammar rules
- `llama_grammar_accept()` — update grammar state after token generation
Runtime Detection:
- Uses `themis_llama_grammar_available()` to check API availability
- Automatically activates when llama.cpp has grammar support
- Graceful fallback with informative logging if not available
- No rebuild needed when llama.cpp is updated
Usage:
```cpp
// Grammar support is automatically detected and used
Grammar grammar(ebnf_text, "root");
if (grammar.isValid()) {
    // Grammar APIs are available and working
} else {
    // APIs not available; unconstrained generation will be used
}
```

See Also:
- `docs/GRAMMAR_IMPLEMENTATION_COMPLETE.md` — full grammar documentation
- `docs/LLM_IMPLEMENTATION_COMPLETE.md` — LLM implementation status (100%)
- `docs/LLM_CORE_STATUS_MASTER.md` — master status document
- Store LLM interactions and conversations
- Track reasoning chains and intermediate steps
- Support for multi-turn conversations
- Integration with vector search for semantic retrieval
For background on the underlying LLM techniques, see:
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008. https://arxiv.org/abs/1706.03762
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … Amodei, D. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS), 33, 1877–1901. https://arxiv.org/abs/2005.14165
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., … Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS), 35. https://arxiv.org/abs/2201.11903
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, 4171–4186. https://doi.org/10.18653/v1/N19-1423