This document covers planned enhancements to ThemisDB's prompt engineering subsystem, which manages LLM prompt templates, chain-of-thought construction, RAG prompt assembly, and system prompt versioning. It targets the gap between the current production-ready state of the core components and advanced future capabilities such as a typed DSL, CoT tracing, and automated quality regression.
- Prompt templates must be versioned and immutable once published; changes must produce a new version tracked by `prompt_version_control.cpp`.
- The module must not introduce hard dependencies on a specific LLM provider; model-specific behaviour must be encapsulated behind the `IPromptRenderer` interface.
- Prompt construction latency must not exceed 5 ms (P99) on the critical inference path to stay within the latency budget of the RAG pipeline.
- Prompt content that may include PII must be routed through `utils/pii_detector.cpp` before transmission to any external endpoint.
| Interface | Consumer | Notes |
|---|---|---|
| `PromptManager::render(template_id, context)` | `prompt_engineering_integration.cpp`, RAG module | Returns rendered string; throws on missing variables |
| `PromptVersionControl::publish(template)` | CI/CD pipeline, admin API | Immutable publish; version hash stored in metadata DB |
| `PromptEvaluator::score(prompt, response)` | `feedback_collector.cpp`, `self_improvement_orchestrator.cpp` | Returns quality score 0–1 |
| `PromptPerformanceTracker::record()` | `prompt_engineering_metrics.cpp` | Per-template token count, latency, cost |
| `MetaPromptGenerator::generate(task_spec)` | `self_improvement_orchestrator.cpp` | Synthesises new templates from task descriptions |
Priority: High | Target Version: v0.9.0
Replace ad-hoc string interpolation in prompt_manager.cpp with a typed template DSL that supports typed variable slots (string, list, document-chunk), conditional blocks, and loop constructs. The DSL compiles to a CompiledPromptTemplate object that is validated at publish time rather than at render time.
Implementation Notes:
- Add `prompt_template_compiler.cpp` with a recursive-descent parser for the DSL; expose `CompiledPromptTemplate::render(PromptContext&)`.
- Update `prompt_version_control.cpp` to store the compiled AST alongside the raw template source for faster re-render.
- `prompt_manager.cpp` must fall back to legacy string substitution for templates without a DSL version field (backward compatibility).
- Integrate variable-type validation with `utils/input_validator.cpp` to catch mismatched context variables at publish time.
Performance Targets:
- Template compilation (publish time): <50 ms for a 4 KB template.
- Compiled template render latency: <1 ms P99 for a 2 KB context.
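To make the render-time contract concrete, the sketch below shows a minimal slot-substitution renderer that throws on a missing context variable, as `PromptManager::render` does today. The `{{name}}` slot syntax, the function name, and the flat string-to-string context are all illustrative assumptions; the real DSL adds typed slots, conditionals, and loops, and moves this validation to publish time.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical minimal renderer: substitutes {{name}} slots from a context
// map and throws if a slot has no matching variable. In the compiled DSL,
// the slot list is known after compilation, so this check can run at
// publish time instead of on the hot render path.
std::string renderTemplate(const std::string& tmpl,
                           const std::map<std::string, std::string>& ctx) {
    std::string out;
    size_t pos = 0;
    while (pos < tmpl.size()) {
        size_t open = tmpl.find("{{", pos);
        if (open == std::string::npos) { out += tmpl.substr(pos); break; }
        out += tmpl.substr(pos, open - pos);
        size_t close = tmpl.find("}}", open);
        if (close == std::string::npos)
            throw std::runtime_error("unterminated slot");
        std::string name = tmpl.substr(open + 2, close - open - 2);
        auto it = ctx.find(name);
        if (it == ctx.end())
            throw std::runtime_error("missing variable: " + name);
        out += it->second;  // typed slots would render per SlotType here
        pos = close + 2;
    }
    return out;
}
```

Because the compiler extracts the slot list up front, a publish-time pass over the template's declared variables can reject a mismatched context schema before any request ever reaches `render()`.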
Priority: High | Target Version: v0.9.0
Instrument chain-of-thought prompt construction so that each reasoning step is individually traced via utils/tracing.cpp. This enables offline analysis of which CoT steps contribute to answer quality and which introduce hallucination risk in the legal-domain context.
Implementation Notes:
- Extend `prompt_engineering_integration.cpp` with a `CoTTraceCollector` that emits one OpenTelemetry span per CoT step, carrying `step_index`, `token_count`, and `template_id` attributes.
- Wire `CoTTraceCollector` into the existing `utils/tracing.cpp` span context so traces propagate correctly through the RAG pipeline.
- `prompt_performance_tracker.cpp` must aggregate per-step token counts for cost attribution by legal matter ID.
- Store CoT traces in the timeseries module (`timeseries/tsstore.h`) for retention and aggregation.
Performance Targets:
- Tracing overhead per CoT step: <0.2 ms.
- Trace storage: <500 bytes per step after ZSTD compression (`utils/zstd_codec.cpp`).
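The per-step aggregation the tracker needs can be sketched without the tracing machinery: given the step records carried on the spans, sum token counts per template for cost attribution. `CoTStep` and `tokensByTemplate` are hypothetical names; the real spans are emitted and consumed through `utils/tracing.cpp` and `prompt_performance_tracker.cpp`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative record mirroring the span attributes listed above.
struct CoTStep {
    int step_index;
    int token_count;
    std::string template_id;
};

// Sums token counts per template ID across all traced CoT steps,
// giving the per-template totals used for cost attribution.
std::map<std::string, int> tokensByTemplate(const std::vector<CoTStep>& steps) {
    std::map<std::string, int> totals;
    for (const auto& s : steps) totals[s.template_id] += s.token_count;
    return totals;
}
```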
Priority: Medium | Target Version: v0.10.0
Build a regression harness around prompt_evaluator.cpp that runs on every template version publish to detect quality regressions. The harness compares the new template's PromptEvaluator score against the previous published version on a fixed golden-set of legal-domain prompts.
Implementation Notes:
- Add `prompt_regression_runner.cpp` that loads golden-set fixtures from `tests/prompt_engineering/golden/` and calls `PromptEvaluator::score()` for each.
- Integrate with `feedback_collector.cpp` to pull real-world human-feedback scores as an additional regression signal.
- Block publish in `prompt_version_control.cpp` if the mean regression-suite score drops >5% vs. the current published version.
- Emit regression results as structured log entries via `utils/logger.cpp` with template ID, version, and per-fixture delta scores.
Performance Targets:
- Full regression suite (100 golden prompts) runtime: <60 s against a mock LLM stub.
- False-positive regression block rate: <2% over a 30-day rolling window.
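The publish gate itself is a one-line comparison once the golden-set scores are collected. The sketch below assumes the >5% relative threshold stated above; the function names are illustrative, and the real check would sit in `prompt_version_control.cpp`'s publish path.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Mean PromptEvaluator score over a golden-set run (scores in [0, 1]).
double meanScore(const std::vector<double>& scores) {
    return scores.empty()
        ? 0.0
        : std::accumulate(scores.begin(), scores.end(), 0.0) / scores.size();
}

// Publish is allowed only if the candidate's mean score has not dropped
// more than 5% relative to the currently published version's mean.
bool mayPublish(const std::vector<double>& current,
                const std::vector<double>& candidate) {
    return meanScore(candidate) >= meanScore(current) * 0.95;
}
```

A relative (rather than absolute) threshold keeps the gate meaningful across templates whose baseline scores differ widely.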
Priority: High | Target Version: v0.9.0
Add a ContextWindowBudgetManager to prompt_engineering_integration.cpp that enforces per-model token limits. It ranks retrieved document chunks by relevance score, then greedily packs chunks until the token budget is reached, ensuring the system prompt and CoT scaffolding always fit.
Implementation Notes:
- Implement `ContextWindowBudgetManager` in a new `context_window_manager.cpp`; consume the model's `max_tokens` from the `meta_prompt_generator.cpp` model registry.
- Token counting must use the model's actual tokenizer (tiktoken-compatible BPE); add `tokenizer_bridge.cpp` wrapping a shared library call.
- Expose a `PromptBudgetExceededError` structured error for callers to handle gracefully (e.g., reduce chunk count).
- Track budget utilisation per request in `prompt_engineering_metrics.cpp` for capacity planning.
Performance Targets:
- Budget computation for 20 candidate chunks: <2 ms P99.
- Token counting via BPE bridge: <0.5 ms per 512-token chunk.
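The greedy packing step described above reduces to: sort candidates by relevance, then admit each chunk that still fits the remaining budget. The sketch below uses hypothetical `Chunk`/`packChunks` names and pre-counted token sizes; in the real manager, sizes come from the BPE tokenizer bridge and the budget excludes the system prompt and CoT scaffolding.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Candidate chunk with its retrieval relevance and BPE token count.
struct Chunk {
    double relevance;
    int tokens;
};

// Sorts by descending relevance, then greedily admits chunks while they
// fit the remaining token budget; oversized chunks are skipped, not split.
std::vector<Chunk> packChunks(std::vector<Chunk> chunks, int budget) {
    std::sort(chunks.begin(), chunks.end(),
              [](const Chunk& a, const Chunk& b) {
                  return a.relevance > b.relevance;
              });
    std::vector<Chunk> packed;
    int used = 0;
    for (const auto& c : chunks) {
        if (used + c.tokens <= budget) {
            packed.push_back(c);
            used += c.tokens;
        }
    }
    return packed;
}
```

Note that skipping (rather than splitting) an oversized chunk keeps chunk boundaries intact; a caller that receives too few chunks can react to `PromptBudgetExceededError` by re-retrieving with smaller chunk sizes.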
Priority: Medium | Target Version: v1.0.0
Extend prompt_version_control.cpp and prompt_optimizer.cpp to support traffic-split A/B experiments between prompt template versions. Experiment assignment is deterministic per request_id (hash-based) and results feed into feedback_collector.cpp for statistical significance testing.
Implementation Notes:
- Add a `PromptExperiment` entity to `prompt_version_control.cpp`: stores control/treatment template IDs, traffic-split ratio, and start/end timestamps.
- `prompt_manager.cpp::render()` accepts an optional `experiment_context` struct; if present, selects the variant via `murmur3(request_id) % 100 < split_pct`.
- `self_improvement_orchestrator.cpp` consumes experiment outcome data to auto-promote the winning variant after reaching statistical significance (p < 0.05, min 200 samples).
- Emit per-variant metrics to `prompt_performance_tracker.cpp` tagged with `experiment_id` and `variant`.
Performance Targets:
- Variant selection overhead: <0.1 ms per request.
- Minimum detectable effect size at p < 0.05 with 200 samples: 10% relative score improvement.
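Deterministic assignment means the same `request_id` always lands in the same arm, so retries and downstream metric joins stay consistent. The sketch below substitutes FNV-1a for the murmur3 call named above (purely for self-containment; the bucketing logic is the same) and uses illustrative function names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 32-bit FNV-1a, standing in here for murmur3: any stable, well-mixed
// hash works, as long as every node uses the same one.
uint32_t fnv1a(const std::string& s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// A request is assigned to the treatment arm when its hash bucket (0-99)
// falls below the experiment's configured split percentage.
bool inTreatment(const std::string& request_id, uint32_t split_pct) {
    return fnv1a(request_id) % 100 < split_pct;
}
```

Hashing the request ID rather than sampling randomly also makes assignment reproducible for offline analysis: the experiment arm can be recomputed from logs without storing it per request.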
| Test Type | Coverage Target | Notes |
|---|---|---|
| Unit | >85% new code | Cover DSL compiler, ContextWindowBudgetManager, experiment variant selection |
| Integration | All prompt lifecycle stages | Template compile → publish → render → evaluate → feedback loop |
| Regression | 100% golden-set prompts | Run on every template publish via prompt_regression_runner.cpp |
| Performance | P99 < budgets above | Micro-benchmark render, token counting, and CoT tracing paths |
| Metric | Current | Target | Method |
|---|---|---|---|
| `render()` latency P99 | ~8 ms (string interp) | <1 ms | Compiled DSL + warm context cache |
| Context packing for 20 chunks | N/A | <2 ms | ContextWindowBudgetManager microbenchmark |
| End-to-end prompt assembly (RAG) | ~15 ms | <5 ms | Trace span aggregation in utils/tracing.cpp |
| `PromptEvaluator::score()` throughput | N/A | >500 req/s | Batch scoring with mock LLM stub |
| Template publish (compile + validate) | N/A | <50 ms | DSL compiler benchmark on 4 KB template |
- Rendered prompts containing fields derived from user input must pass through `utils/pii_detector.cpp` before transmission; any detected PII must be pseudonymized via `utils/pii_pseudonymizer.cpp`.
- Prompt templates loaded from external sources must be stored with an integrity hash in `prompt_version_control.cpp`; a hash mismatch at render time must abort execution and emit an audit event via `utils/audit_logger.cpp`.
- The A/B experimentation framework must not leak experiment assignments across tenant boundaries; `experiment_context` must be scoped to a single tenant ID.
- [?] Clarify whether chain-of-thought traces containing legal case content are subject to e-discovery retention requirements before enabling long-term storage.
- `ContextWindowBudgetManager` must enforce a hard maximum token cap regardless of the model-reported limit, to prevent prompt injection via oversized context chunks.
The planned enhancements are grounded in the following peer-reviewed literature and industry research:
[1] L.-H. Beurer-Kellner et al., "Prompting Is Programming: A Query Language for Large Language Models," in Proc. PLDI 2023, pp. 1946–1969, 2023. [DOI: 10.1145/3591300] Available: https://arxiv.org/abs/2212.06094
[2] A. Dohan et al., "Language Model Cascades," arXiv preprint arXiv:2207.10342, 2022. Available: https://arxiv.org/abs/2207.10342
[3] J. White et al., "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT," arXiv preprint arXiv:2302.11382, 2023. Available: https://arxiv.org/abs/2302.11382
[4] J. Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," in Proc. NeurIPS, vol. 35, pp. 24824–24837, 2022. Available: https://arxiv.org/abs/2201.11903
[5] X. Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models," in Proc. ICLR 2023, 2023. Available: https://arxiv.org/abs/2203.11171
[6] T. Kojima et al., "Large Language Models are Zero-Shot Reasoners," in Proc. NeurIPS, vol. 35, pp. 22199–22213, 2022. Available: https://arxiv.org/abs/2205.11916
[7] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in Proc. NeurIPS, vol. 33, pp. 9459–9474, 2020. Available: https://arxiv.org/abs/2005.11401
[8] Y. Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv preprint arXiv:2312.10997, 2023. Available: https://arxiv.org/abs/2312.10997
[9] Z. Shi et al., "REPLUG: Retrieval-Augmented Black-Box Language Models," arXiv preprint arXiv:2301.12652, 2023. Available: https://arxiv.org/abs/2301.12652
[10] Y. Zhou et al., "Large Language Models Are Human-Level Prompt Engineers," in Proc. ICLR 2023, 2023. Available: https://arxiv.org/abs/2211.01910
[11] R. Pryzant et al., "Automatic Prompt Optimization with 'Gradient Descent' and Beam Search," in Proc. EMNLP 2023, pp. 7957–7968, 2023. Available: https://arxiv.org/abs/2305.03495
[12] C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," in Proc. Workshop on Text Summarization Branches Out, pp. 74–81, 2004. Available: https://aclanthology.org/W04-1013
[13] K. Papineni et al., "BLEU: A Method for Automatic Evaluation of Machine Translation," in Proc. ACL 2002, pp. 311–318, 2002. [DOI: 10.3115/1073083.1073135]
[14] L. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Proc. NeurIPS, vol. 36, 2023. Available: https://arxiv.org/abs/2306.05685
[15] K. Greshake et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection," in Proc. AISec@CCS 2023, pp. 79–90, 2023. Available: https://arxiv.org/abs/2302.12173
[16] F. Perez and I. Ribeiro, "Ignore Previous Prompt: Attack Techniques For Language Models," arXiv preprint arXiv:2211.09527, 2022. Available: https://arxiv.org/abs/2211.09527