Benchmark Strategy: Obsidian Vault Intelligence
Goal
Implement a Standardized RAG Benchmark that replicates the "Benchmark B: Unified Corpus" methodology from obsidian-sonar. This allows direct performance comparison (Accuracy, Retrieval Quality, Latency) using the same dataset (CRAG).
Methodology: "Virtual Vault" Benchmarking
To test against the CRAG Unified Corpus (~60,000 documents) without flooding the user's actual Obsidian Vault with thousands of markdown files, we will implement a Virtual Indexing strategy.
1. Data Source (CRAG)
The benchmark requires two standardized files (compatible with obsidian-sonar format):
- `corpus.jsonl`: ~60k entries, one JSON object per line: `{ "url": "...", "content": "..." }`
- `queries.jsonl`: 100+ sampled queries: `{ "query": "...", "answer": "...", "gold_urls": [...] }`
Note: We will provide a script or instruction to download/generate these using the official CRAG scripts, ensuring 1:1 data parity.
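For concreteness, the two file formats above can be modeled with a small JSONL parser. This is a sketch: the field names are the ones quoted in this issue, but the parser itself is illustrative, not part of the proposal.

```typescript
// Expected CRAG file shapes (field names from this issue).
interface CorpusEntry { url: string; content: string; }
interface QueryEntry { query: string; answer: string; gold_urls: string[]; }

// Parse a JSONL string into typed entries, skipping blank lines.
function parseJsonl<T>(raw: string): T[] {
  return raw
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as T);
}
```

In practice the ~60k-line corpus should be read as a stream rather than loaded into one string, but the per-line parsing is the same.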
2. Implementation: BenchmarkService
A. Virtual Indexing (Ingestion)
Instead of creating real `TFile`s, the BenchmarkService will:
- Read `corpus.jsonl` as a stream.
- Feed the content directly into `GraphService`/`IndexerWorker` using Virtual Paths (e.g., `benchmark/crag/doc_123`).
- The `IndexerWorker` will treat these as valid indexed nodes, creating embeddings and keyword indices in the vector database.
- Optimization: This "Benchmarking Index" should strictly be temporary or separate from the main vault index to avoid polluting the user's personal graph. We will implement `GraphService.switchToBenchmarkMode()` for this.
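The ingestion loop above could look roughly like this. The path scheme and the `insertVirtualFile` name follow this proposal; the `pathToUrl` map is an assumption added here so retrieval results can later be matched against `gold_urls`.

```typescript
// Hypothetical interface for the piece of GraphService the loop needs.
interface VirtualIndexer {
  insertVirtualFile(path: string, content: string): void;
}

// Virtual path scheme proposed above: benchmark/crag/doc_<index>.
function virtualPath(index: number): string {
  return `benchmark/crag/doc_${index}`;
}

// Feed corpus entries into the indexer under virtual paths, returning a
// path -> url map so scoring can translate retrieved paths back to gold_urls.
function ingestCorpus(
  graph: VirtualIndexer,
  entries: { url: string; content: string }[],
): Map<string, string> {
  const pathToUrl = new Map<string, string>();
  entries.forEach((entry, i) => {
    const path = virtualPath(i);
    pathToUrl.set(path, entry.url); // remember mapping for gold_urls scoring
    graph.insertVirtualFile(path, entry.content);
  });
  return pathToUrl;
}
```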
B. Execution Loop
For each query in queries.jsonl:
- Retrieval: Call `SearchOrchestrator.search(query)`.
- Scoring (Retrieval):
  - Check if the retrieved virtual paths match the `gold_urls` from the dataset.
  - Calculate Recall@5, Recall@10, MRR.
- Generation (End-to-End) (Optional/Phase 2):
  - Send retrieved context to `GeminiService`.
  - Generate an answer.
  - Compare with Ground Truth using LLM-as-a-Judge (Accuracy).
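The retrieval metrics named above have standard definitions; a minimal sketch, assuming retrieved virtual paths have already been mapped back to corpus URLs:

```typescript
// Recall@k: fraction of gold documents that appear in the top-k results.
function recallAtK(retrieved: string[], gold: string[], k: number): number {
  if (gold.length === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  const hits = gold.filter((u) => topK.has(u)).length;
  return hits / gold.length;
}

// Reciprocal rank for one query: 1 / (rank of the first gold document),
// or 0 if no gold document was retrieved. MRR is the mean over all queries.
function reciprocalRank(retrieved: string[], gold: string[]): number {
  const goldSet = new Set(gold);
  const idx = retrieved.findIndex((u) => goldSet.has(u));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```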
3. Reporting
Generate a Benchmark_Results.md report that matches the obsidian-sonar format for direct comparison:
| Metric | Vault Intelligence | Obsidian Sonar (Ref) | Diff |
|---|---|---|---|
| Retrieval Recall@5 | [Result] | N/A | - |
| Indexing Time | [Time] | 6,245s | - |
| Query Latency | [Time] | 33.5s | - |
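Rendering the table above into `Benchmark_Results.md` is mechanical; a sketch (the reference column values come straight from this issue's table, everything else is filled in at run time):

```typescript
// One row of the comparison table: our metric value vs. the obsidian-sonar
// reference value quoted in this issue.
interface ResultRow { metric: string; result: string; ref: string; }

// Emit the markdown table in the Benchmark_Results.md format shown above.
function renderReport(rows: ResultRow[]): string {
  const header =
    "| Metric | Vault Intelligence | Obsidian Sonar (Ref) | Diff |\n|---|---|---|---|";
  const body = rows.map((r) => `| ${r.metric} | ${r.result} | ${r.ref} | - |`).join("\n");
  return `${header}\n${body}`;
}
```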
Proposed Changes
[NEW] src/services/BenchmarkService.ts
- `loadCorpus(path: string)`
- `runCRAGBenchmark()`
- `calculateMetrics()`
[MODIFY] src/services/GraphService.ts
- Add `insertVirtualFile(path, content)`: Allow indexing content without a physical `TFile`.
- Add `resetIndex(namespace)`: Ability to clear/switch indices.
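One possible shape for the namespaced index, as a rough sketch. The method names follow this proposal, but the in-memory `Map` store is purely illustrative; the real implementation would sit on the vector database.

```typescript
// Illustrative namespaced index: "vault" for the user's notes, "benchmark"
// for virtual CRAG documents, so benchmark data can be wiped independently.
class NamespacedIndex {
  private stores = new Map<string, Map<string, string>>();
  private active = "vault";

  switchToBenchmarkMode(): void {
    this.active = "benchmark";
  }

  insertVirtualFile(path: string, content: string): void {
    let store = this.stores.get(this.active);
    if (!store) {
      store = new Map();
      this.stores.set(this.active, store);
    }
    store.set(path, content);
  }

  // Cleanly wipe one namespace (e.g., "benchmark") without touching the vault.
  resetIndex(namespace: string): void {
    this.stores.delete(namespace);
  }

  count(namespace: string): number {
    return this.stores.get(namespace)?.size ?? 0;
  }
}
```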
[MODIFY] src/workers/indexer.worker.ts
- Ensure metadata/mtime checks can handle virtual inputs (timestamp = 0).
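A minimal sketch of the freshness check under the convention above, assuming virtual inputs carry `mtime = 0` as proposed:

```typescript
// A zero timestamp marks a virtual input: there is no file on disk to compare
// against, so it is always (re)indexed rather than skipped as "unchanged".
function needsReindex(mtime: number, lastIndexedMtime: number | undefined): boolean {
  if (mtime === 0) return true;                    // virtual input: always index
  if (lastIndexedMtime === undefined) return true; // never indexed before
  return mtime > lastIndexedMtime;                 // real file changed on disk
}
```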
User Review Required
- Resource Intensity: Indexing 60k documents (even virtually) takes significant time and RAM.
- Storage: The vector store will grow significantly. We must ensure we can cleanly wipe the benchmark data afterwards.