
Benchmark: Implement Standardized CRAG Benchmark Suite #179

@cybaea

Description


Benchmark Strategy: Obsidian Vault Intelligence

Goal

Implement a standardized RAG benchmark that replicates the "Benchmark B: Unified Corpus" methodology from obsidian-sonar. This enables direct comparison of accuracy, retrieval quality, and latency against obsidian-sonar on the same dataset (CRAG).

Methodology: "Virtual Vault" Benchmarking

To test against the CRAG Unified Corpus (~60,000 documents) without flooding the user's actual Obsidian Vault with thousands of markdown files, we will implement a Virtual Indexing strategy.

1. Data Source (CRAG)

The benchmark requires two standardized files (compatible with obsidian-sonar format):

  • corpus.jsonl: ~60k entries; each line is { "url": "...", "content": "..." }.
  • queries.jsonl: 100+ sampled queries; each line is { "query": "...", "answer": "...", "gold_urls": [...] }.

Note: We will provide a script or instruction to download/generate these using the official CRAG scripts, ensuring 1:1 data parity.
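As a rough sketch of how the two files could be parsed, the following splits JSONL text into typed records. The type names (`CorpusEntry`, `QueryEntry`) and the helper `parseJsonl` are illustrative assumptions, not the final BenchmarkService API:

```typescript
// Assumed record shapes for the two benchmark files (see formats above).
interface CorpusEntry {
  url: string;
  content: string;
}

interface QueryEntry {
  query: string;
  answer: string;
  gold_urls: string[];
}

// Parse JSONL text: one JSON object per non-empty line.
// A real implementation would stream the ~60k-line corpus instead of
// holding it all in memory; this is a minimal sketch.
function parseJsonl<T>(text: string): T[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as T);
}
```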

2. Implementation: BenchmarkService

A. Virtual Indexing (Ingestion)

Instead of creating real TFiles, the BenchmarkService will:

  1. Read corpus.jsonl stream.
  2. Feed the content directly into GraphService / IndexerWorker using Virtual Paths (e.g., benchmark/crag/doc_123).
  3. The IndexerWorker will treat these as valid indexed nodes, creating embeddings and keyword indices in the vector database.
  4. Isolation: This benchmark index must be temporary or kept separate from the main vault index, so it never pollutes the user's personal graph. We will implement a GraphService.switchToBenchmarkMode() for this.
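The ingestion steps above need a deterministic mapping from CRAG URLs to virtual paths, so retrieval results can later be scored against gold_urls. A minimal sketch, assuming a `Map`-based registry (the path scheme `benchmark/crag/doc_<i>` is from this proposal; the function names are illustrative):

```typescript
const VIRTUAL_ROOT = "benchmark/crag";

// Deterministic virtual path for the i-th corpus document.
function virtualPathFor(docIndex: number): string {
  return `${VIRTUAL_ROOT}/doc_${docIndex}`;
}

// Build a registry so retrieved virtual paths can be mapped back to the
// original CRAG URLs during scoring.
function buildUrlRegistry(urls: string[]): Map<string, string> {
  const registry = new Map<string, string>();
  urls.forEach((url, i) => registry.set(virtualPathFor(i), url));
  return registry;
}
```

The registry would be populated while streaming corpus.jsonl, just before each entry is handed to the IndexerWorker.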

B. Execution Loop

For each query in queries.jsonl:

  1. Retrieval: Call SearchOrchestrator.search(query).
  2. Scoring (Retrieval):
    • Check if the retrieved virtual paths match the gold_urls from the dataset.
    • Calculate Recall@5, Recall@10, MRR.
  3. Generation (End-to-End) (Optional/Phase 2):
    • Send retrieved context to GeminiService.
    • Generate Answer.
    • Compare with Ground Truth using LLM-as-a-Judge (Accuracy).
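The retrieval-scoring step can be sketched as two pure functions. Recall@K here is the fraction of gold documents found in the top K results (one common definition; the spec may settle on a binary hit-rate instead), and MRR is the mean over all queries of the per-query reciprocal rank computed below:

```typescript
// Fraction of gold documents that appear in the top-k retrieved paths.
function recallAtK(retrieved: string[], gold: string[], k: number): number {
  if (gold.length === 0) return 0;
  const goldSet = new Set(gold);
  const hits = retrieved.slice(0, k).filter((p) => goldSet.has(p)).length;
  return hits / gold.length;
}

// Reciprocal rank of the first gold document in the ranked results.
// MRR = average of this value across all benchmark queries.
function reciprocalRank(retrieved: string[], gold: string[]): number {
  const goldSet = new Set(gold);
  const idx = retrieved.findIndex((p) => goldSet.has(p));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```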

3. Reporting

Generate a Benchmark_Results.md report that matches the obsidian-sonar format for direct comparison:

| Metric | Vault Intelligence | Obsidian Sonar (Ref) | Diff |
| --- | --- | --- | --- |
| Retrieval Recall@5 | [Result] | N/A | - |
| Indexing Time | [Time] | 6,245s | - |
| Query Latency | [Time] | 33.5s | - |
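Emitting the comparison table above as markdown for Benchmark_Results.md could look like the following. The row shape and function name are assumptions; the header columns and reference figures come from this proposal:

```typescript
interface ReportRow {
  metric: string;
  ours: string; // Vault Intelligence result
  sonar: string; // Obsidian Sonar reference value, or "N/A"
  diff: string;
}

// Render the results as a markdown table matching the obsidian-sonar layout.
function buildReportTable(rows: ReportRow[]): string {
  const header = "| Metric | Vault Intelligence | Obsidian Sonar (Ref) | Diff |";
  const separator = "| --- | --- | --- | --- |";
  const body = rows.map((r) => `| ${r.metric} | ${r.ours} | ${r.sonar} | ${r.diff} |`);
  return [header, separator, ...body].join("\n");
}
```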

Proposed Changes

[NEW] src/services/BenchmarkService.ts

  • loadCorpus(path: string)
  • runCRAGBenchmark()
  • calculateMetrics()

[MODIFY] src/services/GraphService.ts

  • Add insertVirtualFile(path, content): Allow indexing content without a physical TFile.
  • Add resetIndex(namespace): Ability to clear/switch indices.

[MODIFY] src/workers/indexer.worker.ts

  • Ensure metadata/mtime checks can handle virtual inputs (timestamp = 0).
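One way the worker's staleness check could accommodate virtual inputs is to treat mtime 0 as "immutable once indexed". This is a sketch under that assumption; the actual check lives in indexer.worker.ts and may differ:

```typescript
// Decide whether a document needs (re)indexing.
// storedMtime: last-indexed timestamp, undefined if never indexed.
// fileMtime: the file's mtime; virtual benchmark inputs use 0.
function needsReindex(storedMtime: number | undefined, fileMtime: number): boolean {
  if (storedMtime === undefined) return true; // never indexed yet
  if (fileMtime === 0) return false; // virtual input: content never changes
  return fileMtime > storedMtime; // physical file changed on disk
}
```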

User Review Required

  • Resource Intensity: Indexing 60k documents (even virtually) takes significant time and RAM.
  • Storage: The vector store will grow significantly. We must ensure we can cleanly wipe the benchmark data afterwards.
