Benchmarks

Benchmark dataset adapters. Each benchmark implements the Benchmark interface.

Interface

interface Benchmark {
    name: string
    load(config?: BenchmarkConfig): Promise<void>
    getQuestions(filter?: QuestionFilter): UnifiedQuestion[]
    getHaystackSessions(questionId: string): UnifiedSession[]
    getGroundTruth(questionId: string): string
    getQuestionTypes(): QuestionTypeRegistry
}
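
As a rough illustration of how a caller might consume this interface, the sketch below pairs each question with its ground truth and haystack size. The field names on UnifiedQuestion, UnifiedSession, QuestionFilter, and BenchmarkConfig are assumptions made up for this example (the real definitions are not shown here), and collectEvalItems is a hypothetical helper, not part of the project.

```typescript
// Hypothetical minimal shapes; the project's real types may differ.
interface UnifiedQuestion { questionId: string; question: string; questionType: string }
interface UnifiedSession { sessionId: string; messages: { role: string; content: string }[] }
interface QuestionFilter { questionTypes?: string[] }
interface BenchmarkConfig { dataPath?: string }
type QuestionTypeRegistry = Record<string, { id: string; alias: string; description: string }>

// Same members as the interface documented above.
interface Benchmark {
    name: string
    load(config?: BenchmarkConfig): Promise<void>
    getQuestions(filter?: QuestionFilter): UnifiedQuestion[]
    getHaystackSessions(questionId: string): UnifiedSession[]
    getGroundTruth(questionId: string): string
    getQuestionTypes(): QuestionTypeRegistry
}

// Hypothetical record type produced by the helper below.
interface EvalItem { questionId: string; question: string; expected: string; sessionCount: number }

// Pair every (optionally filtered) question with its expected answer and
// the number of haystack sessions. Assumes load() has already resolved.
function collectEvalItems(benchmark: Benchmark, filter?: QuestionFilter): EvalItem[] {
    return benchmark.getQuestions(filter).map((q) => ({
        questionId: q.questionId,
        question: q.question,
        expected: benchmark.getGroundTruth(q.questionId),
        sessionCount: benchmark.getHaystackSessions(q.questionId).length,
    }))
}
```

A caller would typically await benchmark.load() once and then call the synchronous getters as often as needed.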

Adding a Benchmark

  1. Create src/benchmarks/mybenchmark/index.ts
  2. Implement the Benchmark interface in that file
  3. Register the new benchmark in src/benchmarks/index.ts
  4. Add its name to the BenchmarkName type in src/types/benchmark.ts

What each method must do:

  • load() - Parse data, populate internal maps
  • getQuestions() - Return UnifiedQuestion[] with filtering support
  • getHaystackSessions() - Return UnifiedSession[] for a question
  • getGroundTruth() - Return expected answer string
  • getQuestionTypes() - Return { [id]: { id, alias, description } }
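
The requirements above can be sketched as a small in-memory adapter. This is only an illustration under stated assumptions: the supporting type shapes, the RawItem record format, the "fact" question type, and the constructor that takes records directly are all hypothetical. A real adapter would presumably read its data from the benchmark's source inside load().

```typescript
// Hypothetical minimal shapes; the project's real types may differ.
interface UnifiedQuestion { questionId: string; question: string; questionType: string }
interface UnifiedSession { sessionId: string; messages: { role: string; content: string }[] }
interface QuestionFilter { questionTypes?: string[] }
interface BenchmarkConfig { dataPath?: string }
type QuestionTypeRegistry = Record<string, { id: string; alias: string; description: string }>

interface Benchmark {
    name: string
    load(config?: BenchmarkConfig): Promise<void>
    getQuestions(filter?: QuestionFilter): UnifiedQuestion[]
    getHaystackSessions(questionId: string): UnifiedSession[]
    getGroundTruth(questionId: string): string
    getQuestionTypes(): QuestionTypeRegistry
}

// Hypothetical raw record format for this example dataset.
interface RawItem { id: string; type: string; question: string; answer: string; sessions: UnifiedSession[] }

class MyBenchmark implements Benchmark {
    name = "mybenchmark"
    private raw: RawItem[]
    // Internal maps, keyed by question ID and populated by load().
    private questions = new Map<string, UnifiedQuestion>()
    private sessions = new Map<string, UnifiedSession[]>()
    private answers = new Map<string, string>()

    // Assumption: records are passed in directly instead of being read from disk.
    constructor(raw: RawItem[]) {
        this.raw = raw
    }

    // Parse the raw records and populate the internal maps.
    async load(_config?: BenchmarkConfig): Promise<void> {
        for (const item of this.raw) {
            this.questions.set(item.id, { questionId: item.id, question: item.question, questionType: item.type })
            this.sessions.set(item.id, item.sessions)
            this.answers.set(item.id, item.answer)
        }
    }

    // Return all questions, or only those whose type is listed in the filter.
    getQuestions(filter?: QuestionFilter): UnifiedQuestion[] {
        const all = Array.from(this.questions.values())
        if (!filter || !filter.questionTypes) return all
        const wanted = new Set(filter.questionTypes)
        return all.filter((q) => wanted.has(q.questionType))
    }

    // Unknown question IDs yield an empty haystack in this sketch.
    getHaystackSessions(questionId: string): UnifiedSession[] {
        return this.sessions.get(questionId) ?? []
    }

    // Throw on unknown IDs so a missing answer is never silently scored.
    getGroundTruth(questionId: string): string {
        const answer = this.answers.get(questionId)
        if (answer === undefined) throw new Error(`Unknown question: ${questionId}`)
        return answer
    }

    getQuestionTypes(): QuestionTypeRegistry {
        return { fact: { id: "fact", alias: "fact", description: "Example fact recall" } }
    }
}
```

Keeping the getters synchronous means all parsing cost is paid once in load(), which is why the sketch stores everything in maps keyed by question ID.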

Existing Benchmarks

| Benchmark | Source | Description |
| --- | --- | --- |
| locomo | GitHub snap-research/locomo | Long context memory benchmark |
| longmemeval | HuggingFace xiaowu0162/longmemeval-cleaned | Long-term memory evaluation |
| convomem | HuggingFace Salesforce/ConvoMem | Conversational memory benchmark |

Question Types

LoCoMo

| Type | Alias | Description |
| --- | --- | --- |
| single-hop | single | Single-hop fact recall |
| multi-hop | multi | Multi-hop reasoning |
| temporal | temporal | Temporal reasoning |
| world-knowledge | world | Commonsense knowledge |
| adversarial | adversarial | Unanswerable questions |

LongMemEval

| Type | Alias | Description |
| --- | --- | --- |
| single-session-user | ss-user | Single-session user facts |
| single-session-assistant | ss-asst | Single-session assistant facts |
| single-session-preference | ss-pref | Single-session preferences |
| multi-session | multi | Multi-session reasoning |
| temporal-reasoning | temporal | Temporal reasoning |
| knowledge-update | update | Knowledge update tracking |

ConvoMem

| Type | Alias | Description |
| --- | --- | --- |
| user_evidence | user | User-stated facts |
| assistant_facts_evidence | asst | Assistant-stated facts |
| preference_evidence | pref | User preferences |
| changing_evidence | change | Information updates |
| implicit_connection_evidence | implicit | Implicit reasoning |
| abstention_evidence | abstain | Unanswerable questions |
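
Because the aliases multi and temporal appear in both the LoCoMo and LongMemEval tables, an alias is only meaningful within a single benchmark's registry. The sketch below shows one way a caller might resolve either a full type ID or an alias against the registry returned by getQuestionTypes(). The ConvoMem entries are transcribed from the table above; resolveQuestionType is a hypothetical helper, not a function the project is known to provide.

```typescript
interface QuestionTypeInfo { id: string; alias: string; description: string }
type QuestionTypeRegistry = Record<string, QuestionTypeInfo>

// ConvoMem registry in the { [id]: { id, alias, description } } shape.
const convomemTypes: QuestionTypeRegistry = {
    user_evidence: { id: "user_evidence", alias: "user", description: "User-stated facts" },
    assistant_facts_evidence: { id: "assistant_facts_evidence", alias: "asst", description: "Assistant-stated facts" },
    preference_evidence: { id: "preference_evidence", alias: "pref", description: "User preferences" },
    changing_evidence: { id: "changing_evidence", alias: "change", description: "Information updates" },
    implicit_connection_evidence: { id: "implicit_connection_evidence", alias: "implicit", description: "Implicit reasoning" },
    abstention_evidence: { id: "abstention_evidence", alias: "abstain", description: "Unanswerable questions" },
}

// Hypothetical helper: map a type ID or alias to the canonical type ID,
// or return undefined when the registry has no match.
function resolveQuestionType(registry: QuestionTypeRegistry, nameOrAlias: string): string | undefined {
    // An exact ID match takes precedence over an alias match.
    if (Object.prototype.hasOwnProperty.call(registry, nameOrAlias)) return nameOrAlias
    for (const id of Object.keys(registry)) {
        if (registry[id].alias === nameOrAlias) return id
    }
    return undefined
}
```

For example, resolveQuestionType(convomemTypes, "abstain") yields "abstention_evidence", while an unrecognized name yields undefined so the caller can decide how to report it.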