Benchmark dataset adapters. Each benchmark implements the Benchmark interface.

    interface Benchmark {
      name: string
      load(config?: BenchmarkConfig): Promise<void>
      getQuestions(filter?: QuestionFilter): UnifiedQuestion[]
      getHaystackSessions(questionId: string): UnifiedSession[]
      getGroundTruth(questionId: string): string
      getQuestionTypes(): QuestionTypeRegistry
    }

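
Because `load()` is the only asynchronous method, a caller would presumably await it once and then read through the synchronous getters. The sketch below shows that call order against a stub. The field names on `UnifiedQuestion` and `UnifiedSession` are not shown in this README, so the minimal versions here are assumptions, as are the `collectAnswerKey` helper, the trimmed `BenchmarkLike` type, and the stub data.

```typescript
// Stand-in types: the real UnifiedQuestion / UnifiedSession are defined
// elsewhere in the repo; these field names are assumptions for illustration.
interface UnifiedQuestion { questionId: string; question: string }
interface UnifiedSession { sessionId: string }

// Only the part of the Benchmark interface this sketch needs.
interface BenchmarkLike {
  name: string;
  load(): Promise<void>;
  getQuestions(): UnifiedQuestion[];
  getHaystackSessions(questionId: string): UnifiedSession[];
  getGroundTruth(questionId: string): string;
}

interface AnswerKeyEntry { question: string; expected: string; sessionCount: number }

// Hypothetical helper: load once, then use the synchronous getters.
async function collectAnswerKey(benchmark: BenchmarkLike): Promise<AnswerKeyEntry[]> {
  await benchmark.load();
  return benchmark.getQuestions().map((q) => ({
    question: q.question,
    expected: benchmark.getGroundTruth(q.questionId),
    sessionCount: benchmark.getHaystackSessions(q.questionId).length,
  }));
}

// Stub with one hard-coded question, used only to exercise the helper.
const stubBenchmark: BenchmarkLike = {
  name: "stub",
  load: async () => {},
  getQuestions: () => [{ questionId: "q1", question: "Where does Ana live?" }],
  getHaystackSessions: () => [{ sessionId: "s1" }, { sessionId: "s2" }],
  getGroundTruth: () => "Lisbon",
};
```

Calling `collectAnswerKey(stubBenchmark)` resolves to a single entry pairing the question text with its expected answer and the number of haystack sessions.
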
- Create `src/benchmarks/mybenchmark/index.ts`
- Implement the `Benchmark` interface
- Register in `src/benchmarks/index.ts`
- Add to the `BenchmarkName` type in `src/types/benchmark.ts`
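
The last two steps can be pictured as a name union plus a lookup table. This README does not show the contents of `src/benchmarks/index.ts` or `src/types/benchmark.ts`, so the registry shape, the `createBenchmark` function, the `PlaceholderBenchmark` class, and the trimmed `Benchmark` type below are all assumptions. Only the benchmark names and the `BenchmarkName` type name come from this document.

```typescript
// Assumed shape of the union in src/types/benchmark.ts after adding "mybenchmark".
type BenchmarkName = "locomo" | "longmemeval" | "convomem" | "mybenchmark";

// Trimmed stand-in for the Benchmark interface (only what this sketch needs).
interface Benchmark {
  name: string;
  load(): Promise<void>;
}

// Placeholder adapter; a real one would implement every interface method.
class PlaceholderBenchmark implements Benchmark {
  readonly name: BenchmarkName;

  constructor(name: BenchmarkName) {
    this.name = name;
  }

  async load(): Promise<void> {
    // Dataset parsing would happen here.
  }
}

// Hypothetical registry. Typing it as Record<BenchmarkName, ...> makes the
// compiler report an error when a name is in the union but has no entry here.
const registry: Record<BenchmarkName, () => Benchmark> = {
  locomo: () => new PlaceholderBenchmark("locomo"),
  longmemeval: () => new PlaceholderBenchmark("longmemeval"),
  convomem: () => new PlaceholderBenchmark("convomem"),
  mybenchmark: () => new PlaceholderBenchmark("mybenchmark"),
};

// Hypothetical factory: look up the constructor by name and build the adapter.
function createBenchmark(name: BenchmarkName): Benchmark {
  return registry[name]();
}
```

Keying the registry by the union type is one way to keep the two files in sync: forgetting either step shows up as a type error rather than a runtime lookup failure.
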
Required returns:
- `load()` - Parse data, populate internal maps
- `getQuestions()` - Return `UnifiedQuestion[]` with filtering support
- `getHaystackSessions()` - Return `UnifiedSession[]` for a question
- `getGroundTruth()` - Return expected answer string
- `getQuestionTypes()` - Return `{ [id]: { id, alias, description } }`
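
A minimal in-memory adapter that follows the list above might look like the sketch below. The `QuestionTypeRegistry` layout follows the `{ [id]: { id, alias, description } }` shape given above; everything else is an assumption made for illustration: the fields of `UnifiedQuestion`, `UnifiedSession`, and `QuestionFilter`, the `RawRecord` input shape, and the choice to throw on an unknown question id.

```typescript
// Assumed shapes; this README only fixes the QuestionTypeRegistry layout.
interface UnifiedQuestion { questionId: string; question: string; questionType: string }
interface UnifiedSession { sessionId: string; messages: string[] }
interface QuestionFilter { questionTypes?: string[]; limit?: number }
interface QuestionTypeInfo { id: string; alias: string; description: string }
type QuestionTypeRegistry = Record<string, QuestionTypeInfo>;

// Hypothetical raw record, standing in for whatever the dataset file contains.
interface RawRecord {
  id: string;
  question: string;
  type: string;
  answer: string;
  sessions: UnifiedSession[];
}

// Mirrors the method names of the Benchmark interface.
class InMemoryBenchmark {
  name = "mybenchmark";
  private readonly records: RawRecord[];
  private questions: UnifiedQuestion[] = [];
  private sessionsById = new Map<string, UnifiedSession[]>();
  private answersById = new Map<string, string>();

  constructor(records: RawRecord[]) {
    this.records = records;
  }

  // Parse data and populate the internal maps (cleared first so reloads are safe).
  async load(): Promise<void> {
    this.questions = [];
    this.sessionsById.clear();
    this.answersById.clear();
    for (const r of this.records) {
      this.questions.push({ questionId: r.id, question: r.question, questionType: r.type });
      this.sessionsById.set(r.id, r.sessions);
      this.answersById.set(r.id, r.answer);
    }
  }

  // Filter by question type first, then apply the optional limit.
  getQuestions(filter?: QuestionFilter): UnifiedQuestion[] {
    const types = filter?.questionTypes;
    const limit = filter?.limit;
    const matching = types
      ? this.questions.filter((q) => types.includes(q.questionType))
      : [...this.questions];
    return limit === undefined ? matching : matching.slice(0, limit);
  }

  getHaystackSessions(questionId: string): UnifiedSession[] {
    return this.sessionsById.get(questionId) ?? [];
  }

  getGroundTruth(questionId: string): string {
    const answer = this.answersById.get(questionId);
    if (answer === undefined) throw new Error(`Unknown question: ${questionId}`);
    return answer;
  }

  getQuestionTypes(): QuestionTypeRegistry {
    return {
      "single-hop": { id: "single-hop", alias: "single", description: "Single-hop fact recall" },
      "multi-hop": { id: "multi-hop", alias: "multi", description: "Multi-hop reasoning" },
    };
  }
}
```

Keeping questions, sessions, and answers in separate maps keyed by question id lets each getter stay a constant-time lookup once `load()` has finished.
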
| Benchmark | Source | Description |
|-----------|--------|-------------|
| locomo | GitHub snap-research/locomo | Long context memory benchmark |
| longmemeval | HuggingFace xiaowu0162/longmemeval-cleaned | Long-term memory evaluation |
| convomem | HuggingFace Salesforce/ConvoMem | Conversational memory benchmark |

Question types for `locomo`:

| Type | Alias | Description |
|------|-------|-------------|
| single-hop | single | Single-hop fact recall |
| multi-hop | multi | Multi-hop reasoning |
| temporal | temporal | Temporal reasoning |
| world-knowledge | world | Commonsense knowledge |
| adversarial | adversarial | Unanswerable questions |

Question types for `longmemeval`:

| Type | Alias | Description |
|------|-------|-------------|
| single-session-user | ss-user | Single-session user facts |
| single-session-assistant | ss-asst | Single-session assistant facts |
| single-session-preference | ss-pref | Single-session preferences |
| multi-session | multi | Multi-session reasoning |
| temporal-reasoning | temporal | Temporal reasoning |
| knowledge-update | update | Knowledge update tracking |

Question types for `convomem`:

| Type | Alias | Description |
|------|-------|-------------|
| user_evidence | user | User-stated facts |
| assistant_facts_evidence | asst | Assistant-stated facts |
| preference_evidence | pref | User preferences |
| changing_evidence | change | Information updates |
| implicit_connection_evidence | implicit | Implicit reasoning |
| abstention_evidence | abstain | Unanswerable questions |
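
Aliases are only unique within one benchmark: `multi` is the alias for `multi-hop` in one table and for `multi-session` in another, and `temporal` also appears twice. Any alias lookup therefore has to be scoped to a single benchmark's registry. The sketch below shows one way to do that against a `getQuestionTypes()`-style registry. `resolveQuestionType` is a hypothetical helper, and the two registries copy only a few rows from the tables above.

```typescript
interface QuestionTypeInfo { id: string; alias: string; description: string }
type QuestionTypeRegistry = Record<string, QuestionTypeInfo>;

// A few rows copied from the tables above; not the full registries.
const hopStyleTypes: QuestionTypeRegistry = {
  "single-hop": { id: "single-hop", alias: "single", description: "Single-hop fact recall" },
  "multi-hop": { id: "multi-hop", alias: "multi", description: "Multi-hop reasoning" },
  temporal: { id: "temporal", alias: "temporal", description: "Temporal reasoning" },
};

const sessionStyleTypes: QuestionTypeRegistry = {
  "single-session-user": { id: "single-session-user", alias: "ss-user", description: "Single-session user facts" },
  "multi-session": { id: "multi-session", alias: "multi", description: "Multi-session reasoning" },
  "temporal-reasoning": { id: "temporal-reasoning", alias: "temporal", description: "Temporal reasoning" },
};

// Hypothetical helper: accept either a full type id or its alias and return
// the full id, or undefined when the registry knows neither.
function resolveQuestionType(
  registry: QuestionTypeRegistry,
  idOrAlias: string,
): string | undefined {
  if (Object.prototype.hasOwnProperty.call(registry, idOrAlias)) return idOrAlias;
  return Object.values(registry).find((t) => t.alias === idOrAlias)?.id;
}
```

With these registries, the same alias `multi` resolves to `multi-hop` against `hopStyleTypes` and to `multi-session` against `sessionStyleTypes`, which is why the registry is passed in explicitly rather than merged into one global map.
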