Benchmarks

Adapters for benchmark datasets. Each adapter implements the Benchmark interface.

Interface

interface Benchmark {
    name: string
    load(config?: BenchmarkConfig): Promise<void>
    getQuestions(filter?: QuestionFilter): UnifiedQuestion[]
    getHaystackSessions(questionId: string): UnifiedSession[]
    getGroundTruth(questionId: string): string
    getQuestionTypes(): QuestionTypeRegistry
}

Adding a Benchmark

Create src/benchmarks/mybenchmark/index.ts
Implement the Benchmark interface in that file
Register the new benchmark in src/benchmarks/index.ts
Add its name to the BenchmarkName type in src/types/benchmark.ts

What each method must do:

load() - Parse data, populate internal maps
getQuestions() - Return UnifiedQuestion[] with filtering support
getHaystackSessions() - Return UnifiedSession[] for a question
getGroundTruth() - Return expected answer string
getQuestionTypes() - Return { [id]: { id, alias, description } }
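
The method contracts listed above can be sketched as a small in-memory adapter. This is a minimal illustration, not the project's actual code: the field names on UnifiedQuestion, UnifiedSession, QuestionFilter and BenchmarkConfig are assumptions (the real definitions live in the project's type files), and the SAMPLE data and the MyBenchmark class are invented for the example.

```typescript
// Hypothetical shapes: the real types may carry different or additional fields.
interface UnifiedQuestion { questionId: string; question: string; questionType: string }
interface UnifiedSession { sessionId: string; messages: { role: string; content: string }[] }
interface QuestionFilter { questionTypes?: string[]; limit?: number }
interface BenchmarkConfig { dataPath?: string }
type QuestionTypeRegistry = Record<string, { id: string; alias: string; description: string }>

// The interface as documented in this README.
interface Benchmark {
    name: string
    load(config?: BenchmarkConfig): Promise<void>
    getQuestions(filter?: QuestionFilter): UnifiedQuestion[]
    getHaystackSessions(questionId: string): UnifiedSession[]
    getGroundTruth(questionId: string): string
    getQuestionTypes(): QuestionTypeRegistry
}

// Invented inline sample standing in for a parsed dataset file.
const SAMPLE = [
    {
        id: "q1",
        question: "Where does Ada work?",
        type: "fact",
        answer: "At the observatory",
        sessions: [
            { sessionId: "s1", messages: [{ role: "user", content: "I work at the observatory." }] },
        ],
    },
]

class MyBenchmark implements Benchmark {
    name = "mybenchmark"
    // Internal maps keyed by question id, populated by load().
    private questions = new Map<string, UnifiedQuestion>()
    private sessions = new Map<string, UnifiedSession[]>()
    private answers = new Map<string, string>()

    async load(_config?: BenchmarkConfig): Promise<void> {
        // A real adapter would download or read the dataset and parse it here.
        for (const item of SAMPLE) {
            this.questions.set(item.id, {
                questionId: item.id,
                question: item.question,
                questionType: item.type,
            })
            this.sessions.set(item.id, item.sessions)
            this.answers.set(item.id, item.answer)
        }
    }

    getQuestions(filter?: QuestionFilter): UnifiedQuestion[] {
        let result = Array.from(this.questions.values())
        if (filter?.questionTypes) {
            const allowed = filter.questionTypes
            result = result.filter((q) => allowed.includes(q.questionType))
        }
        if (filter?.limit !== undefined) {
            result = result.slice(0, filter.limit)
        }
        return result
    }

    getHaystackSessions(questionId: string): UnifiedSession[] {
        // Unknown ids yield an empty haystack rather than an error (a design choice of this sketch).
        return this.sessions.get(questionId) ?? []
    }

    getGroundTruth(questionId: string): string {
        const answer = this.answers.get(questionId)
        if (answer === undefined) {
            throw new Error(`Unknown question: ${questionId}`)
        }
        return answer
    }

    getQuestionTypes(): QuestionTypeRegistry {
        return { fact: { id: "fact", alias: "fact", description: "Fact recall" } }
    }
}
```

A caller would await load() once and then use the synchronous getters. Whether a missing question id should throw or return an empty value is not specified by the interface, so the sketch picks one behavior per method and comments on it.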

Existing Benchmarks

Benchmark	Source	Description
`locomo`	GitHub snap-research/locomo	Long context memory benchmark
`longmemeval`	HuggingFace xiaowu0162/longmemeval-cleaned	Long-term memory evaluation
`convomem`	HuggingFace Salesforce/ConvoMem	Conversational memory benchmark

Question Types

LoCoMo

Type	Alias	Description
`single-hop`	single	Single-hop fact recall
`multi-hop`	multi	Multi-hop reasoning
`temporal`	temporal	Temporal reasoning
`world-knowledge`	world	Commonsense knowledge
`adversarial`	adversarial	Unanswerable questions

LongMemEval

Type	Alias	Description
`single-session-user`	ss-user	Single-session user facts
`single-session-assistant`	ss-asst	Single-session assistant facts
`single-session-preference`	ss-pref	Single-session preferences
`multi-session`	multi	Multi-session reasoning
`temporal-reasoning`	temporal	Temporal reasoning
`knowledge-update`	update	Knowledge update tracking

ConvoMem

Type	Alias	Description
`user_evidence`	user	User-stated facts
`assistant_facts_evidence`	asst	Assistant-stated facts
`preference_evidence`	pref	User preferences
`changing_evidence`	change	Information updates
`implicit_connection_evidence`	implicit	Implicit reasoning
`abstention_evidence`	abstain	Unanswerable questions
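
Each table pairs a full type id with a shorter alias, and some aliases (`multi`, `temporal`) appear in more than one benchmark, so an alias only identifies a type within a single benchmark's registry. As a rough sketch of how a caller might accept either form, assuming the `{ [id]: { id, alias, description } }` shape returned by getQuestionTypes(): the helper name resolveQuestionType is hypothetical and not part of the project.

```typescript
type QuestionTypeRegistry = Record<string, { id: string; alias: string; description: string }>

// Entries copied from the LoCoMo table.
const locomoTypes: QuestionTypeRegistry = {
    "single-hop": { id: "single-hop", alias: "single", description: "Single-hop fact recall" },
    "multi-hop": { id: "multi-hop", alias: "multi", description: "Multi-hop reasoning" },
    "temporal": { id: "temporal", alias: "temporal", description: "Temporal reasoning" },
    "world-knowledge": { id: "world-knowledge", alias: "world", description: "Commonsense knowledge" },
    "adversarial": { id: "adversarial", alias: "adversarial", description: "Unanswerable questions" },
}

// Hypothetical helper: accepts a full type id or an alias and returns the full id,
// or undefined when nothing in this benchmark's registry matches.
function resolveQuestionType(registry: QuestionTypeRegistry, nameOrAlias: string): string | undefined {
    const match = Object.values(registry).find(
        (entry) => entry.id === nameOrAlias || entry.alias === nameOrAlias,
    )
    return match?.id
}
```

Because the lookup is scoped to one registry, `multi` resolves to `multi-hop` here and would resolve to `multi-session` against a LongMemEval registry built the same way.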
