MemvidBench

Benchmark tool for evaluating Memvid on the LoCoMo (Long-term Conversational Memory) benchmark.

Results

Memvid achieves 85.7% accuracy on LoCoMo, roughly a 28% relative improvement over leading memory systems (see the Comparison table below).

Category             Accuracy   Questions
Single-hop           80.14%     282
Multi-hop            80.37%     321
Temporal             71.88%     96
World-knowledge      91.08%     841
Adversarial          77.80%     446
Overall (Cat. 1-4)   85.65%     1,540

Following standard methodology, the adversarial category is excluded from the primary metric.

Configuration

  • Judge Model: gpt-4o-mini (lenient grading)
  • Answering Model: gpt-4o
  • Embedding Model: text-embedding-3-large
  • Search Mode: Hybrid (BM25 + Semantic)
  • Retrieval K: 60

Quick Start

# Install
bun install

# Set environment variables
export OPENAI_API_KEY=sk-...
export MEMVID_API_KEY=mv2_...      # Get from memvid.dev
export MEMVID_MEMORY_ID=...        # Your memory ID

# Run full benchmark (~3 hours)
bun run bench:full

# Or quick test (100 questions, ~10 min)
bun run bench:quick

You can also create a .env file with these variables.
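
For reference, a .env file with the same variables might look like the sketch below; the values shown are placeholders, not real keys.

# .env (placeholder values)
OPENAI_API_KEY=sk-...
MEMVID_API_KEY=mv2_...
MEMVID_MEMORY_ID=...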

Commands

# Full benchmark (1,986 questions)
bun run src/index.ts run -r my-run --force

# Limit to N questions
bun run src/index.ts run -r quick -l 100

# Sample N per category
bun run src/index.ts run -r sample -s 25

# List questions
bun run src/index.ts list -l 20

# Resume interrupted run
bun run src/index.ts run -r my-run

Options

-r, --run-id     Run identifier (required)
-j, --judge      Judge model (default: gpt-4o-mini)
-m, --model      Answering model (default: gpt-4o)
-l, --limit      Limit total questions
-s, --sample     Sample N per category
-t, --types      Filter by types (comma-separated)
--force          Clear checkpoint and start fresh
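
As a sketch of how these flags combine, the command below samples 25 questions per category and filters to a single type; the exact value accepted by -t is an assumption, so confirm the type names your dataset uses (for example via the list command above).

# Hypothetical combined run: sample 25 per category, temporal questions only
# (the "temporal" token passed to -t is an assumption, not a documented value)
bun run src/index.ts run -r temporal-sample -s 25 -t temporal -j gpt-4o-mini -m gpt-4o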

Output

Results are saved to data/runs/{run-id}/:

data/runs/my-run/
├── checkpoint.json   # Full evaluation data
└── report.json       # Summary metrics
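
If you have jq installed, a quick way to inspect a finished run is to pretty-print these files; the fields inside report.json and checkpoint.json are not documented here, so treat their structure as an assumption.

# Pretty-print the run summary and the full evaluation data (requires jq)
jq . data/runs/my-run/report.json
jq . data/runs/my-run/checkpoint.json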

Methodology

  • Categories 1-4 accuracy (excludes adversarial)
  • Lenient LLM-as-judge grading
  • Standard evaluation prompt
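
The headline number can be reproduced from the per-category results as a question-weighted average over categories 1-4; the one-liner below is just that arithmetic (it assumes bc is available).

# Question-weighted accuracy over categories 1-4
echo "scale=4; (282*0.8014 + 321*0.8037 + 96*0.7188 + 841*0.9108) / (282 + 321 + 96 + 841)" | bc
# => .8564  (≈ 85.65%)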

Comparison

System         LoCoMo Accuracy
Memvid         85.65%
Full-context   72.90%
Mem0ᵍ          68.44%
Mem0           66.88%
Zep            65.99%
LangMem        58.10%
OpenAI         52.90%

Baseline figures from arXiv:2504.19413. Some vendors dispute these results.

References

  • LoCoMo: Long-term Conversational Memory benchmark
  • arXiv:2504.19413 (source of the baseline comparison figures)

License

MIT
