A benchmark tool for evaluating Memvid on LoCoMo (Long-term Conversational Memory).

Memvid achieves 85.7% accuracy on LoCoMo, roughly 28% higher in relative terms than leading memory systems (see the comparison table below). Results by category:
| Category | Accuracy | Questions |
|---|---|---|
| Single-hop | 80.14% | 282 |
| Multi-hop | 80.37% | 321 |
| Temporal | 71.88% | 96 |
| World-knowledge | 91.08% | 841 |
| Adversarial | 77.80% | 446 |
| Overall (Cat. 1-4) | 85.65% | 1,540 |
Following standard methodology, the adversarial category is excluded from the primary metric; the overall figure is the question-weighted mean of categories 1-4 (quick check below).
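As a sanity check against the table above, the overall number falls out of the per-category figures directly (a standalone sketch, not the tool's code):

```ts
// Question-weighted accuracy over categories 1-4 (adversarial excluded).
const categories = [
  { name: "single-hop", acc: 0.8014, questions: 282 },
  { name: "multi-hop", acc: 0.8037, questions: 321 },
  { name: "temporal", acc: 0.7188, questions: 96 },
  { name: "world-knowledge", acc: 0.9108, questions: 841 },
];

const total = categories.reduce((sum, c) => sum + c.questions, 0); // 1,540
const overall =
  categories.reduce((sum, c) => sum + c.acc * c.questions, 0) / total;

console.log(`${(overall * 100).toFixed(2)}%`); // 85.65%
```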
Benchmark configuration:

- Judge Model: gpt-4o-mini (lenient grading)
- Answering Model: gpt-4o
- Embedding Model: text-embedding-3-large
- Search Mode: Hybrid (BM25 + semantic; a fusion sketch follows this list)
- Retrieval K: 60
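How Memvid fuses the two rankings is internal to the service, but reciprocal rank fusion (RRF) is a common way to combine BM25 and semantic results; a minimal sketch with hypothetical types:

```ts
// Reciprocal rank fusion (RRF) of a lexical and a semantic ranking.
// Note: the RRF constant 60 is the conventional default and is unrelated
// to the Retrieval K = 60 setting above.
type Hit = { id: string; score: number };

function fuse(bm25: Hit[], semantic: Hit[], c = 60): Hit[] {
  const scores = new Map<string, number>();
  for (const ranking of [bm25, semantic]) {
    ranking.forEach((hit, rank) => {
      // Each list contributes 1 / (c + rank) to a document's fused score.
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (c + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```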
```bash
# Install
bun install
# Set environment variables
export OPENAI_API_KEY=sk-...
export MEMVID_API_KEY=mv2_... # Get from memvid.dev
export MEMVID_MEMORY_ID=... # Your memory ID
# Run full benchmark (~3 hours)
bun run bench:full
# Or quick test (100 questions, ~10 min)
bun run bench:quick
```

You can also create a `.env` file with these variables.
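For example (placeholder values):

```
OPENAI_API_KEY=sk-...
MEMVID_API_KEY=mv2_...
MEMVID_MEMORY_ID=...
```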
```bash
# Full benchmark (1,986 questions)
bun run src/index.ts run -r my-run --force
# Limit to N questions
bun run src/index.ts run -r quick -l 100
# Sample N per category
bun run src/index.ts run -r sample -s 25
# List questions
bun run src/index.ts list -l 20
# Resume interrupted run
bun run src/index.ts run -r my-run
```

| Flag | Description |
|---|---|
| `-r, --run-id` | Run identifier (required) |
| `-j, --judge` | Judge model (default: gpt-4o-mini) |
| `-m, --model` | Answering model (default: gpt-4o) |
| `-l, --limit` | Limit total questions |
| `-s, --sample` | Sample N per category (sketch below) |
| `-t, --types` | Filter by types (comma-separated) |
| `--force` | Clear checkpoint and start fresh |
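The `-s` flag draws an even sample from each question category; conceptually it works like this (a hypothetical helper, not the tool's actual source):

```ts
// Take up to n questions per category, preserving dataset order.
type Question = { id: string; category: string };

function samplePerCategory(questions: Question[], n: number): Question[] {
  const taken = new Map<string, number>();
  return questions.filter((q) => {
    const count = taken.get(q.category) ?? 0;
    if (count >= n) return false; // category quota reached
    taken.set(q.category, count + 1);
    return true;
  });
}
```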
Results are saved to `data/runs/{run-id}/`:

```
data/runs/my-run/
├── checkpoint.json   # Full evaluation data
└── report.json       # Summary metrics
```
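Resuming works because each graded question is persisted to `checkpoint.json` before the run advances; a simplified version of the idea (schema and field names assumed):

```ts
import { existsSync, readFileSync, writeFileSync } from "node:fs";

type Result = { questionId: string; correct: boolean };
type Checkpoint = { results: Result[] };

// Load prior progress if a checkpoint exists, otherwise start fresh
// (--force would simply delete the file first).
function loadCheckpoint(path: string): Checkpoint {
  return existsSync(path)
    ? JSON.parse(readFileSync(path, "utf8"))
    : { results: [] };
}

async function runAll(
  questions: { id: string }[],
  path: string,
  evaluate: (q: { id: string }) => Promise<boolean>,
) {
  const checkpoint = loadCheckpoint(path);
  const done = new Set(checkpoint.results.map((r) => r.questionId));
  for (const q of questions) {
    if (done.has(q.id)) continue; // already graded in a previous run
    const correct = await evaluate(q); // retrieve, answer, judge (not shown)
    checkpoint.results.push({ questionId: q.id, correct });
    writeFileSync(path, JSON.stringify(checkpoint)); // persist every step
  }
}
```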
Scoring follows the standard LoCoMo evaluation:

- Categories 1-4 accuracy (excludes adversarial)
- Lenient LLM-as-judge grading (illustrated below)
- Standard evaluation prompt
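"Lenient" means the judge accepts paraphrases and partial wording rather than exact string matches; an illustrative judge call (the prompt wording here is ours, not necessarily the tool's):

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Illustrative lenient LLM-as-judge; the tool's actual prompt may differ.
async function judge(
  question: string,
  goldAnswer: string,
  modelAnswer: string,
): Promise<boolean> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content:
          `Question: ${question}\n` +
          `Gold answer: ${goldAnswer}\n` +
          `Model answer: ${modelAnswer}\n\n` +
          `Reply CORRECT if the model answer conveys the meaning of the ` +
          `gold answer, even with different wording; otherwise reply WRONG.`,
      },
    ],
  });
  return res.choices[0]?.message.content?.trim() === "CORRECT";
}
```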
How this compares with published baselines:

| System | LoCoMo Accuracy |
|---|---|
| Memvid | 85.65% |
| Full-context | 72.90% |
| Mem0ᵍ | 68.44% |
| Mem0 | 66.88% |
| Zep | 65.99% |
| LangMem | 58.10% |
| OpenAI | 52.90% |
Baseline figures from arXiv:2504.19413. Some vendors dispute these results.
- LoCoMo Dataset
- LoCoMo Paper - Maharana et al., ACL 2024
MIT