A benchmark tool for evaluating Memvid on LoCoMo (Long-term Conversational Memory).

Memvid achieves 85.7% accuracy on LoCoMo, roughly 28% higher in relative terms than leading memory systems (see the comparison table below). Results by category:
| Category | Accuracy | Questions |
|---|---|---|
| Single-hop | 80.14% | 282 |
| Multi-hop | 80.37% | 321 |
| Temporal | 71.88% | 96 |
| World-knowledge | 91.08% | 841 |
| Adversarial | 77.80% | 446 |
| Overall (Cat. 1-4) | 85.65% | 1,540 |
Following standard methodology, the adversarial category is excluded from the primary metric; the overall figure is the question-weighted mean of categories 1-4 (quick check below).
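As a sanity check against the table above, the overall number falls out of the per-category figures directly (a standalone sketch, not the tool's code):

```ts
// Question-weighted accuracy over categories 1-4 (adversarial excluded).
const categories = [
  { name: "single-hop", acc: 0.8014, questions: 282 },
  { name: "multi-hop", acc: 0.8037, questions: 321 },
  { name: "temporal", acc: 0.7188, questions: 96 },
  { name: "world-knowledge", acc: 0.9108, questions: 841 },
];

const total = categories.reduce((sum, c) => sum + c.questions, 0); // 1,540
const overall =
  categories.reduce((sum, c) => sum + c.acc * c.questions, 0) / total;

console.log(`${(overall * 100).toFixed(2)}%`); // 85.65%
```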
Benchmark configuration:

- Judge Model: gpt-4o-mini (lenient grading)
- Answering Model: gpt-4o
- Embedding Model: text-embedding-3-large
- Search Mode: Hybrid (BM25 + semantic; a fusion sketch follows this list)
- Retrieval K: 60
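How Memvid fuses the two rankings is internal to the service, but reciprocal rank fusion (RRF) is a common way to combine BM25 and semantic results; a minimal sketch with hypothetical types:

```ts
// Reciprocal rank fusion (RRF) of a lexical and a semantic ranking.
// Note: the RRF constant 60 is the conventional default and is unrelated
// to the Retrieval K = 60 setting above.
type Hit = { id: string; score: number };

function fuse(bm25: Hit[], semantic: Hit[], c = 60): Hit[] {
  const scores = new Map<string, number>();
  for (const ranking of [bm25, semantic]) {
    ranking.forEach((hit, rank) => {
      // Each list contributes 1 / (c + rank) to a document's fused score.
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (c + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```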
```bash
# Install
bun install
# Set environment variables
export OPENAI_API_KEY=sk-...
export MEMVID_API_KEY=mv2_... # Get from memvid.dev
export MEMVID_MEMORY_ID=... # Your memory ID
# Run full benchmark (~3 hours)
bun run bench:full
# Or quick test (100 questions, ~10 min)
bun run bench:quick
```

You can also create a `.env` file with these variables.
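For example (placeholder values):

```
OPENAI_API_KEY=sk-...
MEMVID_API_KEY=mv2_...
MEMVID_MEMORY_ID=...
```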
```bash
# Full benchmark (1,986 questions)
bun run src/index.ts run -r my-run --force
# Limit to N questions
bun run src/index.ts run -r quick -l 100
# Sample N per category
bun run src/index.ts run -r sample -s 25
# List questions
bun run src/index.ts list -l 20
# Resume interrupted run
bun run src/index.ts run -r my-run
```

| Flag | Description |
|---|---|
| `-r, --run-id` | Run identifier (required) |
| `-j, --judge` | Judge model (default: gpt-4o-mini) |
| `-m, --model` | Answering model (default: gpt-4o) |
| `-l, --limit` | Limit total questions |
| `-s, --sample` | Sample N per category (sketch below) |
| `-t, --types` | Filter by types (comma-separated) |
| `--force` | Clear checkpoint and start fresh |
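The `-s` flag draws an even sample from each question category; conceptually it works like this (a hypothetical helper, not the tool's actual source):

```ts
// Take up to n questions per category, preserving dataset order.
type Question = { id: string; category: string };

function samplePerCategory(questions: Question[], n: number): Question[] {
  const taken = new Map<string, number>();
  return questions.filter((q) => {
    const count = taken.get(q.category) ?? 0;
    if (count >= n) return false; // category quota reached
    taken.set(q.category, count + 1);
    return true;
  });
}
```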
Results are saved to `data/runs/{run-id}/`:

```
data/runs/my-run/
├── checkpoint.json   # Full evaluation data
└── report.json       # Summary metrics
```
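Resuming works because each graded question is persisted to `checkpoint.json` before the run advances; a simplified version of the idea (schema and field names assumed):

```ts
import { existsSync, readFileSync, writeFileSync } from "node:fs";

type Result = { questionId: string; correct: boolean };
type Checkpoint = { results: Result[] };

// Load prior progress if a checkpoint exists, otherwise start fresh
// (--force would simply delete the file first).
function loadCheckpoint(path: string): Checkpoint {
  return existsSync(path)
    ? JSON.parse(readFileSync(path, "utf8"))
    : { results: [] };
}

async function runAll(
  questions: { id: string }[],
  path: string,
  evaluate: (q: { id: string }) => Promise<boolean>,
) {
  const checkpoint = loadCheckpoint(path);
  const done = new Set(checkpoint.results.map((r) => r.questionId));
  for (const q of questions) {
    if (done.has(q.id)) continue; // already graded in a previous run
    const correct = await evaluate(q); // retrieve, answer, judge (not shown)
    checkpoint.results.push({ questionId: q.id, correct });
    writeFileSync(path, JSON.stringify(checkpoint)); // persist every step
  }
}
```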
Scoring follows the standard LoCoMo evaluation:

- Categories 1-4 accuracy (excludes adversarial)
- Lenient LLM-as-judge grading (illustrated below)
- Standard evaluation prompt
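"Lenient" means the judge accepts paraphrases and partial wording rather than exact string matches; an illustrative judge call (the prompt wording here is ours, not necessarily the tool's):

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Illustrative lenient LLM-as-judge; the tool's actual prompt may differ.
async function judge(
  question: string,
  goldAnswer: string,
  modelAnswer: string,
): Promise<boolean> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content:
          `Question: ${question}\n` +
          `Gold answer: ${goldAnswer}\n` +
          `Model answer: ${modelAnswer}\n\n` +
          `Reply CORRECT if the model answer conveys the meaning of the ` +
          `gold answer, even with different wording; otherwise reply WRONG.`,
      },
    ],
  });
  return res.choices[0]?.message.content?.trim() === "CORRECT";
}
```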
How this compares with published baselines:

| System | LoCoMo Accuracy |
|---|---|
| Memvid | 85.65% |
| Full-context | 72.90% |
| Mem0ᵍ | 68.44% |
| Mem0 | 66.88% |
| Zep | 65.99% |
| LangMem | 58.10% |
| OpenAI | 52.90% |
Baseline figures from arXiv:2504.19413. Some vendors dispute these results.
- LoCoMo Dataset
- LoCoMo Paper - Maharana et al., ACL 2024
MIT