AMBench — Agent Memory Benchmark

AMBench is the public benchmark and leaderboard surface for agent-memory systems. The exe-os product repo should not carry public benchmark adapters or results beyond tiny product smoke tests; those belong in this repo instead.

What lives here

harness/exe-os/ — exe-os benchmark adapters copied out of the product repo.
schemas/result.schema.json — canonical result metadata schema.
scripts/miss_logger.py — normalizes benchmark misses into hard-negative JSONL.
scripts/official_answer_judge.py — official-mode answer/judge candidate generator.
scripts/export_for_exe_embeddings.py — exports hard negatives to Exe-Embedding-v1 train/val format.
results/ and reports/ — public or publishable benchmark outputs.

Result policy

Every result must state whether it is:

diagnostic_retrieval — internal retrieval quality, not directly comparable to official generated-answer leaderboards.
official_answer — generated answer scored with benchmark-native or judge evaluator.
official_task — task success/process score for agentic benchmarks like MemoryArena.

No SOTA claim should mix diagnostic retrieval numbers with official answer/task leaderboards.
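
As an illustration of the policy above, here is a minimal sketch of a guard that refuses to compare diagnostic retrieval numbers with official results. The `result_type` field name and both helper functions are assumptions made for this example; the canonical field names are whatever schemas/result.schema.json defines.

```python
from collections import defaultdict

# The three result types named in the result policy.
RESULT_TYPES = {"diagnostic_retrieval", "official_answer", "official_task"}


def group_by_result_type(results):
    """Group result records by declared type; reject missing or unknown labels.

    Assumes each record is a dict with a hypothetical "result_type" key.
    """
    groups = defaultdict(list)
    for record in results:
        result_type = record.get("result_type")
        if result_type not in RESULT_TYPES:
            raise ValueError(f"result has no valid result_type: {record!r}")
        groups[result_type].append(record)
    return dict(groups)


def check_comparable(results):
    """Raise if diagnostic retrieval results are mixed with official ones."""
    groups = group_by_result_type(results)
    if "diagnostic_retrieval" in groups and len(groups) > 1:
        raise ValueError(
            "diagnostic_retrieval results cannot be compared with "
            "official_answer or official_task results"
        )
    return groups
```

A report generator could call `check_comparable` on the rows of a table before rendering it, so that a mixed comparison fails loudly instead of ending up in a published claim.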

Hard-negative loop

Benchmark misses become Exe-Embedding-v1 training data:

python3 scripts/miss_logger.py raw_misses.jsonl --out data/processed/hard_negatives.jsonl
python3 scripts/export_for_exe_embeddings.py data/processed/hard_negatives.jsonl --out-dir ../Exe-Embedding-v1/data/ambench

Each row should include query, positive evidence, hard negatives ranked above the positive, benchmark, category, and memory-length bucket.
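
To make the row contents concrete, here is a rough sketch of how one miss could be turned into a hard-negative JSONL line. It is not the logic of scripts/miss_logger.py: the function name, the key names, and the choice to treat every retrieved candidate as a hard negative when the positive was not retrieved at all are assumptions for illustration only.

```python
import json


def make_hard_negative_row(query, ranked_ids, positive_id, texts,
                           benchmark, category, memory_length_bucket):
    """Build one row whose hard negatives are the candidates ranked above the positive.

    ranked_ids: candidate ids in retrieval order, best first.
    texts: mapping from candidate id to its text.
    All key names in the returned dict are hypothetical.
    """
    if positive_id in ranked_ids:
        cutoff = ranked_ids.index(positive_id)
    else:
        # Assumption: if the positive was never retrieved, every
        # retrieved candidate counts as ranked above it.
        cutoff = len(ranked_ids)
    return {
        "query": query,
        "positive": texts[positive_id],
        "hard_negatives": [texts[cid] for cid in ranked_ids[:cutoff]],
        "benchmark": benchmark,
        "category": category,
        "memory_length_bucket": memory_length_bucket,
    }


def to_jsonl_line(row):
    """Serialize a row as a single JSONL line (no trailing newline)."""
    return json.dumps(row, ensure_ascii=False)
```

A miss where the positive was retrieved at rank 1 yields an empty `hard_negatives` list, so a caller would probably skip such rows rather than write them out.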

