AMBench is the public benchmark and leaderboard surface for agent-memory systems. The exe-os product repo should not carry public benchmark adapters/results beyond tiny product smoke tests.
- harness/exe-os/: exe-os benchmark adapters copied out of the product repo.
- schemas/result.schema.json: canonical result metadata schema.
- scripts/miss_logger.py: normalizes benchmark misses into hard-negative JSONL.
- scripts/official_answer_judge.py: official-mode answer/judge candidate generator.
- scripts/export_for_exe_embeddings.py: exports hard negatives to Exe-Embedding-v1 train/val format.
- results/ and reports/: public or publishable benchmark outputs.
Every result must state whether it is:
- diagnostic_retrieval: internal retrieval quality, not directly comparable to official generated-answer leaderboards.
- official_answer: generated answer scored with a benchmark-native or judge evaluator.
- official_task: task success/process score for agentic benchmarks like MemoryArena.
No SOTA claim should mix diagnostic retrieval numbers with official answer/task leaderboards.
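
The rule above can be checked mechanically before any comparison is published. The following is a minimal sketch, not the repo's actual validator: the `mode` field name, the `ResultMode` enum, and the `check_claim` helper are assumptions made for illustration, and the real field names are whatever schemas/result.schema.json defines.

```python
from enum import Enum


class ResultMode(str, Enum):
    """The three result modes listed above."""
    DIAGNOSTIC_RETRIEVAL = "diagnostic_retrieval"
    OFFICIAL_ANSWER = "official_answer"
    OFFICIAL_TASK = "official_task"


def check_claim(results: list[dict]) -> set[ResultMode]:
    """Return the modes present, or raise if diagnostic and official numbers are mixed.

    Assumes each result dict carries its mode under a hypothetical "mode" key.
    An unknown mode string raises ValueError; a missing key raises KeyError.
    """
    modes = {ResultMode(r["mode"]) for r in results}
    if ResultMode.DIAGNOSTIC_RETRIEVAL in modes and len(modes) > 1:
        found = sorted(m.value for m in modes)
        raise ValueError(f"diagnostic retrieval mixed with official results: {found}")
    return modes
```

Calling `check_claim` on a list that contains both a `diagnostic_retrieval` result and an `official_answer` result raises `ValueError`, while a list of only official results passes.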
Benchmark misses become Exe Embeddings v1 training data:
    python3 scripts/miss_logger.py raw_misses.jsonl --out data/processed/hard_negatives.jsonl
    python3 scripts/export_for_exe_embeddings.py data/processed/hard_negatives.jsonl --out-dir ../Exe-Embedding-v1/data/ambench

Each row should include query, positive evidence, hard negatives ranked above the positive, benchmark, category, and memory-length bucket.
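
To make the row contents concrete, here is a rough sketch of one hard-negative row serialized as a JSONL line. The key names (`query`, `positive`, `hard_negatives`, `benchmark`, `category`, `memory_length_bucket`), the `HardNegativeRow` class, and the sample values are all hypothetical; the actual output of scripts/miss_logger.py may use different names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HardNegativeRow:
    """One benchmark miss; field names are illustrative assumptions."""
    query: str
    positive: str                 # evidence that should have been retrieved
    hard_negatives: list[str]     # items that ranked above the positive
    benchmark: str
    category: str
    memory_length_bucket: str


def to_jsonl_line(row: HardNegativeRow) -> str:
    """Serialize a row as a single JSON line, rejecting rows with no hard negatives."""
    if not row.hard_negatives:
        raise ValueError("a miss must have at least one hard negative ranked above the positive")
    return json.dumps(asdict(row), ensure_ascii=False)


# Made-up example values, for illustration only.
row = HardNegativeRow(
    query="Where did the user say they parked?",
    positive="User: I parked on level 3 of the north garage.",
    hard_negatives=["User: I usually park on the street."],
    benchmark="MemoryArena",
    category="single-hop",
    memory_length_bucket="short",
)
line = to_jsonl_line(row)
```

Rejecting rows with an empty hard-negative list is a design choice in this sketch: a row without any item ranked above the positive is not a miss and carries no contrastive signal.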