Reproducible entity-linking benchmarks with LamAPI retrieval and multiple runners.
- Configure `.env` with the LamAPI retrieval endpoint and token:
  - `ENTITY_RETRIEVAL_ENDPOINT=...`
  - `ENTITY_RETRIEVAL_TOKEN=...`
- Build datasets: `make build-datasets`
- Run a smoke test: `make run-editsim DATASET=mv MAX_ROWS=5 NIL_THRESHOLD=0.2 FORCE_GT=1`
- Evaluate: `make eval PRED=outputs/mv/editsim/<hash>/predictions.csv GT=data/mv/gt.csv`
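The `<hash>` in the evaluate command is the settings hash of the run (see Outputs below). A minimal sketch for evaluating the most recent run, assuming the `outputs/{dataset}/{method}/{settings_hash}/` layout and one directory per run:

```bash
# Hypothetical helper: evaluate the most recently written editsim run on the mv dataset.
HASH=$(ls -t outputs/mv/editsim/ | head -n 1)
make eval PRED=outputs/mv/editsim/$HASH/predictions.csv GT=data/mv/gt.csv
```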
Runners:

```bash
make run-llm DATASET=mv MAX_ROWS=5 MODEL=gpt-oss-120b
make run-crocodile DATASET=mv MAX_ROWS=5
make run-alligator DATASET=mv MAX_ROWS=5
make run-editsim DATASET=mv MAX_ROWS=5 NIL_THRESHOLD=0.2
```

Common flags (all runners):
- `--max-rows` for smoke tests
- `--force-gt-candidate` to force GT ids into candidate sets
- `--force-id Qxxxx` (repeatable) to add extra forced ids
Makefile equivalents: `FORCE_GT=1`, `FORCE_ID="Q1 Q2"`, as in the example below.
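For instance, a small run that forces both the ground-truth ids and two extra candidates might look like this (the Wikidata ids are the placeholder values from the line above):

```bash
# Illustrative combination of the Makefile variables above on a 5-row smoke test.
make run-editsim DATASET=mv MAX_ROWS=5 FORCE_GT=1 FORCE_ID="Q1 Q2"
```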
Outputs:
- `outputs/{dataset}/{method}/{settings_hash}/predictions.csv`
- `outputs/{dataset}/{method}/{settings_hash}/report.json`
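As a quick sanity check after a run, the report and predictions can be inspected from the shell; the dataset, method, and `<settings_hash>` below are placeholders for whichever run you just produced:

```bash
# Hypothetical inspection of a finished run (substitute your own settings hash).
ls outputs/mv/editsim/                                              # one directory per settings hash
python -m json.tool outputs/mv/editsim/<settings_hash>/report.json  # pretty-print the report
head -n 5 outputs/mv/editsim/<settings_hash>/predictions.csv        # peek at the predictions
```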
Frozen datasets live in `data/mv/`, `data/cp/`, and `data/sn/`.