NeoBERT evaluation currently focuses on GLUE and MTEB.
```bash
python scripts/evaluation/run_glue.py configs/glue/cola.yaml
bash scripts/evaluation/glue/run_quick_glue.sh configs/glue
bash scripts/evaluation/glue/run_all_glue.sh configs/glue
```

- GLUE always runs with SDPA attention in the classifier wrappers; non-SDPA `model.attn_backend` requests are normalized away with a warning.
- Pretrained local checkpoints are required unless either `glue.allow_random_weights: true` or `model.from_hub: true` is set.
- GLUE checkpoints are written to `trainer.output_dir/checkpoints/<step>/`.
- Legacy `model_checkpoints/<step>/` paths are still accepted when loading older artifacts.
- Results are stored under `trainer.output_dir` as JSON metrics.
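The fields above can be collected in a single GLUE config. A hypothetical fragment — only the fields discussed above come from these notes; the surrounding structure and values are illustrative:

```yaml
# Illustrative GLUE config fragment; field grouping is assumed.
model:
  from_hub: false
  attn_backend: sdpa        # non-SDPA requests are normalized away with a warning
glue:
  allow_random_weights: false   # set true only for intentional random-init runs
trainer:
  output_dir: outputs/glue/cola # checkpoints land in checkpoints/<step>/
```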
```bash
python scripts/evaluation/glue/summarize_glue.py outputs/glue/<run>

bash scripts/evaluation/glue/build_configs.sh outputs/my_sweep my-tag \
  --config-output-dir configs/glue/generated \
  --tasks cola,qnli

python scripts/evaluation/run_mteb.py \
  configs/pretraining/pretrain_neobert.yaml \
  --model_name_or_path outputs/<pretrain_run>
```

- The runner loads checkpoints from `<model_name_or_path>/checkpoints/`.
- Task family selection is read from the config field `mteb_task_type`; `--task_types` can override the config selection at launch time. It accepts categories (`classification`, `retrieval`, `sts`, `all`) and/or explicit task names (comma-separated).
- The output path is currently derived from run dir + checkpoint + max length: `outputs/<run>/mteb/<ckpt>/<max_length>/`.
- If using a local tokenizer, point `tokenizer.name` to that path.
`scripts/evaluation/pseudo_perplexity.py` can load NeoBERT checkpoints from the current portable step layout (`checkpoints/<step>/model.safetensors`) and falls back to legacy DeepSpeed ZeRO conversion only when portable weights are absent. The legacy fallback requires the optional `neobert[legacy-checkpoints]` extra.
When `checkpoint_path` points at a checkpoint root, `--checkpoint latest` first honors a legacy DeepSpeed `latest` file when present; otherwise it resolves to the newest loadable numbered step. If `checkpoint_path` already points at a specific step directory, pass the matching `--checkpoint` tag; explicit non-`latest` tags that are missing fail fast instead of silently loading the direct path.
- Wrong checkpoint path: verify `glue.pretrained_checkpoint_dir`, `glue.pretrained_checkpoint`, and `glue.pretrained_model_path` in the GLUE configs.
- Flat/random GLUE metrics: confirm that pretrained weights were actually loaded (or intentionally set `allow_random_weights: true`).
- OOM during eval: reduce the eval batch size and/or sequence length.
- Attention backend confusion: the GLUE path is SDPA-oriented; packed flash varlen is a training optimization.
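For the first item, a pre-flight check can catch bad paths before a run starts. This helper is hypothetical; it treats only the `*_dir` / `*_path` fields as filesystem paths, since `glue.pretrained_checkpoint` may be a step tag rather than a path:

```python
from pathlib import Path

def missing_glue_paths(cfg: dict) -> list[str]:
    # Hypothetical sanity check: report path-valued GLUE config fields
    # that do not exist on disk. glue.pretrained_checkpoint is skipped
    # because its semantics (path vs. step tag) are not assumed here.
    path_keys = ("pretrained_checkpoint_dir", "pretrained_model_path")
    glue_cfg = cfg.get("glue", {})
    return [
        f"{key}={glue_cfg[key]}"
        for key in path_keys
        if glue_cfg.get(key) and not Path(glue_cfg[key]).exists()
    ]
```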