
Evaluation Guide

NeoBERT evaluation currently focuses on GLUE and MTEB.

GLUE

Run one task

python scripts/evaluation/run_glue.py configs/glue/cola.yaml

Run quick/full suites

bash scripts/evaluation/glue/run_quick_glue.sh configs/glue
bash scripts/evaluation/glue/run_all_glue.sh configs/glue

Important GLUE behavior

  • GLUE classifier wrappers always run with SDPA attention; any non-SDPA model.attn_backend request is overridden to SDPA with a warning.
  • Pretrained local checkpoints are required unless either glue.allow_random_weights: true or model.from_hub: true.
  • GLUE checkpoints are written to trainer.output_dir/checkpoints/<step>/.
  • Legacy model_checkpoints/<step>/ paths are still accepted when loading older artifacts.
  • Results are stored under trainer.output_dir as JSON metrics.
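Putting the settings above together, a hedged sketch of the relevant GLUE config fields (the layout and values are illustrative placeholders, not defaults; only the field names mentioned in this guide are taken from the source):

```yaml
model:
  from_hub: false            # set true to load weights from the Hub instead of a local checkpoint
  attn_backend: sdpa         # non-SDPA values are overridden to SDPA in GLUE classifier wrappers
glue:
  allow_random_weights: false    # set true only to intentionally run without pretrained weights
  pretrained_checkpoint_dir: outputs/<pretrain_run>/checkpoints
trainer:
  output_dir: outputs/glue/<run>    # checkpoints land in <output_dir>/checkpoints/<step>/
```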

Summarize GLUE outputs

python scripts/evaluation/glue/summarize_glue.py outputs/glue/<run>

Build generated GLUE configs from sweeps

bash scripts/evaluation/glue/build_configs.sh outputs/my_sweep my-tag \
  --config-output-dir configs/glue/generated \
  --tasks cola,qnli

MTEB

Run MTEB

python scripts/evaluation/run_mteb.py \
  configs/pretraining/pretrain_neobert.yaml \
  --model_name_or_path outputs/<pretrain_run>

Important MTEB behavior

  • Runner loads checkpoints from <model_name_or_path>/checkpoints/.
  • Task family selection is read from config field mteb_task_type.
  • --task_types can override config selection at launch time. Accepts categories (classification, retrieval, sts, all) and/or explicit task names (comma-separated).
  • Output path is currently derived from run dir + checkpoint + max length: outputs/<run>/mteb/<ckpt>/<max_length>/.
  • If using a local tokenizer, point tokenizer.name to that path.
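For example, overriding the config's task selection at launch time might look like the following (paths are placeholders following the patterns above):

python scripts/evaluation/run_mteb.py \
  configs/pretraining/pretrain_neobert.yaml \
  --model_name_or_path outputs/<pretrain_run> \
  --task_types classification,sts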

Pseudo-Perplexity Utility

scripts/evaluation/pseudo_perplexity.py loads NeoBERT checkpoints from the current portable step layout (checkpoints/<step>/model.safetensors) and falls back to legacy DeepSpeed ZeRO conversion only when portable weights are absent.

  • The legacy fallback requires the optional neobert[legacy-checkpoints] extra.
  • When checkpoint_path points at a checkpoint root, --checkpoint latest first honors a legacy DeepSpeed latest file when present; otherwise it resolves to the newest loadable numbered step.
  • When checkpoint_path already points at a specific step directory, pass the matching --checkpoint tag; explicit non-latest tags that are missing fail fast instead of silently loading the direct path.
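The resolution order described above can be sketched as a small standalone function (this is a hypothetical helper for illustration, not the script's actual API):

```python
from pathlib import Path


def resolve_checkpoint(root: str, tag: str = "latest") -> Path:
    """Resolve a checkpoint step directory under `root`.

    Mirrors the described order: a `latest` tag first honors a legacy
    DeepSpeed `latest` file, then falls back to the newest numbered step;
    explicit tags must exist or we fail fast.
    """
    root_dir = Path(root)
    if tag == "latest":
        # A legacy DeepSpeed run leaves a `latest` file naming the newest tag.
        latest_file = root_dir / "latest"
        if latest_file.is_file():
            return root_dir / latest_file.read_text().strip()
        # Otherwise pick the highest-numbered step directory.
        steps = [p for p in root_dir.iterdir() if p.is_dir() and p.name.isdigit()]
        if not steps:
            raise FileNotFoundError(f"no numbered checkpoints under {root_dir}")
        return max(steps, key=lambda p: int(p.name))
    # Explicit non-latest tags must exist; fail fast instead of guessing.
    step = root_dir / tag
    if not step.is_dir():
        raise FileNotFoundError(f"checkpoint tag {tag!r} not found under {root_dir}")
    return step
```
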

Common Evaluation Pitfalls

  1. Wrong checkpoint path
  • Verify glue.pretrained_checkpoint_dir, glue.pretrained_checkpoint, and glue.pretrained_model_path in GLUE configs.
  2. Flat/random GLUE metrics
  • Confirm pretrained weights were actually loaded (or intentionally set allow_random_weights: true).
  3. OOM during eval
  • Reduce eval batch size and/or sequence length.
  4. Attention backend confusion
  • The GLUE path is SDPA-oriented; packed flash varlen is a training optimization.

Related Docs