
Evaluation Guide

NeoBERT evaluation currently focuses on GLUE and MTEB.

GLUE

Run one task

python scripts/evaluation/run_glue.py configs/glue/cola.yaml

Run quick/full suites

bash scripts/evaluation/glue/run_quick_glue.sh configs/glue
bash scripts/evaluation/glue/run_all_glue.sh configs/glue

Important GLUE behavior

  • GLUE classifier wrappers always run with SDPA attention; any non-SDPA model.attn_backend request is overridden to SDPA with a warning.
  • Pretrained local checkpoints are required unless either glue.allow_random_weights: true or model.from_hub: true.
  • GLUE checkpoints are written to trainer.output_dir/checkpoints/<step>/.
  • Legacy model_checkpoints/<step>/ paths are still accepted when loading older artifacts.
  • Results are stored under trainer.output_dir as JSON metrics.
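Putting the settings above together, a hedged sketch of the relevant GLUE config fields (the layout and values are illustrative placeholders, not defaults; only the field names mentioned in this guide are taken from the source):

```yaml
model:
  from_hub: false            # set true to load weights from the Hub instead of a local checkpoint
  attn_backend: sdpa         # non-SDPA values are overridden to SDPA in GLUE classifier wrappers
glue:
  allow_random_weights: false    # set true only to intentionally run without pretrained weights
  pretrained_checkpoint_dir: outputs/<pretrain_run>/checkpoints
trainer:
  output_dir: outputs/glue/<run>    # checkpoints land in <output_dir>/checkpoints/<step>/
```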

Summarize GLUE outputs

python scripts/evaluation/glue/summarize_glue.py outputs/glue/<run>

Build generated GLUE configs from sweeps

bash scripts/evaluation/glue/build_configs.sh outputs/my_sweep my-tag \
  --config-output-dir configs/glue/generated \
  --tasks cola,qnli

MTEB

Run MTEB

python scripts/evaluation/run_mteb.py \
  configs/pretraining/pretrain_neobert.yaml \
  --model_name_or_path outputs/<pretrain_run>

Important MTEB behavior

  • Runner loads checkpoints from <model_name_or_path>/checkpoints/.
  • Task family selection is read from config field mteb_task_type.
  • --task_types can override config selection at launch time. Accepts categories (classification, retrieval, sts, all) and/or explicit task names (comma-separated).
  • Output path is currently derived from run dir + checkpoint + max length: outputs/<run>/mteb/<ckpt>/<max_length>/.
  • If using a local tokenizer, point tokenizer.name to that path.
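For example, overriding the config's task selection at launch time might look like the following (paths are placeholders following the patterns above):

python scripts/evaluation/run_mteb.py \
  configs/pretraining/pretrain_neobert.yaml \
  --model_name_or_path outputs/<pretrain_run> \
  --task_types classification,sts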

Pseudo-Perplexity Utility

scripts/evaluation/pseudo_perplexity.py loads NeoBERT checkpoints from the current portable step layout (checkpoints/<step>/model.safetensors) and falls back to legacy DeepSpeed ZeRO conversion only when portable weights are absent.

  • The legacy fallback requires the optional neobert[legacy-checkpoints] extra.
  • When checkpoint_path points at a checkpoint root, --checkpoint latest first honors a legacy DeepSpeed latest file when present; otherwise it resolves to the newest loadable numbered step.
  • When checkpoint_path already points at a specific step directory, pass the matching --checkpoint tag; explicit non-latest tags that are missing fail fast instead of silently loading the direct path.
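The resolution order described above can be sketched as a small standalone function (this is a hypothetical helper for illustration, not the script's actual API):

```python
from pathlib import Path


def resolve_checkpoint(root: str, tag: str = "latest") -> Path:
    """Resolve a checkpoint step directory under `root`.

    Mirrors the described order: a `latest` tag first honors a legacy
    DeepSpeed `latest` file, then falls back to the newest numbered step;
    explicit tags must exist or we fail fast.
    """
    root_dir = Path(root)
    if tag == "latest":
        # A legacy DeepSpeed run leaves a `latest` file naming the newest tag.
        latest_file = root_dir / "latest"
        if latest_file.is_file():
            return root_dir / latest_file.read_text().strip()
        # Otherwise pick the highest-numbered step directory.
        steps = [p for p in root_dir.iterdir() if p.is_dir() and p.name.isdigit()]
        if not steps:
            raise FileNotFoundError(f"no numbered checkpoints under {root_dir}")
        return max(steps, key=lambda p: int(p.name))
    # Explicit non-latest tags must exist; fail fast instead of guessing.
    step = root_dir / tag
    if not step.is_dir():
        raise FileNotFoundError(f"checkpoint tag {tag!r} not found under {root_dir}")
    return step
```
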

Common Evaluation Pitfalls

  1. Wrong checkpoint path
  • Verify glue.pretrained_checkpoint_dir, glue.pretrained_checkpoint, and glue.pretrained_model_path in GLUE configs.
  2. Flat/random GLUE metrics
  • Confirm pretrained weights were actually loaded (or intentionally set allow_random_weights: true).
  3. OOM during eval
  • Reduce eval batch size and/or sequence length.
  4. Attention backend confusion
  • The GLUE path is SDPA-oriented; packed flash varlen is a training optimization.

Related Docs