
evaluation-harness

Here are 14 public repositories matching this topic...

VLA ≠ VLM. A side-by-side viewer that runs NVIDIA Alpamayo R1 (vision-language-action) alongside Qwen2.5-VL (vision-language) on the same 44-second SF dashcam clip sampled at 5 Hz, giving 220 paired traces. It surfaces what an action-trained model sees that a scene-trained model doesn't, and vice versa.

  • Updated May 8, 2026
  • HTML

frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without additional infrastructure.

  • Updated Feb 19, 2026
  • Python

Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regression gates for CI, Phoenix/Langfuse exporters. Built for intent classifiers but works on any classification task.

  • Updated May 20, 2026
  • TypeScript
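
As an illustration of how a confusion matrix can feed a regression gate in CI, here is a minimal sketch in Python. It does not reflect that repository's API (which is TypeScript); the names `confusion_matrix`, `accuracy`, and `regression_gate`, and the `max_drop` threshold, are all assumptions made for the example.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count occurrences of each (true_label, predicted_label) pair."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be the same length")
    return Counter(zip(y_true, y_pred))


def accuracy(cm):
    """Fraction of predictions on the diagonal of the confusion matrix."""
    total = sum(cm.values())
    correct = sum(count for (true, pred), count in cm.items() if true == pred)
    return correct / total if total else 0.0


def regression_gate(baseline_acc, candidate_acc, max_drop=0.01):
    """Return True when the candidate is within max_drop of the baseline.

    max_drop is a hypothetical tolerance; a CI job would fail the build
    when this returns False.
    """
    return candidate_acc >= baseline_acc - max_drop
```

A CI step could compute `accuracy(confusion_matrix(labels, predictions))` for the candidate, compare it against a stored baseline with `regression_gate`, and exit non-zero on failure.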
