evaluation-harness

Here are 14 public repositories matching this topic...

najeed / ai-agent-eval-harness

The open-source MultiAgentOps evaluation and verification harness for any industry business workflow.

Updated May 23, 2026
Python

tpertner / confess

Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.

python yaml calibration alignment metamorphic-testing model-evaluation ai-safety red-teaming prompt-injection hallucination-detection llm-evals evaluation-harness

Updated Feb 22, 2026
Python

rafaelmaranon / alpamayo-trace

VLA ≠ VLM. Side-by-side viewer running NVIDIA Alpamayo R1 (vision-language-action) alongside Qwen2.5-VL (vision-language) on the same 44-sec SF dashcam clip at 5 Hz. 220 paired traces. Surfaces what an action-trained model sees that a scene-trained model doesn't, and vice versa.

machine-learning robotics autonomous-vehicles vla self-driving vlm multimodal huggingface qwen vision-language-action end-to-end-driving alpamayo evaluation-harness av-evaluation

Updated May 8, 2026
HTML

tjkuhns / explodable

AI content engine using an anxiety-indexed behavioral science KB, multi-stage LangGraph pipeline, and calibrated LLM-as-judge evaluation harness

python knowledge-graph claude rag streamlit ai-engineering supabase behavioral-science anthropic pgvector langgraph llm-as-judge evaluation-harness buyer-psychology

Updated Apr 19, 2026
Python

thylinao1 / TCM

An LLM-powered training-evaluation platform that scores open-ended scenario responses 0 to 10 against rubrics, with an evaluation harness that benchmarks the AI scorer against human-labelled scores.

data-science cloud typescript json-schema zod vercel openai-api llm llm-evaluation llm-as-judge evaluation-harness

Updated May 23, 2026
TypeScript

Arnav-Ajay / rag-retrieval-eval

A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.

ai-systems failure-analysis rag-evaluation evaluation-harness retrieval-observability

Updated Jan 10, 2026
Python

EaCognitive / Metivta-Eval

Evaluation harness for domain-specific RAG and QA systems with benchmark datasets, scoring, and regression workflows.

benchmarking domain-qa retrieval-augmented-generation llm-evaluation rag-evaluation evaluation-harness ai-evals

Updated Mar 8, 2026
Python

DaScient / OMEN

jsp2195 / frontier-evals-harness

frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without infrastructure.

reproducible-research model-evaluation llm-evaluation llm-benchmarking statistical-evaluation evaluation-harness

Updated Feb 19, 2026
Python

SihyeonJeon / industrial-rag-gate

Authority-aware RAG evaluation for industrial manual questions

retrieval rag industrial-ai llm-evaluation rag-evaluation safety-evaluation evaluation-harness citation-checking

Updated May 24, 2026
Python

LewallenAE / dv-eval-harness

Production-shaped DV agent evaluation harness with simulator adapter boundary, trajectory scoring, reward decomposition, and JSONL trace persistence.

python eda ai-agents design-verification fastapi rlhf llm-evaluation evaluation-harness

Updated May 8, 2026
Python

Arnav-Ajay / rag-reranking-playground

Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority—not recall—changes retrieval outcomes.

bm25 reranking rag failure-analysis hybrid-retrieval evaluation-harness

Updated Jan 23, 2026
Python

bnovik0v / ABC-GenBench

Runnable benchmark toolkit for monophonic ABC melody generation and editing.

benchmark music-generation abc-notation controllable-generation symbolic-music generative-ai llm-evaluation evaluation-harness

Updated Apr 1, 2026
Python

reaatech / classifier-evals

Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regression gates for CI, Phoenix/Langfuse exporters. Built for intent classifiers but works on any classification task.

classifier typescript ci-cd testing-tools regression-testing confusion-matrix observability intent-classification mlops llm-eval langfuse llm-as-judge arize-phoenix agentic-ai evaluation-harness

Updated May 20, 2026
TypeScript
