Evaluation Infrastructure for AI Agents
Updated Feb 25, 2026 - TypeScript
Benchmarking Open-Ended Inference Optimization by AI Agents
Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.
Eval-first AI agent that triages property maintenance emails. The real work is the eval system around it: trace-driven error analysis, code graders and validated LLM-as-judge (TPR/TNR), component and end-to-end evals, a failure taxonomy, and a CI regression gate. LangGraph, FastAPI, Langfuse.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.
End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.
Offline, auditable benchmark for one-shot LLM market decisions.
Collection of frameworks and tools for AI evaluations, including tool-use, agentic AI, MCP, and multimodal.
Lightweight eval framework for LLMs & AI apps combining deterministic scoring, LLM-as-judge, and regression testing.
Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.
Harbor-format AI evaluation tasks for synthetic adtech revenue operations workflows.
AI-powered remote job aggregator with true-remote filtering, fuzzy skill match, and a working AI Evals Engineer portfolio (Hamel/Shreya methodology, calibrated judges, failure taxonomy).
A lightweight workbench for dataset-driven agent and LLM evaluation.
LLM extraction pipeline for job postings; ~80% skill classification accuracy via chain-of-thought scaffolding.
Multi-agent system orchestrating an AI-driven software team using the Claude Agents SDK. Agents take on defined roles and collaborate autonomously on software tasks.
Evaluation harness for domain-specific RAG and QA systems with benchmark datasets, scoring, and regression workflows.