ai-evals

Here are 35 public repositories matching this topic...

solana8800 / langeval

Evaluation Infrastructure for AI Agents

ai-evaluation agent-evaluation ai-evals

Updated Feb 25, 2026
TypeScript

aisa-group / InferenceBench

Benchmarking Open-Ended Inference Optimization by AI Agents

benchmarks ai-safety vllm sglang claude-code codex-cli ai-evals ai-research-automation

Updated May 16, 2026
Python

productfoundry101 / ai-evals-bootcamp

Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.

bootcamp red-teaming rag prompt-engineering llmops ai-product-management llm-evaluation claude-code ai-pm ai-evals

Updated May 6, 2026

yiouli / pixie-qa

Agent skill for AI agent development

skill dev eval llm agent-skills ai-evals

Updated Apr 22, 2026
HTML

mohsinsheikhani / property-maintenance-agent

Eval-first AI agent that triages property maintenance emails. The real work is the eval system around it: trace-driven error analysis, code graders and validated LLM-as-judge (TPR/TNR), component and end-to-end evals, a failure taxonomy, and a CI regression gate. LangGraph, FastAPI, Langfuse.

python evaluation openai ai-agents pydantic fastapi ai-engineering prompt-engineering llmops langfuse llm-evaluation langgraph llm-as-a-judge llm-observability agentic-ai context-engineering ai-evals

Updated May 21, 2026
Python

RafaelParonis / jailbench

🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.

python flask analytics openai alignment model-evaluation ai-safety security-testing red-teaming model-robustness anthropic litellm content-safety llm-jailbreaks tool-calling llm-benchmark ai-evals textual-tui

Updated May 22, 2026
Python

vibheksoni / jailbench

Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.

Updated Aug 12, 2025
Python

MohsinCreed / LangfuseOllama

Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.

docker open-source self-hosted free no-cost local-llm ollama langfuse llm-evaluation prompt-evaluation offline-ai llm-as-judge llm-observability ai-evals

Updated Apr 13, 2026
TypeScript

vitron-ai / aip-foundry-themis-starter

Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.

react typescript schema-validation themis contract-testing osdk developer-tooling agentic-workflows ai-evals foundry-workflows

Updated Mar 28, 2026
TypeScript

SuperfiedStudd / ai-evals-orchestration

End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.

gemini openai multi-model transcription human-in-the-loop model-comparison supabase anthropic llm-evaluation ai-evals evaluation-pipeline

Updated Mar 10, 2026
TypeScript

ishtiaqrahman / capitalbench

Offline, auditable benchmark for one-shot LLM market decisions.

finance benchmark reproducibility llm-evaluation ai-evals capitalbench

Updated May 19, 2026
Python

danielrosehill / Awesome-AI-Evaluations-Tools

Collection of frameworks and tools for AI evaluations, including tool use, agentic AI, MCP, and multimodal.

evaluations evals ai-evals

Updated May 18, 2026
Python

vineethcv / eval-engine

Lightweight eval framework for LLMs & AI apps combining deterministic scoring, LLM-as-judge, and regression testing.

python testing evaluation openai llm ai-quality evals ai-evals ai-quality-assurance

Updated Apr 8, 2026
Python

vishal-labade / llm_exp_platform_v2

Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.

experimentation causal-inference product-analytics llm-evaluation llm-benchmarking ai-evals

Updated Mar 8, 2026
Python

davidspiegs / adtech-eval-lab

Harbor-format AI evaluation tasks for synthetic adtech revenue operations workflows

benchmarking adtech harbor synthetic-data revenue-operations ai-evals

Updated May 21, 2026
Python

majdukovic / job-radar

AI-powered remote job aggregator with true-remote filtering, fuzzy skill match, and a working AI Evals Engineer portfolio (Hamel/Shreya methodology, calibrated judges, failure taxonomy).

typescript nextjs job-search posthog supabase ai-evaluation llm inngest prompt-engineering anthropic ai-evals

Updated May 14, 2026
HTML

IsaacCavallaro / agent-evals-workbench

A lightweight workbench for dataset-driven agent and LLM evaluation.

python cli regression-testing llm-evals agent-evals openai-compatible ai-evals eval-harness

Updated May 1, 2026
Python

AlejandroFuentePinero / ai-jie

LLM extraction pipeline for job postings; ~80% skill classification accuracy via chain-of-thought scaffolding.

structured-output pydantic ayncio prompt-engineering ai-evals

Updated Apr 9, 2026
Python

RamyaLakshmiKS / agentic_software_team

Multi-agent system orchestrating an AI-driven software team using the Claude Agents SDK. Agents take on defined roles and collaborate autonomously on software tasks.

jira ai orchestration multi-agent confluence atlassian llm generative-ai anthropic llm-agents agentic-ai mcp-server ai-evals claude-agent-sdk

Updated Feb 4, 2026
Python

EaCognitive / Metivta-Eval

Evaluation harness for domain-specific RAG and QA systems with benchmark datasets, scoring, and regression workflows.

benchmarking domain-qa retrieval-augmented-generation llm-evaluation rag-evaluation evaluation-harness ai-evals

Updated Mar 8, 2026
Python
