Harbor is a framework for running agent evaluations and creating and using RL environments.
Official Implementation of "CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion"
Spoox CLI - Terminal Agent - SPlit lOOp eXand agent
Trajectories for running OpenHands on Terminal Bench
Fast, Multi-Cloud Sandboxes for AI Agents.
Codex-style REPL for terminal-agent models trained with camel-ai TerminalToolkit / terminal-bench terminus-2 protocols. Built to drive HansBug/OpenClaw-RL checkpoints.
A reproducible Terminal-Bench task that evaluates a Bash script for parsing log files.
Autonomous coding agent for Terminal-Bench 2.0 — 61.6% ± 1.9 on the official leaderboard with GPT-5.1 Codex Mini.
⚓ Harbor benchmark adapter for running AgentPlane as a reproducible coding-agent harness with evidence artifacts.
Multi-layer audit framework for agent benchmark integrity
Central repository for Project Terminus task submissions. These environments and Oracle solutions are designed to evaluate state-of-the-art AI agents (like GPT-5.2 and Claude Opus 4.6) on complex, multi-step engineering challenges within a sandboxed terminal.
Transparent benchmark results for Vorflux — SWE-bench Verified (91%) & Terminal Bench 2 (86%)
Multi-agent reasoning MCP server for Claude Code. Spawns parallel research agents to find knowledge LLMs don't have. +23.1% on Terminal Bench 2.0 SWE tasks.