Harbor is a framework for running agent evaluations and creating and using RL environments.
Official Implementation of "CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion"
Spoox CLI - Terminal Agent - SPlit lOOp eXand agent
Trajectories for running OpenHands on Terminal Bench
Fast, Multi-Cloud Sandboxes for AI Agents.
Codex-style REPL for terminal-agent models trained with camel-ai TerminalToolkit / terminal-bench terminus-2 protocols. Built to drive HansBug/OpenClaw-RL checkpoints.
A reproducible Terminal-Bench task that evaluates a Bash script for parsing log files.
Autonomous coding agent for Terminal-Bench 2.0 — 61.6% ± 1.9 on the official leaderboard with GPT-5.1 Codex Mini.
⚓ Harbor benchmark adapter for running AgentPlane as a reproducible coding-agent harness with evidence artifacts.
Multi-layer audit framework for agent benchmark integrity
Central repository for Project Terminus task submissions. These environments and Oracle solutions are designed to evaluate state-of-the-art AI agents (like GPT-5.2 and Claude Opus 4.6) on complex, multi-step engineering challenges within a sandboxed terminal.
Transparent benchmark results for Vorflux — SWE-bench Verified (91%) & Terminal Bench 2 (86%)
Multi-agent reasoning MCP server for Claude Code. Spawns parallel research agents to find knowledge LLMs don't have. +23.1% on Terminal Bench 2.0 SWE tasks.