open-compass/Terminal-Bench-server
Terminal-Bench Server

Terminal-Bench Server is a FastAPI service for running and evaluating Terminal-Bench 2.0 tasks. It uses the Terminus 2 agent to execute a task inside a Docker container, copies the task's tests/ directory into /tests in that same container, runs the task-provided /tests/test.sh, and determines pass or fail from /logs/verifier/reward.txt or /logs/verifier/reward.json.

Directory Structure

Terminal-Bench-server/
├── terminal_bench_service.py   # FastAPI main service
├── task_loader.py              # Task data loader
├── utils.py                    # Docker client and image cache
├── .env                        # Environment configuration
├── .env.example                # Example environment configuration
├── agents/
│   ├── terminal_bench_agent.py # mini-swe-agent runner (deprecated)
│   └── terminus_2_agent.py     # Terminus 2 agent runner
├── evaluators/
│   └── terminal_bench.py       # In-place verifier (runs /tests/test.sh and reads reward)
├── data/
│   └── terminal-bench-2-main/  # Task data directory
│       ├── build_images.sh     # Script to build all task images
│       ├── chess-best-move/
│       │   ├── instruction.md
│       │   ├── task.toml
│       │   └── tests/
│       ├── ...
├── requirements.txt
├── Dockerfile
└── README.md

Prerequisites

  1. Python Environment

    • Python 3.11+
    • A running Docker daemon
  2. Task Data Is Available

    • Make sure Terminal-Bench-server/data/terminal-bench-2-main/ exists
    • It should contain each task's instruction.md, task.toml, tests/, and related files

Installation

cd Terminal-Bench-server

# Install dependencies
pip install -r requirements.txt

# Copy the config file
cp .env.example .env
# Edit .env with your settings

Configuration

Configure via .env (Recommended)

Edit .env:

# Docker image configuration
# If DOCKER_REGISTRY is set, images from task.toml will be mapped to this registry first
# Example: registry.h.pjlab.org.cn/ailab-opencompass-opencompass_proxy/terminal_bench_2
# If not set, the service will prefer docker_image from task.toml directly
# (for example: alexgshaw/task-name:version)
DOCKER_REGISTRY=

# Whether to pull if the image is not available locally
# If set to false and the image is missing locally, the service will fail and ask you to build it manually
PULL_IF_NOT_EXISTS=true

# Network mode for task containers
# Common values: bridge, host, none, <custom-network-name>
CONTAINER_NETWORK_MODE=bridge

# Task data path (optional, defaults to ./data/terminal-bench-2-main)
TB2_DIR=

# Service configuration
LOG_LEVEL=INFO
THREAD_POOL_MAX_WORKERS=4
TIMEOUT_KEEP_ALIVE=5
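The settings above can be sketched as a small loader. The following is a minimal, illustrative example using only the standard library; the `load_settings` helper and its defaults mirror the .env keys documented here, not the service's actual code:

```python
import os

def load_settings(environ=os.environ):
    """Illustrative parser for the .env keys described above."""
    def as_bool(value, default):
        # Treat unset or empty values as "use the default"
        if value is None or value == "":
            return default
        return value.strip().lower() in ("1", "true", "yes")

    return {
        "docker_registry": environ.get("DOCKER_REGISTRY") or None,
        "pull_if_not_exists": as_bool(environ.get("PULL_IF_NOT_EXISTS"), True),
        "container_network_mode": environ.get("CONTAINER_NETWORK_MODE", "bridge"),
        "tb2_dir": environ.get("TB2_DIR") or "./data/terminal-bench-2-main",
        "log_level": environ.get("LOG_LEVEL", "INFO"),
        "thread_pool_max_workers": int(environ.get("THREAD_POOL_MAX_WORKERS", "4")),
        "timeout_keep_alive": int(environ.get("TIMEOUT_KEEP_ALIVE", "5")),
    }
```

A loader like this keeps all defaults in one place, so an empty .env (as shipped in .env.example) still yields a working configuration.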

Important Notes:

  • LLM configuration such as API key, model, and base URL is passed through the AgentCompass API request via llm_config
  • You do not need to put LLM settings in environment variables
  • Images must be built manually with build_images.sh; the service does not build images automatically

Image Resolution Priority

The service currently resolves images in the following order:

  1. If DOCKER_REGISTRY is set and the task task.toml contains environment.docker_image:

    • First try the mapped private image: DOCKER_REGISTRY:<task-name>-<version>
    • Example: registry.h.pjlab.org.cn/.../terminal_bench_2:cancel-async-tasks-20251031
    • If that fails, fall back to the original image from task.toml: alexgshaw/cancel-async-tasks:20251031
  2. If DOCKER_REGISTRY is not set and task.toml contains environment.docker_image:

    • Use the full image name from task.toml directly
  3. If the task does not provide environment.docker_image:

    • Fall back to the default registry: registry.h.pjlab.org.cn/ailab-opencompass-opencompass_proxy/terminal_bench_2:<task_name>
  4. If the candidate image does not exist locally:

    • Try pulling it if PULL_IF_NOT_EXISTS=true
    • Fail immediately if PULL_IF_NOT_EXISTS=false
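The resolution order above can be sketched as a small helper. The `candidate_images` function below is an illustrative assumption reconstructed from the examples in this section, not the service's actual implementation:

```python
def candidate_images(task_name, docker_image=None, docker_registry=None,
                     default_registry="registry.h.pjlab.org.cn/ailab-opencompass-opencompass_proxy/terminal_bench_2"):
    """Return candidate image names in the priority order described above.

    `docker_image` is the environment.docker_image value from task.toml,
    e.g. "alexgshaw/cancel-async-tasks:20251031".
    """
    candidates = []
    if docker_image:
        if docker_registry:
            # Map e.g. alexgshaw/cancel-async-tasks:20251031
            # to <registry>:cancel-async-tasks-20251031
            name, _, version = docker_image.rpartition(":")
            short = name.rsplit("/", 1)[-1]
            candidates.append(f"{docker_registry}:{short}-{version}")
        # Fall back to the original image from task.toml
        candidates.append(docker_image)
    else:
        # No docker_image in task.toml: use the default registry
        candidates.append(f"{default_registry}:{task_name}")
    return candidates
```

Each candidate is then tried in order; whether a missing image is pulled or reported as an error depends on PULL_IF_NOT_EXISTS.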

Evaluation Logic

The current evaluation logic matches Harbor:

  1. Start the task container and run the agent inside it.
  2. After the agent finishes, keep that container instead of starting a separate evaluation container.
  3. Copy the task's tests/ directory into /tests inside the same container.
  4. Run /tests/test.sh in that same container.
  5. Read /logs/verifier/reward.txt or /logs/verifier/reward.json from the same container.
  6. Mark the task as passed if reward >= 1; otherwise mark it as failed.
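Steps 5 and 6 can be sketched as a small reward parser. The file formats assumed below (a bare number in reward.txt, a "reward" key in reward.json) are assumptions based on this description, not a guarantee about the verifier's exact output:

```python
import json

def parse_reward(text, filename="reward.txt"):
    """Parse a reward read from /logs/verifier/reward.txt or reward.json.

    Assumes reward.txt holds a bare number and reward.json holds either a
    bare number or an object with a "reward" key.
    """
    if filename.endswith(".json"):
        data = json.loads(text)
        return float(data["reward"]) if isinstance(data, dict) else float(data)
    return float(text.strip())

def is_resolved(reward):
    # Step 6 above: the task passes when reward >= 1
    return reward >= 1
```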

Docker Image Management

Build Images

cd Terminal-Bench-server/data/terminal-bench-2-main

# Build all task images (the version suffix is read automatically from task.toml)
bash build_images.sh

Built images use this format: registry.h.pjlab.org.cn/.../terminal_bench_2:<task-name>-<version>

Start the Service

Run Locally

python terminal_bench_service.py --host 0.0.0.0 --port 8080

You can also override uvicorn keep-alive explicitly if needed:

python terminal_bench_service.py --host 0.0.0.0 --port 8080 --timeout-keep-alive 5

Run with Docker

# Build the image
docker build -t terminal-bench-server .

# Run the container (Docker socket mount is required)
docker run -d \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --env-file .env \
  terminal-bench-server

API Usage

1. Health Check

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "service": "Terminal-Bench"
}

2. Run a Task

Full Request (manually provide problem_statement)

curl -X POST http://localhost:8080/api/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "params": {
      "metadata": {
        "task_name": "chess-best-move",
        "problem_statement": "The file chess_board.png has an image..."
      }
    },
    "llm_config": {
      "model_name": "gpt-4",
      "api_key": "your-api-key",
      "url": "https://api.openai.com/v1",
      "model_infer_params": {
        "temperature": 0.0,
        "top_p": 1.0
      }
    },
    "max_steps": 100
  }'

Simplified Request (auto-load problem_statement)

curl -X POST http://localhost:8080/api/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "params": {
      "metadata": {
        "task_name": "chess-best-move"
      }
    },
    "llm_config": {
      "model_name": "gpt-4",
      "api_key": "your-api-key"
    }
  }'

Response:

{
  "final_answer": "True",
  "trajectory": [
    {
      "step": 1,
      "role": "user",
      "content": "The file chess_board.png..."
    },
    {
      "step": 2,
      "role": "assistant",
      "content": "I'll analyze the chess board..."
    },
    ...
    {
      "step": 15,
      "action": "evaluation",
      "output": "{'resolved': True, 'output': '...'}"
    }
  ],
  "call_stat": {
    "model": "gpt-4",
    "api_calls": 12,
    "total_cost": 0.45
  }
}
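For scripting, the same request can be sent from Python with the standard library. The `build_task_request` and `run_task` helpers below are hypothetical and simply mirror the curl examples above:

```python
import json
import urllib.request

def build_task_request(task_name, model_name, api_key,
                       url=None, problem_statement=None, max_steps=100):
    """Assemble the /api/tasks payload shown in the curl examples above."""
    metadata = {"task_name": task_name}
    if problem_statement is not None:
        metadata["problem_statement"] = problem_statement
    llm_config = {"model_name": model_name, "api_key": api_key}
    if url is not None:
        llm_config["url"] = url
    return {"params": {"metadata": metadata},
            "llm_config": llm_config,
            "max_steps": max_steps}

def run_task(service_url, payload, timeout=7200):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Usage: `result = run_task("http://localhost:8080/api/tasks", build_task_request("chess-best-move", "gpt-4", "your-api-key"))`, then check `result["final_answer"] == "True"`. Note the long timeout: agent runs can take hours.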

Integration with AgentCompass

Terminal-Bench-server is designed to integrate directly with AgentCompass.

AgentCompass Request Example

curl -X POST "http://localhost:8001/api/tasks/batch" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark": "terminal_bench_2",
    "models": ["glm-4.7"],
    "params": {
      "benchmark_params": {
        "resume": true,
        "category": "all",
        "max_concurrency": 6,
        "k": 1,
        "avgk": true,
        "service_url": "http://localhost:8080/api/tasks",
        "request_timeout": 7200,
        "max_steps": 250,
        "limit": 0
      },
      "model_infer_params": {
        "temperature": 1
      }
    }
  }'

Request Format

The current terminal_bench_2 adapter in AgentCompass sends only one minimal task identifier:

  • params.task_id

For compatibility with older requests, the service also accepts these sources:

  • params.metadata.task_name
  • params.metadata.instance_id
  • params.metadata.terminal_bench_task_id
  • params.metadata.task_id
  • The corresponding params-level fields with the same names

The following example is closer to the current real request shape:

{
  "params": {
    "task_id": "chess-best-move"
  },
  "llm_config": {
    "model_name": "gpt-4",
    "api_key": "...",
    "url": "..."
  },
  "max_steps": 100
}
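The fallback lookup can be sketched as follows. `extract_task_id` is a hypothetical helper, and the exact precedence the service applies among these fields is an assumption:

```python
def extract_task_id(request):
    """Resolve the task identifier from any of the accepted request fields.

    Checks params.task_id first (the current adapter's field), then the
    metadata- and params-level compatibility fields listed above.
    """
    params = request.get("params") or {}
    metadata = params.get("metadata") or {}
    candidates = [params.get("task_id")]
    for key in ("task_name", "instance_id", "terminal_bench_task_id", "task_id"):
        candidates.append(metadata.get(key))
        candidates.append(params.get(key))
    for value in candidates:
        if value:
            return value
    raise ValueError("no task identifier found in request")
```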

Response Fields

  • final_answer: "True" means the task passed, "False" means it failed
  • trajectory: the full execution trace, including agent steps and the final verifier output
  • call_stat: API call statistics such as number of calls and cost
