open-compass/Terminal-Bench-server
Terminal-Bench Server

Terminal-Bench Server is a FastAPI service for running and evaluating Terminal-Bench 2.0 tasks. It uses the Terminus 2 agent to execute a task inside a Docker container, copies the task's tests/ directory into /tests in that same container, runs the task-provided /tests/test.sh, and determines pass or fail from /logs/verifier/reward.txt or /logs/verifier/reward.json.

Directory Structure

Terminal-Bench-server/
├── terminal_bench_service.py   # FastAPI main service
├── task_loader.py              # Task data loader
├── utils.py                    # Docker client and image cache
├── .env                        # Environment configuration
├── .env.example                # Example environment configuration
├── agents/
│   ├── terminal_bench_agent.py # mini-swe-agent runner (deprecated)
│   └── terminus_2_agent.py     # Terminus 2 agent runner
├── evaluators/
│   └── terminal_bench.py       # In-place verifier (runs /tests/test.sh and reads reward)
├── data/
│   └── terminal-bench-2-main/  # Task data directory
│       ├── build_images.sh     # Script to build all task images
│       ├── chess-best-move/
│       │   ├── instruction.md
│       │   ├── task.toml
│       │   └── tests/
│       ├── ...
├── requirements.txt
├── Dockerfile
└── README.md

Prerequisites

  1. Python Environment

    • Python 3.11+
    • A running Docker daemon
  2. Task Data Is Available

    • Make sure Terminal-Bench-server/data/terminal-bench-2-main/ exists
    • It should contain each task's instruction.md, task.toml, tests/, and related files

Installation

cd Terminal-Bench-server

# Install dependencies
pip install -r requirements.txt

# Copy the config file
cp .env.example .env
# Edit .env with your settings

Configuration

Configure via .env (Recommended)

Edit .env:

# Docker image configuration
# If DOCKER_REGISTRY is set, images from task.toml will be mapped to this registry first
# Example: registry.h.pjlab.org.cn/ailab-opencompass-opencompass_proxy/terminal_bench_2
# If not set, the service will prefer docker_image from task.toml directly
# (for example: alexgshaw/task-name:version)
DOCKER_REGISTRY=

# Whether to pull if the image is not available locally
# If set to false and the image is missing locally, the service will fail and ask you to build it manually
PULL_IF_NOT_EXISTS=true

# Network mode for task containers
# Common values: bridge, host, none, <custom-network-name>
CONTAINER_NETWORK_MODE=bridge

# Task data path (optional, defaults to ./data/terminal-bench-2-main)
TB2_DIR=

# Service configuration
LOG_LEVEL=INFO
THREAD_POOL_MAX_WORKERS=4
TIMEOUT_KEEP_ALIVE=5
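The settings above can be sketched as a small loader. The following is a minimal, illustrative example using only the standard library; the `load_settings` helper and its defaults mirror the .env keys documented here, not the service's actual code:

```python
import os

def load_settings(environ=os.environ):
    """Illustrative parser for the .env keys described above."""
    def as_bool(value, default):
        # Treat unset or empty values as "use the default"
        if value is None or value == "":
            return default
        return value.strip().lower() in ("1", "true", "yes")

    return {
        "docker_registry": environ.get("DOCKER_REGISTRY") or None,
        "pull_if_not_exists": as_bool(environ.get("PULL_IF_NOT_EXISTS"), True),
        "container_network_mode": environ.get("CONTAINER_NETWORK_MODE", "bridge"),
        "tb2_dir": environ.get("TB2_DIR") or "./data/terminal-bench-2-main",
        "log_level": environ.get("LOG_LEVEL", "INFO"),
        "thread_pool_max_workers": int(environ.get("THREAD_POOL_MAX_WORKERS", "4")),
        "timeout_keep_alive": int(environ.get("TIMEOUT_KEEP_ALIVE", "5")),
    }
```

A loader like this keeps all defaults in one place, so an empty .env (as shipped in .env.example) still yields a working configuration.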

Important Notes:

  • LLM configuration such as API key, model, and base URL is passed through the AgentCompass API request via llm_config
  • You do not need to put LLM settings in environment variables
  • Images must be built manually with build_images.sh; the service does not build images automatically

Image Resolution Priority

The service currently resolves images in the following order:

  1. If DOCKER_REGISTRY is set and the task task.toml contains environment.docker_image:

    • First try the mapped private image: DOCKER_REGISTRY:<task-name>-<version>
    • Example: registry.h.pjlab.org.cn/.../terminal_bench_2:cancel-async-tasks-20251031
    • If that fails, fall back to the original image from task.toml: alexgshaw/cancel-async-tasks:20251031
  2. If DOCKER_REGISTRY is not set and task.toml contains environment.docker_image:

    • Use the full image name from task.toml directly
  3. If the task does not provide environment.docker_image:

    • Fall back to the default registry: registry.h.pjlab.org.cn/ailab-opencompass-opencompass_proxy/terminal_bench_2:<task_name>
  4. If the candidate image does not exist locally:

    • Try pulling it if PULL_IF_NOT_EXISTS=true
    • Fail immediately if PULL_IF_NOT_EXISTS=false
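The resolution order above can be sketched as a small helper. The `candidate_images` function below is an illustrative assumption reconstructed from the examples in this section, not the service's actual implementation:

```python
def candidate_images(task_name, docker_image=None, docker_registry=None,
                     default_registry="registry.h.pjlab.org.cn/ailab-opencompass-opencompass_proxy/terminal_bench_2"):
    """Return candidate image names in the priority order described above.

    `docker_image` is the environment.docker_image value from task.toml,
    e.g. "alexgshaw/cancel-async-tasks:20251031".
    """
    candidates = []
    if docker_image:
        if docker_registry:
            # Map e.g. alexgshaw/cancel-async-tasks:20251031
            # to <registry>:cancel-async-tasks-20251031
            name, _, version = docker_image.rpartition(":")
            short = name.rsplit("/", 1)[-1]
            candidates.append(f"{docker_registry}:{short}-{version}")
        # Fall back to the original image from task.toml
        candidates.append(docker_image)
    else:
        # No docker_image in task.toml: use the default registry
        candidates.append(f"{default_registry}:{task_name}")
    return candidates
```

Each candidate is then tried in order; whether a missing image is pulled or reported as an error depends on PULL_IF_NOT_EXISTS.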

Evaluation Logic

The current evaluation logic matches Harbor:

  1. Start the task container and run the agent inside it.
  2. After the agent finishes, keep that container instead of starting a separate evaluation container.
  3. Copy the task's tests/ directory into /tests inside the same container.
  4. Run /tests/test.sh in that same container.
  5. Read /logs/verifier/reward.txt or /logs/verifier/reward.json from the same container.
  6. Mark the task as passed if reward >= 1; otherwise mark it as failed.
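Steps 5 and 6 can be sketched as a small reward parser. The file formats assumed below (a bare number in reward.txt, a "reward" key in reward.json) are assumptions based on this description, not a guarantee about the verifier's exact output:

```python
import json

def parse_reward(text, filename="reward.txt"):
    """Parse a reward read from /logs/verifier/reward.txt or reward.json.

    Assumes reward.txt holds a bare number and reward.json holds either a
    bare number or an object with a "reward" key.
    """
    if filename.endswith(".json"):
        data = json.loads(text)
        return float(data["reward"]) if isinstance(data, dict) else float(data)
    return float(text.strip())

def is_resolved(reward):
    # Step 6 above: the task passes when reward >= 1
    return reward >= 1
```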

Docker Image Management

Build Images

cd Terminal-Bench-server/data/terminal-bench-2-main

# Build all task images (the version suffix is read automatically from task.toml)
bash build_images.sh

Built images use this format: registry.h.pjlab.org.cn/.../terminal_bench_2:<task-name>-<version>

Start the Service

Run Locally

python terminal_bench_service.py --host 0.0.0.0 --port 8080

You can also override uvicorn keep-alive explicitly if needed:

python terminal_bench_service.py --host 0.0.0.0 --port 8080 --timeout-keep-alive 5

Run with Docker

# Build the image
docker build -t terminal-bench-server .

# Run the container (Docker socket mount is required)
docker run -d \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --env-file .env \
  terminal-bench-server

API Usage

1. Health Check

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "service": "Terminal-Bench"
}

2. Run a Task

Full Request (manually provide problem_statement)

curl -X POST http://localhost:8080/api/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "params": {
      "metadata": {
        "task_name": "chess-best-move",
        "problem_statement": "The file chess_board.png has an image..."
      }
    },
    "llm_config": {
      "model_name": "gpt-4",
      "api_key": "your-api-key",
      "url": "https://api.openai.com/v1",
      "model_infer_params": {
        "temperature": 0.0,
        "top_p": 1.0
      }
    },
    "max_steps": 100
  }'

Simplified Request (auto-load problem_statement)

curl -X POST http://localhost:8080/api/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "params": {
      "metadata": {
        "task_name": "chess-best-move"
      }
    },
    "llm_config": {
      "model_name": "gpt-4",
      "api_key": "your-api-key"
    }
  }'

Response:

{
  "final_answer": "True",
  "trajectory": [
    {
      "step": 1,
      "role": "user",
      "content": "The file chess_board.png..."
    },
    {
      "step": 2,
      "role": "assistant",
      "content": "I'll analyze the chess board..."
    },
    ...
    {
      "step": 15,
      "action": "evaluation",
      "output": "{'resolved': True, 'output': '...'}"
    }
  ],
  "call_stat": {
    "model": "gpt-4",
    "api_calls": 12,
    "total_cost": 0.45
  }
}
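For scripting, the same request can be sent from Python with the standard library. The `build_task_request` and `run_task` helpers below are hypothetical and simply mirror the curl examples above:

```python
import json
import urllib.request

def build_task_request(task_name, model_name, api_key,
                       url=None, problem_statement=None, max_steps=100):
    """Assemble the /api/tasks payload shown in the curl examples above."""
    metadata = {"task_name": task_name}
    if problem_statement is not None:
        metadata["problem_statement"] = problem_statement
    llm_config = {"model_name": model_name, "api_key": api_key}
    if url is not None:
        llm_config["url"] = url
    return {"params": {"metadata": metadata},
            "llm_config": llm_config,
            "max_steps": max_steps}

def run_task(service_url, payload, timeout=7200):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Usage: `result = run_task("http://localhost:8080/api/tasks", build_task_request("chess-best-move", "gpt-4", "your-api-key"))`, then check `result["final_answer"] == "True"`. Note the long timeout: agent runs can take hours.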

Integration with AgentCompass

Terminal-Bench-server is designed to integrate directly with AgentCompass.

AgentCompass Request Example

curl -X POST "http://localhost:8001/api/tasks/batch" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark": "terminal_bench_2",
    "models": ["glm-4.7"],
    "params": {
      "benchmark_params": {
        "resume": true,
        "category": "all",
        "max_concurrency": 6,
        "k": 1,
        "avgk": true,
        "service_url": "http://localhost:8080/api/tasks",
        "request_timeout": 7200,
        "max_steps": 250,
        "limit": 0
      },
      "model_infer_params": {
        "temperature": 1
      }
    }
  }'

Request Format

The current terminal_bench_2 adapter in AgentCompass sends only one minimal task identifier:

  • params.task_id

For compatibility with older requests, the service also accepts these sources:

  • params.metadata.task_name
  • params.metadata.instance_id
  • params.metadata.terminal_bench_task_id
  • params.metadata.task_id
  • The corresponding params-level fields with the same names

The following example is closer to the current real request shape:

{
  "params": {
    "task_id": "chess-best-move"
  },
  "llm_config": {
    "model_name": "gpt-4",
    "api_key": "...",
    "url": "..."
  },
  "max_steps": 100
}
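The fallback lookup can be sketched as follows. `extract_task_id` is a hypothetical helper, and the exact precedence the service applies among these fields is an assumption:

```python
def extract_task_id(request):
    """Resolve the task identifier from any of the accepted request fields.

    Checks params.task_id first (the current adapter's field), then the
    metadata- and params-level compatibility fields listed above.
    """
    params = request.get("params") or {}
    metadata = params.get("metadata") or {}
    candidates = [params.get("task_id")]
    for key in ("task_name", "instance_id", "terminal_bench_task_id", "task_id"):
        candidates.append(metadata.get(key))
        candidates.append(params.get(key))
    for value in candidates:
        if value:
            return value
    raise ValueError("no task identifier found in request")
```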

Response Fields

  • final_answer: "True" means the task passed, "False" means it failed
  • trajectory: the full execution trace, including agent steps and the final verifier output
  • call_stat: API call statistics such as number of calls and cost
