Terminal-Bench Server is a FastAPI service for running and evaluating Terminal-Bench 2.0 tasks. It uses the Terminus 2 agent to execute each task inside a Docker container, copies the task's tests/ into /tests in that same container, runs the task-provided /tests/test.sh, and determines pass or fail from /logs/verifier/reward.txt or /logs/verifier/reward.json.
```
Terminal-Bench-server/
├── terminal_bench_service.py   # FastAPI main service
├── task_loader.py              # Task data loader
├── utils.py                    # Docker client and image cache
├── .env                        # Environment configuration
├── .env.example                # Example environment configuration
├── agents/
│   ├── terminal_bench_agent.py # mini-swe-agent runner (deprecated)
│   └── terminus_2_agent.py     # Terminus 2 agent runner
├── evaluators/
│   └── terminal_bench.py       # In-place verifier (runs /tests/test.sh and reads the reward)
├── data/
│   └── terminal-bench-2-main/  # Task data directory
│       ├── build_images.sh     # Script to build all task images
│       ├── chess-best-move/
│       │   ├── instruction.md
│       │   ├── task.toml
│       │   └── tests/
│       └── ...
├── requirements.txt
├── Dockerfile
└── README.md
```
Python Environment

- Python 3.11+
- Docker daemon is running
Task Data Is Available

- Make sure `Terminal-Bench-server/data/terminal-bench-2-main/` exists
- It should contain each task's `instruction.md`, `task.toml`, `tests/`, and related files
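A quick way to sanity-check that layout (paths from above; the helper name is illustrative, not part of the service):

```python
from pathlib import Path

def incomplete_tasks(root: str) -> list[str]:
    """Return the names of task directories under `root` that are missing
    instruction.md, task.toml, or tests/."""
    missing = []
    for task in sorted(Path(root).iterdir()):
        if not task.is_dir():
            continue  # skip files such as build_images.sh
        ok = ((task / "instruction.md").is_file()
              and (task / "task.toml").is_file()
              and (task / "tests").is_dir())
        if not ok:
            missing.append(task.name)
    return missing
```

An empty return value means every task directory has the three required entries.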
```bash
cd Terminal-Bench-server

# Install dependencies
pip install -r requirements.txt

# Copy the config file
cp .env.example .env

# Edit .env with your settings
```

Edit `.env`:
```env
# Docker image configuration
# If DOCKER_REGISTRY is set, images from task.toml will be mapped to this registry first
# Example: registry.h.pjlab.org.cn/ailab-opencompass-opencompass_proxy/terminal_bench_2
# If not set, the service will prefer docker_image from task.toml directly
# (for example: alexgshaw/task-name:version)
DOCKER_REGISTRY=

# Whether to pull the image if it is not available locally
# If set to false and the image is missing locally, the service will fail and ask you to build it manually
PULL_IF_NOT_EXISTS=true

# Network mode for task containers
# Common values: bridge, host, none, <custom-network-name>
CONTAINER_NETWORK_MODE=bridge

# Task data path (optional, defaults to ./data/terminal-bench-2-main)
TB2_DIR=

# Service configuration
LOG_LEVEL=INFO
THREAD_POOL_MAX_WORKERS=4
TIMEOUT_KEEP_ALIVE=5
```

Important Notes:
- LLM configuration (API key, model, base URL) is passed through the AgentCompass API request via `llm_config`; you do not need to put LLM settings in environment variables.
- Images must be built manually with `build_images.sh`; the service does not build images automatically.
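A minimal sketch of reading these variables (the key names come from the `.env` above; the parsing helper is illustrative, not the service's actual code):

```python
import os

def env_bool(name: str, default: bool) -> bool:
    """Parse a boolean-ish environment variable ("true"/"false", case-insensitive)."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# Names match the .env keys; defaults mirror the documented fallbacks.
DOCKER_REGISTRY = os.getenv("DOCKER_REGISTRY", "").strip()
PULL_IF_NOT_EXISTS = env_bool("PULL_IF_NOT_EXISTS", True)
CONTAINER_NETWORK_MODE = os.getenv("CONTAINER_NETWORK_MODE", "bridge")
TB2_DIR = os.getenv("TB2_DIR", "").strip() or "./data/terminal-bench-2-main"
```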
The service currently resolves images in the following order:

1. If `DOCKER_REGISTRY` is set and the task's `task.toml` contains `environment.docker_image`:
   - First try the mapped private image: `DOCKER_REGISTRY:<task-name>-<version>`
     (example: `registry.h.pjlab.org.cn/.../terminal_bench_2:cancel-async-tasks-20251031`)
   - If that fails, fall back to the original image from `task.toml` (example: `alexgshaw/cancel-async-tasks:20251031`)
2. If `DOCKER_REGISTRY` is not set and `task.toml` contains `environment.docker_image`:
   - Use the full image name from `task.toml` directly
3. If the task does not provide `environment.docker_image`:
   - Fall back to the default registry: `registry.h.pjlab.org.cn/ailab-opencompass-opencompass_proxy/terminal_bench_2:<task_name>`
4. If the candidate image does not exist locally:
   - Pull it if `PULL_IF_NOT_EXISTS=true`
   - Fail immediately if `PULL_IF_NOT_EXISTS=false`
The current evaluation logic matches Harbor:

1. Start the task container and run the agent inside it.
2. After the agent finishes, keep that container instead of starting a separate evaluation container.
3. Copy the task's `tests/` directory into `/tests` inside the same container.
4. Run `/tests/test.sh` in that same container.
5. Read `/logs/verifier/reward.txt` or `/logs/verifier/reward.json` from the same container.
6. Mark the task as passed if `reward >= 1`; otherwise mark it as failed.
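The final pass/fail decision can be sketched like this (reward.txt holding a bare number follows from the rule above; the `"reward"` JSON key for reward.json is an assumption):

```python
import json

def reward_from_text(text: str) -> float:
    """Parse a reward from reward.txt (a bare number) or reward.json
    (assumed here to carry the number under a "reward" key)."""
    text = text.strip()
    try:
        return float(text)            # reward.txt: bare number
    except ValueError:
        data = json.loads(text)       # reward.json
        return float(data["reward"])  # key name is an assumption

def task_passed(text: str) -> bool:
    # Documented rule: passed iff reward >= 1.
    return reward_from_text(text) >= 1
```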
```bash
cd Terminal-Bench-server/data/terminal-bench-2-main

# Build all task images (the version suffix is read automatically from task.toml)
bash build_images.sh
```

Built images use this format: `registry.h.pjlab.org.cn/.../terminal_bench_2:<task-name>-<version>`
```bash
python terminal_bench_service.py --host 0.0.0.0 --port 8080
```

You can also override the uvicorn keep-alive timeout explicitly if needed:

```bash
python terminal_bench_service.py --host 0.0.0.0 --port 8080 --timeout-keep-alive 5
```

```bash
# Build the image
docker build -t terminal-bench-server .

# Run the container (the Docker socket mount is required)
docker run -d \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --env-file .env \
  terminal-bench-server
```

```bash
curl http://localhost:8080/health
```

Response:

```json
{
  "status": "healthy",
  "service": "Terminal-Bench"
}
```

```bash
curl -X POST http://localhost:8080/api/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "params": {
      "metadata": {
        "task_name": "chess-best-move",
        "problem_statement": "The file chess_board.png has an image..."
      }
    },
    "llm_config": {
      "model_name": "gpt-4",
      "api_key": "your-api-key",
      "url": "https://api.openai.com/v1",
      "model_infer_params": {
        "temperature": 0.0,
        "top_p": 1.0
      }
    },
    "max_steps": 100
  }'
```

```bash
curl -X POST http://localhost:8080/api/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "params": {
      "metadata": {
        "task_name": "chess-best-move"
      }
    },
    "llm_config": {
      "model_name": "gpt-4",
      "api_key": "your-api-key"
    }
  }'
```

Response:

```json
{
  "final_answer": "True",
  "trajectory": [
    {
      "step": 1,
      "role": "user",
      "content": "The file chess_board.png..."
    },
    {
      "step": 2,
      "role": "assistant",
      "content": "I'll analyze the chess board..."
    },
    ...
    {
      "step": 15,
      "action": "evaluation",
      "output": "{'resolved': True, 'output': '...'}"
    }
  ],
  "call_stat": {
    "model": "gpt-4",
    "api_calls": 12,
    "total_cost": 0.45
  }
}
```

Terminal-Bench-server is designed to integrate directly with AgentCompass.
```bash
curl -X POST "http://localhost:8001/api/tasks/batch" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark": "terminal_bench_2",
    "models": ["glm-4.7"],
    "params": {
      "benchmark_params": {
        "resume": true,
        "category": "all",
        "max_concurrency": 6,
        "k": 1,
        "avgk": true,
        "service_url": "http://localhost:8080/api/tasks",
        "request_timeout": 7200,
        "max_steps": 250,
        "limit": 0
      },
      "model_infer_params": {
        "temperature": 1
      }
    }
  }'
```

The current terminal_bench_2 adapter in AgentCompass sends only one minimal task identifier:
`params.task_id`

For compatibility with older requests, the service also accepts these sources:

- `params.metadata.task_name`
- `params.metadata.instance_id`
- `params.metadata.terminal_bench_task_id`
- `params.metadata.task_id`
- The corresponding params-level fields with the same names
The following example is closer to the current real request shape:
```json
{
  "params": {
    "task_id": "chess-best-move"
  },
  "llm_config": {
    "model_name": "gpt-4",
    "api_key": "...",
    "url": "..."
  },
  "max_steps": 100
}
```

- `final_answer`: `"True"` means the task passed, `"False"` means it failed
- `trajectory`: the full execution trace, including agent steps and the final verifier output
- `call_stat`: API call statistics such as number of calls and cost
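Putting the request and response shapes above together, a minimal client-side sketch (model name, key, and URL are placeholders; these helpers are illustrative, not part of the service):

```python
def build_request(task_id: str, model_name: str, api_key: str,
                  url: str, max_steps: int = 100) -> dict:
    """Assemble a request body in the documented shape for POST /api/tasks."""
    return {
        "params": {"task_id": task_id},
        "llm_config": {"model_name": model_name, "api_key": api_key, "url": url},
        "max_steps": max_steps,
    }

def task_resolved(response: dict) -> bool:
    """Interpret final_answer: "True" means passed, "False" means failed."""
    return response.get("final_answer") == "True"
```

POST the dict as JSON to `/api/tasks` (for example with `requests.post(..., json=body)`) and feed the decoded response body to `task_resolved`.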