# Toolathlon Task Runner — Minimal Working Example

A self-contained, minimal example for running Toolathlon benchmark tasks end-to-end using the Klavis AI Sandbox API and the OpenAI Agents SDK.

## Table of Contents
- Agent Framework & Customization
- How It Works
- Architecture Diagram
- Prerequisites
- Installation
- Quick Start
- Parallel Execution
- Task Directory Structure
- Key Concepts
- Credential Injection & Network/File Hijacking
- Why This Is Needed
- The Five Credential Injection & Adaptation Strategies
- Strategy 1: Environment Variables (Direct)
- Strategy 2: Network Hijack (Socket Monkeypatch)
- Strategy 3: File Hijack (open/stat Monkeypatch)
- How `_build_subprocess_env` Ties It Together
- Implementing This Yourself
- Special Case: Handling Notion Tasks
- Timezone Handling (UTC Requirement)
- Project Structure
This repository uses the OpenAI Agents SDK as its default agent orchestration layer for convenience. Nothing in the benchmark depends on this SDK — it is a reference implementation that you should replace with your own agent infrastructure.
To swap in your own framework, replace the agent execution step (step 5 in How It Works): substitute the Runner.run() / Agent() calls with your own agent loop. Any framework that can call MCP tools over Streamable HTTP will work (LangChain, LlamaIndex, AutoGen, custom implementations, etc.). All other pipeline components — sandbox lifecycle, credential injection, evaluation — are framework-agnostic and remain unchanged.
The runner automates the full lifecycle of a Toolathlon benchmark task:
- Load the task definition (prompt, system prompt, required MCP servers) from disk.
- Acquire remote sandbox environments from Klavis AI — each providing MCP tool servers (filesystem, terminal, git, etc.).
⚠️ IMPORTANT — Log MCP Server URLs: After acquiring all sandboxes, always log/print every MCP server URL returned by the Klavis API (for both Local and Individual sandboxes). These URLs are essential for debugging connectivity issues, tool call failures, and sandbox lifecycle problems. Include them in your run logs so they can be referenced later.
- Preprocess — run any task-specific setup scripts to prepare the initial workspace (needs sandbox auth credentials from step 2).
- Upload the initial workspace tarball to the remote sandbox.
- Run the agent — an LLM-powered agent uses MCP tools via the sandbox to complete the task.
- Download the resulting workspace from the sandbox.
- Evaluate — run the task's evaluation script to check correctness (PASS/FAIL).
- Cleanup — release all sandbox resources.
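The eight steps above can be condensed into one driver function. The sketch below is hypothetical (the injected callables `load`, `acquire`, `preprocess`, etc. stand in for the runner's real components); it shows the try/finally shape that guarantees cleanup even when an earlier step fails:

```python
from dataclasses import dataclass, field

@dataclass
class RunResult:
    task: str
    passed: bool = False

def run_task(task_name, *, load, acquire, preprocess, upload,
             run_agent, download, evaluate, release):
    """Drive the full task lifecycle. Every step is an injected callable,
    so any sandbox or agent backend can be plugged in. `release` runs
    even if a step raises (mirrors step 8, Cleanup)."""
    result = RunResult(task_name)
    try:
        cfg = load(task_name)                  # 1. task definition
        auth_env = acquire(cfg)                # 2. sandboxes + auth_data
        ws = preprocess(cfg, auth_env)         # 3. task-specific setup
        upload(ws)                             # 4. push workspace tarball
        run_agent(cfg)                         # 5. LLM agent via MCP tools
        out = download()                       # 6. pull resulting workspace
        result.passed = evaluate(cfg, out, auth_env)  # 7. PASS/FAIL
    finally:
        release()                              # 8. always clean up
    return result
```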
┌─────────────────────────────────────────────────────────────────────────┐
│ Your Machine (Local) │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ toolathlon_task_run_example.py (main process) │ │
│ │ │ │
│ │ 1. load_task() ─── reads task config & prompts │ │
│ │ │ │
│ │ 2. KlavisSandbox ─── acquires sandboxes, extracts auth_data │ │
│ │ acquire_for_ ┌──────────────────────────────────────┐ │ │
│ │ servers() │ auth_env dict (built from auth_data) │ │ │
│ │ │ │ │ │
│ │ │ • KLAVIS_GITHUB_TOKEN, ... │ │ │
│ │ │ • HIJACK_IMAP_HOST/PORT, ... │ │ │
│ │ │ • HIJACK_GOOGLE_CREDENTIALS_PATH, .. │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ 3. run_preprocess() ─── subprocess ── (auth_env injected) │ │
│ │ ┌───────────────────────────────────────────────────────────┐ │ │
│ │ │ SUBPROCESS: preprocess/main.py │ │ │
│ │ │ │ │ │
│ │ │ _hijack/sitecustomize.py auto-loaded via PYTHONPATH │ │ │
│ │ │ (only when HIJACK_* env vars are present) │ │ │
│ │ │ • socket.getaddrinfo patched → redirect IMAP/SMTP │ │ │
│ │ │ • requests/aiohttp patched → rewrite Canvas URLs │ │ │
│ │ │ • builtins.open/os.stat patched → redirect cred files │ │ │
│ │ └───────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ 4. upload_workspace() ─── sends tar.gz to sandbox │ │
│ │ 5. Agent + Runner ─── LLM ↔ Klavis MCP (direct HTTP, no hijack) │ |
│ │ 6. download_workspace() ─── retrieves results │ │
│ │ │ │
│ │ 7. evaluate() ─── subprocess (auth_env injected) │ │
│ │ ┌───────────────────────────────────────────────────────────┐ │ │
│ │ │ SUBPROCESS: evaluation/main.py │ │ │
│ │ │ │ │ │
│ │ │ _hijack/sitecustomize.py auto-loaded via PYTHONPATH │ │ │
│ │ │ (same mechanism as preprocess — only when needed) │ │ │
│ │ │ • socket.getaddrinfo patched → redirect IMAP/SMTP │ │ │
│ │ │ • requests/aiohttp patched → rewrite Canvas URLs │ │ │
│ │ │ • builtins.open/os.stat patched → redirect cred files │ │ │
│ │ └───────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ 8. release_all() ─── cleanup sandboxes + temp files │ │
│ └──────────┬──────────────────────┬─────────────────────────────────┘ │
│ │ │ │
│ LLM API calls MCP tool calls │
│ │ │ │
└─────────────┼──────────────────────┼────────────────────────────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌─────────────────────────────────────┐
│ LLM Provider │ │ Klavis AI Sandbox Cloud │
│ │ │ │
│ Claude / GPT / │ │ ┌──────────────────────────────┐ │
│ any litellm │ │ │ Local Sandbox (shared VM) │ │
│ supported model │ │ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │ │ │ fs │ │ term │ │ git │ │ │
└──────────────────┘ │ │ │ MCP │ │ MCP │ │ MCP │ │ │
│ │ └──────┘ └──────┘ └──────┘ │ │
│ │ /data workspace │ │
│ └──────────────────────────────┘ │
│ │
│ ┌──────────────────────────────┐ │
│ │ Individual Sandboxes │ │
│ │ ┌────────┐ ┌────────────┐ │ │
│ │ │ github │ │ woocommerce│ │ │
│ │ │ MCP │ │ MCP │ │ │
│ │ └────────┘ └────────────┘ │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────┘
Data flow during preprocess/evaluate subprocesses (hijack active here):
subprocess (preprocess/eval script)
│
│ Python auto-imports _hijack/sitecustomize.py via PYTHONPATH
│
├── imaplib.IMAP4("localhost", 1143)
│ └── socket.getaddrinfo("localhost", 1143)
│ └── [HIJACKED] → resolves to remote Klavis email IP:port
│
├── aiohttp.get("http://localhost:10001/api/v1/courses")
│ └── [URL HIJACKED] → rewritten to https://{canvas_domain}/api/v1/courses
│
├── open("configs/google_credentials.json")
│ └── [HIJACKED] → opens /tmp/google_creds_XXXX.json instead
│
└── os.environ.get("KLAVIS_GITHUB_TOKEN")
└── [NO HIJACK] → reads env var directly (set by auth_env)
Credential flow (preprocess & evaluation scripts):
Klavis API ──acquire──▶ auth_data (tokens, IPs, keys)
│
┌─────────┴──────────────┐
▼ ▼
Simple values Complex values
(tokens, URLs) (private keys, JSON creds)
│ │
▼ ▼
Set as env vars Write to temp files,
(KLAVIS_GITHUB_TOKEN, set HIJACK_* env vars
KLAVIS_WOOCOMMERCE_*) pointing to temp files
│ │
└────────┬───────────────┘
▼
_build_subprocess_env()
┌────────────────────────┐
│ Merges auth_env into │
│ subprocess environment │
│ │
│ If HIJACK_* vars set: │
│ prepend _hijack/ to │
│ PYTHONPATH │
└────────┬───────────────┘
▼
subprocess.run(preprocess/eval)
┌────────────────────────────┐
│ Python auto-imports │
│ _hijack/sitecustomize.py │
│ │
│ • socket.getaddrinfo │
│ patched → redirects │
│ localhost IMAP/SMTP │
│ │
│ • builtins.open / os.stat │
│ patched → redirects │
│ credential file reads │
└────────────────────────────┘
| Requirement | Details |
|---|---|
| Python | 3.10+ |
| Klavis API Key | Sign up at klavis.ai to get a KLAVIS_API_KEY |
| LLM API Key | At least one of: ANTHROPIC_API_KEY, OPENAI_API_KEY (depending on which model you use) |
# Clone this repo
git clone <repo-url>
cd Toolathlon-mvp
# Install dependencies
pip install -r requirements.txt

Create a `.env` file in the project root (or export directly):
# .env
KLAVIS_API_KEY=your_klavis_api_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here # if using Claude models
OPENAI_API_KEY=your_openai_key_here       # if using GPT models

# Single task (output to terminal)
python toolathlon_task_run_example.py --task tasks/finalpool/arrange-workspace
# Multiple tasks in parallel (each task logs to logs/<run>/)
python toolathlon_task_run_example.py --tasks tasks/finalpool/arrange-workspace tasks/finalpool/git-repo
# All 75 supported tasks (default when no --task/--tasks given)
python toolathlon_task_run_example.py --parallel 5

When using --tasks or --all, each task runs as a separate subprocess with isolated stdout/stderr. An asyncio.Semaphore bounds concurrency to --parallel (default 10).
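The concurrency bound can be sketched in a few lines (the `run_many` helper and `worker` coroutine are illustrative, not the runner's actual code):

```python
import asyncio

async def run_many(tasks, worker, parallel=10):
    """Run worker(task) coroutines with at most `parallel` in flight,
    mirroring the asyncio.Semaphore bound behind --parallel."""
    sem = asyncio.Semaphore(parallel)

    async def bounded(task):
        async with sem:
            return await worker(task)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(t) for t in tasks))
```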
Each parallel run creates a timestamped directory under logs/:
logs/run_20260224_155517/
├── arrange-workspace.log # Full stdout/stderr per task
├── git-repo.log
└── summary.json # Machine-readable results (pass/fail/error counts)
A coloured summary table is also printed to the terminal.
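To aggregate several runs programmatically, something like the following works. Note that the `{"tasks": [{"name": ..., "status": ...}]}` shape used here is an assumption for illustration; check the keys of your actual summary.json first:

```python
import json
from collections import Counter
from pathlib import Path

def tally_run(run_dir: str) -> Counter:
    """Count per-status results from a run directory's summary.json.
    ASSUMPTION: summary.json holds {"tasks": [{"name": ..., "status": ...}]}."""
    data = json.loads((Path(run_dir) / "summary.json").read_text())
    return Counter(entry["status"] for entry in data["tasks"])
```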
Each task lives under tasks/finalpool/<task-name>/ and follows this canonical layout:
tasks/finalpool/<task-name>/
├── task_config.json # Required MCP servers & metadata
├── docs/
│ ├── task.md # Natural-language task prompt (sent to the agent)
│ ├── agent_system_prompt.md # System prompt template (!!<<<<||||workspace_dir||||>>>>!! → /data)
│ ├── task_cn.md # (Optional) Chinese translation
│ └── agent_system_prompt_cn.md # (Optional) Chinese system prompt
├── initial_workspace/ # Starting files uploaded to the sandbox
├── preprocess/
│ └── main.py # (Optional) Setup script run before upload
├── evaluation/
├── evaluation/
│   └── main.py                   # Evaluation entry point
└── groundtruth_workspace/        # (Optional) Expected outputs for evaluation
| File | Purpose |
|---|---|
| `task_config.json` | Declares `needed_mcp_servers` (list of server names the task requires) and optional metadata |
| `docs/task.md` | The exact prompt the agent receives as its input |
| `docs/agent_system_prompt.md` | System instructions for the agent. The placeholder `!!<<<<||||workspace_dir||||>>>>!!` is replaced with `/data` at runtime |
| `preprocess/main.py` | Optional script executed locally before upload. Receives `--agent_workspace <tmpdir>` and can modify the workspace |
| `evaluation/main.py` | CLI wrapper: `--agent_workspace <path> [--groundtruth_workspace <path>]` |
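The placeholder substitution is plain string replacement; a minimal sketch (the helper name is ours, not the runner's):

```python
WORKSPACE_PLACEHOLDER = "!!<<<<||||workspace_dir||||>>>>!!"

def render_system_prompt(template: str, workspace_dir: str = "/data") -> str:
    """Replace every occurrence of the workspace placeholder."""
    return template.replace(WORKSPACE_PLACEHOLDER, workspace_dir)
```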
{
"needed_mcp_servers": ["filesystem", "terminal", "pdf-tools", "excel"],
"needed_local_tools": ["claim_done", "python_execute"],
"meta": {}
}

The Klavis API provides two types of sandboxes (both expose MCP servers over Streamable HTTP — see below):
| Type | API Endpoint | Description |
|---|---|---|
| Local Sandbox | `POST /local-sandbox` | A shared VM with multiple MCP servers (filesystem, terminal, git, excel, etc.). Has a unified `/data` workspace. Files can be uploaded/downloaded. |
| Individual Sandbox | `POST /sandbox/{server_name}` | A standalone MCP server instance for external services (github, woocommerce, snowflake, etc.). No shared filesystem. May return auth credentials. |
The runner automatically classifies each required server:
# These servers run inside a Local Sandbox (shared VM)
LOCAL_SANDBOX_SERVERS = {
"filesystem", "git", "terminal", "desktop-commander",
"arxiv", "excel", "word", "powerpoint",
"code-executor", "code-runner", "pdf-tools",
"google_cloud", "poste_email_toolathlon", "canvas",
"playwright",
}
# Everything else → Individual Sandbox

Important for implementers and coding agents: All MCP servers returned by the Klavis Sandbox API use the Streamable HTTP transport (`MCPServerStreamableHttp` in the OpenAI Agents SDK).
This means:
- Every tool call is a stateless HTTP request. Each `tools/list` or `tools/call` is an independent HTTP POST to the server URL — similar to a REST API call. There is no persistent connection or long-lived session between calls.
- No keep-alive or SSE streaming session is needed. You do not need to maintain a WebSocket, hold open an SSE connection, or implement reconnection logic. Simply send a request and receive a response.
- No session state on the transport layer. The server URL you receive from Klavis is self-contained. You can call it from any HTTP client at any time during the sandbox's lifetime.
Sample usage with the MCP Python SDK — each snippet is a standalone, stateless call:
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Each call opens a short-lived HTTP request — no persistent connection needed
async with streamablehttp_client(mcp_url) as (r, w, _):
    async with ClientSession(r, w) as session:
        await session.initialize()
        tools = await session.list_tools()  # or session.call_tool(...)

Email MCP server — passing credentials via custom header:

The email server requires an `x-email-config` header (base64-encoded JSON with email, password, name). Pass it via the `headers` parameter:

import base64, json
from datetime import timedelta

email_cfg = {"email": "user@mcp.com", "password": "secret", "name": "User Name"}
headers = {"x-email-config": base64.b64encode(json.dumps(email_cfg).encode()).decode()}

async with streamablehttp_client(mcp_url, headers=headers, timeout=timedelta(seconds=120)) as (r, w, _):
    async with ClientSession(r, w) as session:
        await session.initialize()
        result = await session.call_tool("check_connection", {})

Note: `auth_data` from the sandbox acquire response only contains connection info (IP, SMTP/IMAP ports) — the email credentials above are user-specified.
Other MCP servers requiring custom headers:
Like the email server above, several other MCP servers require custom headers passed the same way:
| Server | Header | Value source | Description |
|---|---|---|---|
| canvas | `x-canvas-api-token` | `token_key_session.py` (`canvas_api_token`) | Per-task Canvas API token (student/teacher persona). Required — the Canvas MCP server rejects requests without it (`canvas_domain` comes from Klavis). |
| google_sheet | `x-sheets-folder-id` | `files/folder_id.txt` (generated by preprocess) | Tells the Sheets server which Google Drive folder to operate on |
| google-cloud | `x-google-cloud-allowed-buckets` | `token_key_session.py` | Allowed GCS buckets |
| google-cloud | `x-google-cloud-allowed-bigquery-datasets` | `token_key_session.py` | Allowed BigQuery datasets |
| google-cloud | `x-google-cloud-allowed-log-buckets` | `token_key_session.py` | Allowed log buckets |
| google-cloud | `x-google-cloud-allowed-instances` | `token_key_session.py` | Allowed compute instances |
These headers are set after preprocessing (since some values are generated by preprocess scripts) and passed via the headers param, just like the email example:
# Google Sheets — folder ID from preprocess output
MCPServerStreamableHttp(params={"url": sheets_url, "headers": {"x-sheets-folder-id": folder_id}})
# Google Cloud — allowed resource scopes from token_key_session.py
MCPServerStreamableHttp(params={"url": gcloud_url, "headers": {
"x-google-cloud-allowed-buckets": "my-bucket",
"x-google-cloud-allowed-bigquery-datasets": "project.dataset",
}})

Some task-defined server names differ from the Klavis API names. Two mapping dicts handle this:
| Dict | Purpose | Example |
|---|---|---|
| `TASK_TO_LOCAL_SANDBOX_NAME` | Task server name → Klavis local sandbox name | `"pptx"` → `"powerpoint"`, `"playwright_with_chunk"` → `"playwright"` |
| `TASK_SERVER_TO_SANDBOX_NAME` | Task server name → Klavis individual sandbox name | `"google_sheet"` → `"google_sheets"` |
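Resolution can be sketched as a small lookup chain; the dict contents below are illustrative subsets of the real mappings, and the lookup order (local alias first, then local-set membership, then individual alias) is how we read the runner's behavior:

```python
# Illustrative subsets of the runner's mapping tables
TASK_TO_LOCAL_SANDBOX_NAME = {"pptx": "powerpoint", "playwright_with_chunk": "playwright"}
TASK_SERVER_TO_SANDBOX_NAME = {"google_sheet": "google_sheets"}
LOCAL_SANDBOX_SERVERS = {"filesystem", "terminal", "git", "powerpoint", "playwright"}

def resolve(task_server: str):
    """Map a task-declared server name to (kind, klavis_name)."""
    local = TASK_TO_LOCAL_SANDBOX_NAME.get(task_server, task_server)
    if local in LOCAL_SANDBOX_SERVERS:
        return ("local", local)
    return ("individual", TASK_SERVER_TO_SANDBOX_NAME.get(task_server, task_server))
```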
Some tasks need additional sandbox servers only for their preprocess or evaluation scripts — the agent itself never uses them. These are listed in EXTRA_PREPROCESS_EVAL_SERVERS:
EXTRA_PREPROCESS_EVAL_SERVERS = {
"fillout-online-forms": ["google_forms"],
}

These servers are acquired alongside the task's needed_mcp_servers so credentials are available to preprocess/eval scripts, but they are not exposed to the agent as MCP tools.
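A sketch of that merge (hypothetical helper name; the dedup-preserving-order detail is our choice):

```python
EXTRA_PREPROCESS_EVAL_SERVERS = {
    "fillout-online-forms": ["google_forms"],
}

def servers_to_acquire(task_name, needed_mcp_servers):
    """Agent-facing servers plus preprocess/eval-only extras.
    Extras get credentials acquired but are never exposed as agent tools."""
    extras = EXTRA_PREPROCESS_EVAL_SERVERS.get(task_name, [])
    return list(needed_mcp_servers) + [s for s in extras if s not in needed_mcp_servers]
```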
These are changes made to task scripts in tasks/finalpool/ compared to the upstream Toolathlon/tasks/finalpool/:
| Task(s) | Script | Change |
|---|---|---|
| 8 WooCommerce tasks (filter-low-selling-products, inventory-sync, update-material-inventory, woocommerce-customer-survey, woocommerce-new-product, woocommerce-new-welcome, woocommerce-product-recall, woocommerce-stock-alert) | `token_key_session.py` | Hardcoded WooCommerce credentials replaced with `os.environ.get("KLAVIS_WOOCOMMERCE_*", <original_fallback>)` for `api_key`, `api_secret`, `site_url`. |
| woocommerce-update-cover | `token_key_session.py` | Same as above, plus `admin_username` / `admin_password` also read from `KLAVIS_WOOCOMMERCE_ADMIN_*` env vars. |
| course-assistant | `preprocess/send_email.py` | Changed `use_auth=False` → `use_auth=True` for `LocalEmailSender` (required by Klavis email server). |
| fillout-online-forms | `preprocess/main.py` | Makes the created Google Form publicly accessible via Drive API (`permissions().create()` with `type: anyone`, `role: reader`). |
| meeting-assign | `evaluation/main.py` | Kills lingering process on port 30137 after the email check, preventing it from blocking subsequent runs. |
This section is critical for understanding how the runner works. Without proper credential injection, preprocess and evaluation scripts cannot authenticate to external services (email, GitHub, Google Sheets, WooCommerce, Snowflake, etc.). The system uses five distinct strategies depending on what the service expects.
Toolathlon tasks were originally designed for a monolithic environment where:

- Email scripts connect to `localhost:1143` (IMAP) and `localhost:1587` (SMTP) — hardcoded addresses.
- Canvas scripts make HTTP requests to `http://localhost:10001` — a hardcoded URL.
- Google Sheets/Forms scripts read credentials from `configs/google_credentials.json` — a hardcoded file path.
- GCP scripts read service-account keys from `configs/gcp-service_account.keys.json` — another hardcoded file path.
- GitHub/WooCommerce/Snowflake scripts read tokens via `from token_key_session import all_token_key_session` — which reads from env vars.
In the Klavis sandbox setup, each service runs remotely with dynamically assigned addresses and credentials. The runner must bridge this gap without modifying the original task scripts. It does this through five injection and adaptation strategies:
| Strategy | What it solves | Mechanism | Example servers |
|---|---|---|---|
| 1. Env vars | Scripts read tokens/keys from `os.environ` | Set `KLAVIS_*` env vars before launching subprocess | github, woocommerce, snowflake |
| 2. Network hijack | Scripts connect to hardcoded localhost ports (IMAP/SMTP) | Monkeypatch `socket.getaddrinfo()` to redirect to remote IPs | email (IMAP/SMTP) |
| 3. URL hijack | Scripts make HTTP requests to hardcoded localhost URLs | Monkeypatch requests/aiohttp to rewrite URLs to external Canvas ingress | canvas (HTTP API) |
| 4. File hijack | Scripts read credentials from hardcoded file paths | Monkeypatch `builtins.open()` / `os.stat()` to redirect to temp files | google_sheets, google_forms, google_cloud |
| 5. MCP tool call adaptation | MCP tool expects a local file path the remote server can't access | Read file locally, pass contents as `json_string` instead of `import_path` | email (import_emails) |
When used: For services where the task's preprocess/eval scripts read credentials via os.environ.get("KLAVIS_*") or via configs/token_key_session.py (which itself reads from env vars).
How it works:

1. The runner calls `KlavisSandbox.acquire()` for a server (e.g., woocommerce).
2. The Klavis API returns `auth_data` containing keys/tokens.
3. `_apply_sandbox_auth()` maps each `auth_data` field to a `KLAVIS_*` env var using `SANDBOX_AUTH_ENV_MAPPING`:
SANDBOX_AUTH_ENV_MAPPING = {
"woocommerce": {
"consumer_key": "KLAVIS_WOOCOMMERCE_CONSUMER_KEY",
"consumer_secret": "KLAVIS_WOOCOMMERCE_CONSUMER_SECRET",
"site_url": "KLAVIS_WOOCOMMERCE_SITE_URL",
"admin_username": "KLAVIS_WOOCOMMERCE_ADMIN_USERNAME",
"admin_password": "KLAVIS_WOOCOMMERCE_ADMIN_PASSWORD",
},
"github": {
"access_token": "KLAVIS_GITHUB_TOKEN",
},
"snowflake": {
"SNOWFLAKE_ACCOUNT": "KLAVIS_SNOWFLAKE_ACCOUNT",
"SNOWFLAKE_WAREHOUSE": "KLAVIS_SNOWFLAKE_WAREHOUSE",
# ... (all Snowflake fields)
},
}

4. These env vars are stored in `klavis.auth_env` and passed to subprocesses via `_build_subprocess_env()`.
5. Task-level `token_key_session.py` (or `configs/token_key_session.py`) reads them:
# configs/token_key_session.py
all_token_key_session = Dict(
github_token=os.environ.get("KLAVIS_GITHUB_TOKEN", ""),
# ...
)

Special case — Snowflake private key: The Snowflake private key is a multi-line PEM string. It is written to a temp file, and KLAVIS_SNOWFLAKE_PRIVATE_KEY_PATH is set to point to that file. This is still env-var-based (no hijack needed) — the scripts read the path from the env var.
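A minimal sketch of that special case (hypothetical helper name; the real runner manages the temp file lifecycle itself):

```python
import tempfile

def stage_private_key(pem_text: str, env: dict) -> str:
    """Write a multi-line PEM to a temp file and expose its path via an
    env var, so scripts that expect a key *path* keep working unmodified."""
    tf = tempfile.NamedTemporaryFile(mode="w", suffix=".pem", delete=False)
    tf.write(pem_text)
    tf.close()
    env["KLAVIS_SNOWFLAKE_PRIVATE_KEY_PATH"] = tf.name
    return tf.name
```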
No hijack is involved. The _hijack/sitecustomize.py module is not activated for these servers.
When used: For the email server (poste_email_toolathlon), where task scripts hardcode localhost:1143 for IMAP and localhost:1587 for SMTP.
The problem: The email MCP server runs remotely in the Klavis cloud. Its auth_data returns the real IMAP/SMTP server IPs and ports. But preprocess/eval scripts connect to localhost:1143/localhost:1587 — hardcoded, not configurable.
How it works:
1. When acquiring the local sandbox, the runner extracts email auth_data:

       # From acquire_for_servers():
       if sname == "poste_email_toolathlon":
           auth = server.get("auth_data") or {}
           self.auth_env["HIJACK_IMAP_HOST"] = str(auth["imap_server"])
           self.auth_env["HIJACK_IMAP_PORT"] = str(auth.get("imap_port", 1143))
           self.auth_env["HIJACK_SMTP_HOST"] = str(auth["smtp_server"])
           self.auth_env["HIJACK_SMTP_PORT"] = str(auth.get("smtp_port", 1587))

2. `_build_subprocess_env()` detects HIJACK_IMAP_HOST or HIJACK_SMTP_HOST in the env → prepends `_hijack/` to PYTHONPATH.

3. Python auto-imports `_hijack/sitecustomize.py` at subprocess startup (Python's built-in sitecustomize mechanism).

4. `sitecustomize.py` patches `socket.getaddrinfo()`:

       _REDIRECT_MAP = {
           1143: ("HIJACK_IMAP_HOST", "HIJACK_IMAP_PORT"),  # IMAP
           1587: ("HIJACK_SMTP_HOST", "HIJACK_SMTP_PORT"),  # SMTP
       }

       def _hijacked_getaddrinfo(host, port, ...):
           if host in {"localhost", "127.0.0.1", "::1"} and port in _REDIRECT_MAP:
               new_host = os.environ.get(host_env)
               new_port = os.environ.get(port_env)
               if new_host and new_port:
                   host, port = new_host, int(new_port)
           return _orig_getaddrinfo(host, port, ...)

5. When the task script calls `imaplib.IMAP4("localhost", 1143)`, Python internally calls `socket.getaddrinfo("localhost", 1143)` → the patch intercepts it and resolves to the real remote IP:port instead.
Result: The task script thinks it's connecting to localhost:1143, but actually connects to the Klavis email server. No script modification needed.
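The mechanism is easy to verify in isolation. This is a simplified, self-contained version of the patch (not the exact sitecustomize.py code):

```python
import os
import socket

_orig_getaddrinfo = socket.getaddrinfo
_REDIRECT_MAP = {
    1143: ("HIJACK_IMAP_HOST", "HIJACK_IMAP_PORT"),  # IMAP
    1587: ("HIJACK_SMTP_HOST", "HIJACK_SMTP_PORT"),  # SMTP
}

def _hijacked_getaddrinfo(host, port, *args, **kwargs):
    # Redirect only the hardcoded localhost IMAP/SMTP targets
    if host in {"localhost", "127.0.0.1", "::1"} and port in _REDIRECT_MAP:
        host_env, port_env = _REDIRECT_MAP[port]
        new_host, new_port = os.environ.get(host_env), os.environ.get(port_env)
        if new_host and new_port:
            host, port = new_host, int(new_port)
    return _orig_getaddrinfo(host, port, *args, **kwargs)

socket.getaddrinfo = _hijacked_getaddrinfo
```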
When used: For Canvas LMS (canvas), where task preprocess/eval scripts hardcode http://localhost:10001 (and occasionally localhost:20001) as the Canvas API base URL.
The problem: Unlike email (raw TCP sockets), Canvas scripts make HTTP requests to a full URL. The Canvas sandbox provides canvas_domain (e.g., <ingress-host>/{pod_id}) — an external ingress with a path prefix. A socket-level host:port redirect can't inject a path prefix or switch protocol to HTTPS.
How it works:
1. The runner extracts canvas_domain from the Canvas sandbox auth_data and constructs the external base URL:

       if sname == "canvas":
           auth = server.get("auth_data") or {}
           canvas_domain = auth.get("canvas_domain", "")
           if canvas_domain:
               self.auth_env["HIJACK_CANVAS_BASE_URL"] = f"https://{canvas_domain}"

2. `sitecustomize.py` patches `requests.Session.request()` and `aiohttp.ClientSession._request()` to rewrite any URL starting with `http://localhost:10001` or `http://localhost:20001` to the target base URL:

       # http://localhost:10001/api/v1/courses
       # → https://<ingress-host>/{pod_id}/api/v1/courses

3. SSL verification is disabled for the redirected requests (the Canvas ingress uses a self-signed certificate).
Result: requests.get("http://localhost:10001/api/v1/courses") is transparently rewritten to https://<ingress-host>/{pod_id}/api/v1/courses. No task script changes needed.
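The rewrite itself is a pure string transform; here is just that core as a sketch (the real patch wires it into requests and aiohttp):

```python
import os

_LOCAL_BASES = ("http://localhost:10001", "http://localhost:20001")

def rewrite_canvas_url(url: str) -> str:
    """Swap a hardcoded localhost Canvas base for HIJACK_CANVAS_BASE_URL,
    preserving the request path. Non-matching URLs pass through untouched."""
    base = os.environ.get("HIJACK_CANVAS_BASE_URL")
    if base:
        for local in _LOCAL_BASES:
            if url.startswith(local):
                return base + url[len(local):]
    return url
```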
When used: For Google Sheets, Google Forms, and Google Cloud (GCP), where task scripts read credentials from hardcoded file paths.
The problem:
- Google Sheets/Forms scripts open `configs/google_credentials.json` to read OAuth tokens.
- GCP scripts open `configs/gcp-service_account.keys.json` to read service-account keys.
These files don't exist locally — the credentials come dynamically from Klavis auth_data.
How it works:
1. The runner writes credentials to a temp file:

       # For Google Sheets/Forms (OAuth credentials):
       tf = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".json")
       json.dump(creds, tf)  # {token, refresh_token, client_id, client_secret, ...}
       self.auth_env["HIJACK_GOOGLE_CREDENTIALS_PATH"] = tf.name

       # For GCP (service-account key):
       tf = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".json")
       json.dump(auth, tf)  # Full GCP service-account JSON
       self.auth_env["HIJACK_GCP_SERVICE_ACCOUNT_PATH"] = tf.name

2. `_build_subprocess_env()` detects HIJACK_GOOGLE_CREDENTIALS_PATH or HIJACK_GCP_SERVICE_ACCOUNT_PATH → prepends `_hijack/` to PYTHONPATH.

3. `sitecustomize.py` patches `builtins.open()`, `io.open()`, `os.stat()`, and `pathlib.Path.stat()`:

       _FILE_REDIRECT_SUFFIXES = {
           "configs/google_credentials.json": "HIJACK_GOOGLE_CREDENTIALS_PATH",
           "configs/gcp-service_account.keys.json": "HIJACK_GCP_SERVICE_ACCOUNT_PATH",
       }

       def _hijacked_open(file, *args, **kwargs):
           for suffix, env_var in _FILE_REDIRECT_SUFFIXES.items():
               if str(file).endswith(suffix):
                   file = os.environ.get(env_var)  # redirect to temp file
           return _orig_open(file, *args, **kwargs)

4. When a task script calls `open("configs/google_credentials.json")`, the patch intercepts it and opens the temp file containing the real credentials instead. The `os.stat` / `Path.stat` patches ensure `Path.exists()` checks also succeed.
Result: The task script reads its hardcoded credential path and gets real Klavis credentials. No script modification needed.
Strategies 1–4 converge in _build_subprocess_env(), which builds the environment dict for child processes (strategy 5 lives in the MCP tool-call layer):
def _build_subprocess_env(auth_env=None):
    env = os.environ.copy()
    pythonpath_parts = [str(PROJECT_ROOT)]
    if auth_env:
        env.update(auth_env)  # Inject all KLAVIS_* and HIJACK_* vars
    # If any HIJACK_* vars are present, prepend _hijack/ to PYTHONPATH
    # so sitecustomize.py is auto-loaded and applies monkeypatches
    if (env.get("HIJACK_IMAP_HOST")
            or env.get("HIJACK_SMTP_HOST")
            or env.get("HIJACK_CANVAS_BASE_URL")
            or env.get("HIJACK_GOOGLE_CREDENTIALS_PATH")
            or env.get("HIJACK_GCP_SERVICE_ACCOUNT_PATH")):
        pythonpath_parts.insert(0, SOCKET_HIJACK_DIR)  # _hijack/ directory
    env["PYTHONPATH"] = os.pathsep.join(pythonpath_parts)
    return env

Key points:
- HIJACK_* vars absent → `_hijack/sitecustomize.py` is never loaded; no monkeypatching occurs. Only KLAVIS_* env vars are used.
- HIJACK_* vars present → `_hijack/` is prepended to PYTHONPATH, so Python auto-imports `sitecustomize.py` at startup, activating the relevant patches.
- Each subprocess gets its own env → parallel sandbox sessions won't conflict.
If you are building your own Toolathlon runner (or adapting this code), here is what you need:
1. For env-var-based auth (github, woocommerce, snowflake):
   - Acquire the sandbox, read auth_data, store values as KLAVIS_* env vars.
   - Pass them to subprocess environments when running preprocess/eval scripts.
   - Ensure `configs/token_key_session.py` reads from `os.environ.get("KLAVIS_*")`.

2. For network hijack (email):
   - Store the remote IMAP/SMTP host:port from auth_data as HIJACK_IMAP_HOST, HIJACK_IMAP_PORT, HIJACK_SMTP_HOST, HIJACK_SMTP_PORT.
   - Include `_hijack/` in PYTHONPATH for the subprocess.
   - The `_hijack/sitecustomize.py` module will auto-patch `socket.getaddrinfo()`.

3. For URL hijack (canvas):
   - Read canvas_domain from auth_data and set `HIJACK_CANVAS_BASE_URL=https://{canvas_domain}`. This rewrites `http://localhost:10001/...` and `http://localhost:20001/...` to the external Canvas ingress URL (which includes a path prefix for pod routing).
   - Include `_hijack/` in PYTHONPATH for the subprocess.
   - `sitecustomize.py` patches `requests.Session.request()` and `aiohttp.ClientSession._request()` to rewrite matching URLs. SSL verification is disabled for these redirected requests.
   - The Canvas MCP server also requires an `x-canvas-api-token` header (per-task user token from `token_key_session.py`). The canvas_domain is resolved from `x-auth-data` (set by ingress from DB).

4. For file hijack (Google Sheets/Forms/Cloud):
   - Write the auth_data JSON to a temp file.
   - Set HIJACK_GOOGLE_CREDENTIALS_PATH or HIJACK_GCP_SERVICE_ACCOUNT_PATH to the temp file path.
   - Include `_hijack/` in PYTHONPATH for the subprocess.
   - The `_hijack/sitecustomize.py` module will auto-patch `open()`, `os.stat()`, and `pathlib.Path.stat()`.
   - Clean up temp files after the task completes (`cleanup_temp_files()`).

5. For MCP tool call adaptation (email import):
   - The `import_emails` tool expects `import_path`, but the preprocess runs locally and the email MCP server is remote — it can't access local files.
   - `call_tool_with_retry()` in `utils/mcp/tool_servers.py` automatically reads the file locally and passes contents as `json_string` instead of `import_path`. Transparent to callers.
Note: Strategies 2–4 only affect local subprocess scripts (preprocess, evaluation). The agent's MCP tool calls go directly to Klavis server URLs over HTTP and don't need hijacking. Strategy 5 also only applies to local scripts calling remote MCP servers.
Toolathlon Notion tasks use a unique preprocessing approach: instead of initializing mock data from scratch, they duplicate and move existing Notion pages. In this klavis-toolathlon runner, this is handled seamlessly:

- Pre-configured Accounts: The Klavis Sandbox backend already manages the Notion account setup, integration keys, and page URLs. You do not need to perform manual Notion setup; these values are dynamically injected for use in `configs/token_key_session.py`.
- Official MCP Token Extraction & Auto-Refresh: When the runner acquires the notion sandbox from Klavis, it extracts the Notion access token from the returned auth data and sets it as KLAVIS_NOTION_OFFICIAL_MCP_ACCESS_TOKEN (`toolathlon_task_run_example.py`). The Klavis backend also runs an OAuth server that refreshes tokens automatically, so the Notion access token is always up to date when you acquire a sandbox.
- Direct Official Connection: Using this token, the runner registers a notion_official MCP server pointing directly to https://mcp.notion.com/mcp (`utils/mcp/tool_servers.py`). This allows the local preprocess scripts to connect to the official Notion MCP to duplicate and arrange the required pages.
Some eval scripts (e.g., academic-warning) format datetime.now() with a hardcoded +00:00 offset, assuming the host is UTC. On non-UTC machines this produces incorrect Cloud Logging time filters, causing evals to miss recently written logs.
The runner fixes this by generating launch_time with datetime.now(timezone.utc) and setting env["TZ"] = "UTC" for all preprocess/eval subprocesses.
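A minimal illustration of the fix (the helper name and exact format string are ours, not necessarily the runner's):

```python
from datetime import datetime, timezone

def launch_time() -> str:
    """Anchor the launch timestamp to UTC explicitly, so a hardcoded
    "+00:00" suffix stays correct regardless of the host timezone."""
    # datetime.now() with no argument would use the host's local zone;
    # datetime.now(timezone.utc) is correct everywhere.
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
```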
Toolathlon-mvp/
├── toolathlon_task_run_example.py # Main entry point — single & parallel task runner
├── requirements.txt # Python dependencies
├── task_status_in_klavis_sandbox.md # Task support status reference
├── .env # Your API keys (create this)
├── logs/ # Per-run log directories (auto-created)
│ └── run_YYYYMMDD_HHMMSS/ # Each parallel run gets a timestamped dir
│ ├── <task-name>.log # Full stdout/stderr per task
│ └── summary.json # Machine-readable pass/fail results
├── _hijack/ # ⚡ Network, URL & file-open hijack module
│ └── sitecustomize.py # Auto-loaded via PYTHONPATH; patches
│ # socket.getaddrinfo() (email redirect),
│ # requests/aiohttp (Canvas URL rewrite),
│ # and builtins.open() (credential file redirect)
│ # to Klavis sandbox endpoints. See:
│ # "Credential Injection & Network/File Hijacking"
├── configs/
│ └── token_key_session.py # Reads KLAVIS_* env vars → all_token_key_session dict
├── tasks/
│ └── finalpool/ # All 108 Toolathlon benchmark tasks
│ ├── arrange-workspace/
│ ├── courses-ta-hws/
│ ├── inventory-sync/
│ └── ...
└── utils/ # Shared utilities
├── general/helper.py # normalize_str, read_json, write_json, etc.
├── app_specific/github/ # GitHub API helpers (used by preprocess/eval)
├── app_specific/huggingface/ # HuggingFace dataset helpers
└── data_processing/process_ops.py # File copy/duplication utilities