Step-by-step workflow for running agent benchmarks against this repo.
To view the chart below, paste it into mermaid.live, open this file in VS Code with the Markdown Preview Mermaid Support extension, or push it to GitHub, which renders Mermaid natively.
flowchart TD
START([🚀 Start Here]) --> PREREQ
subgraph PREREQ["① Prerequisites"]
P1[Docker running\n≥ 8GB memory allocated]
P2[Python 3.8+]
P3[API Keys ready\nHF_TOKEN + LLM provider key]
P1 --- P2 --- P3
end
PREREQ --> INSTALL
subgraph INSTALL["② Install Python Dependencies"]
I1["python3 -m venv .venv && source .venv/bin/activate"]
I2["pip install -e '.[init]'"]
I3["pip install -r requirements_benchmark.txt"]
I1 --> I2 --> I3
end
INSTALL --> DATA
subgraph DATA["③ Download Benchmark Data ~30 GB"]
D1["export HF_TOKEN=hf_..."]
D2["make download"]
D1 --> D2
end
DATA --> DOCKER_CHOICE{Build or Pull?}
DOCKER_CHOICE -->|Build locally| BUILD
DOCKER_CHOICE -->|Use pre-built image| PULL
subgraph BUILD["④a Build Image"]
B1["make build"]
end
subgraph PULL["④b Pull from Docker Hub"]
PL1["make pull"]
end
BUILD --> START_CONTAINERS
PULL --> START_CONTAINERS
subgraph START_CONTAINERS["⑤ Start 4 Capability Containers"]
SC1["make start\nor: docker compose up -d"]
SC2["Wait ~60 seconds for startup"]
SC3{"docker ps shows\n4 containers?"}
SC1 --> SC2 --> SC3
SC3 -->|No| TROUBLESHOOT([📖 Check docs/debugging.md])
SC3 -->|Yes| VALIDATED
end
VALIDATED["✅ Containers Ready"]
VALIDATED --> EXPLORE_CHOICE{New to the benchmark?}
EXPLORE_CHOICE -->|Yes, explore first| EXPLORE
EXPLORE_CHOICE -->|No, go straight to running| AGENT_CHOICE
subgraph EXPLORE["⑥ Optional: Explore Tools"]
E1["List tools for a domain\npython examples/quick_start_mcp_tools/list_tools.py\n--capability-id 2 --domain hockey"]
E2["Invoke a tool manually\npython examples/quick_start_mcp_tools/invoke_tool.py\n--capability-id 2 --domain hockey --tool get_teams"]
E3["Test raw MCP connection\npython examples/quick_start_benchmark/simple_docker.py\n--capability-id 2 --domain hockey"]
E1 --> E2 --> E3
end
EXPLORE --> AGENT_CHOICE
AGENT_CHOICE{Using built-in\nor custom agent?}
AGENT_CHOICE -->|Built-in LangGraph ReAct| BUILTIN
AGENT_CHOICE -->|Custom agent| CUSTOM
subgraph BUILTIN["⑦a Use Built-in Agent"]
BI1["Smoke test — 1 sample\npython benchmark_runner.py\n--capability_id 1\n--domain california_schools\n--max-samples-per-domain 1\n--provider openai"]
BI2["Full run\npython benchmark_runner.py\n--capability_id 1 2 3 4\n--provider openai --model gpt-4o"]
BI1 --> BI2
end
subgraph CUSTOM["⑦b Plug In Custom Agent"]
CA1["Copy example template\nexamples/quick_start_benchmark/run_benchmark.py"]
CA2["Implement the agent block\nasync with session: ...\n tools = await session.list_tools()\n answer = await your_agent(query, tools, session)"]
CA3["Run your custom runner\npython your_runner.py --capability 2"]
CA1 --> CA2 --> CA3
end
BUILTIN --> RESULTS
CUSTOM --> RESULTS
subgraph RESULTS["⑧ Analyze Results"]
R1["Output written to\noutput/capability_N_timestamp/domain.json"]
R2["Each record contains:\n• answer • tool_calls\n• trajectory • status / error • duration_s"]
R1 --> R2
end
RESULTS --> DONE([🏁 Done!])
style START fill:#4CAF50,color:#fff
style DONE fill:#4CAF50,color:#fff
style VALIDATED fill:#2196F3,color:#fff
style TROUBLESHOOT fill:#FF5722,color:#fff
style PREREQ fill:#FFF9C4
style INSTALL fill:#FFF9C4
style DATA fill:#FFF9C4
style BUILD fill:#E8F5E9
style PULL fill:#E8F5E9
style START_CONTAINERS fill:#E3F2FD
style EXPLORE fill:#F3E5F5
style BUILTIN fill:#E0F7FA
style CUSTOM fill:#E0F7FA
style RESULTS fill:#FBE9E7
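Step ⑤ in the chart (start, wait about 60 seconds, confirm that `docker ps` shows 4 containers) can be automated with a small polling helper. The following is a minimal sketch, not part of the repo: the function names, timeout, and poll interval are hypothetical, and it counts every running container on the host, so it assumes no unrelated containers are running.

```python
import subprocess
import time

EXPECTED_CONTAINERS = 4  # step 5 starts four capability containers


def count_containers(ps_output: str) -> int:
    """Count non-empty lines; `docker ps --format` prints one name per line."""
    return sum(1 for line in ps_output.splitlines() if line.strip())


def wait_for_containers(expected: int = EXPECTED_CONTAINERS,
                        timeout_s: float = 120.0,
                        poll_s: float = 5.0) -> bool:
    """Poll `docker ps` until `expected` containers run or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["docker", "ps", "--format", "{{.Names}}"],
            capture_output=True, text=True, check=True,
        ).stdout
        if count_containers(out) >= expected:
            return True
        time.sleep(poll_s)
    return False
```

Calling `wait_for_containers()` right after `make start` returns `True` once four containers are up; on `False`, docs/debugging.md is the next stop, as in the chart.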
| Step | Action | Command |
|---|---|---|
| ① | Prerequisites | Docker with ≥ 8 GB memory allocated, Python 3.8+, API keys (HF_TOKEN + LLM provider key) |
| ② | Install deps | pip install -e '.[init]' && pip install -r requirements_benchmark.txt |
| ③ | Download data | export HF_TOKEN=hf_... && make download |
| ④ | Get Docker image | make build or make pull |
| ⑤ | Start containers | make start → wait ~60 s → docker ps (expect 4 containers) |
| ⑥ | (Optional) Explore | list_tools.py, invoke_tool.py, simple_docker.py |
| ⑦ | Run benchmark | benchmark_runner.py or your custom runner |
| ⑧ | Analyze results | output/capability_N_timestamp/domain.json |
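For step ⑧, a short script can give a first look at a run. This sketch assumes each `domain.json` file holds a JSON list of records with the fields named in the chart (`status`, `duration_s`, and so on); the exact file layout and the possible `status` values are not documented here, so treat both as assumptions and adjust to the real output. The helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from statistics import mean


def summarize(records: list) -> dict:
    """Aggregate status counts and mean duration over result records."""
    statuses = Counter(r.get("status", "unknown") for r in records)
    durations = [r["duration_s"] for r in records if "duration_s" in r]
    return {
        "n": len(records),
        "by_status": dict(statuses),
        "mean_duration_s": mean(durations) if durations else None,
    }


def summarize_file(path: str) -> dict:
    """Load one domain file (assumed to be a JSON list) and summarize it."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return summarize(records)
```

`summarize_file` takes the path of a single domain file under the run's output directory; looping over all `*.json` files in that directory gives a per-domain overview.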
| Capability | What it tests | Key domains |
|---|---|---|
| 1 | Tool selection & slot filling | california_schools, hockey |
| 2 | SQL query construction via REST | hockey, address |
| 3 | Multi-hop (BPO + SQL routing) | address |
| 4 | Multi-turn with semantic search | any |
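The capability table can be turned into a quick smoke-test plan: one single-sample run per capability. The sketch below only builds command strings from flags that appear elsewhere in this file (`--capability_id`, `--domain`, `--max-samples-per-domain`, `--provider`). The `KEY_DOMAINS` mapping and the function name are hypothetical, and omitting `--domain` for capability 4 is an assumption based on its "any" entry.

```python
import shlex

# Capability -> key domains, copied from the table above.
# Capability 4 lists "any", so no --domain flag is passed for it.
KEY_DOMAINS = {
    1: ["california_schools", "hockey"],
    2: ["hockey", "address"],
    3: ["address"],
    4: [],
}


def smoke_test_commands(provider: str = "openai") -> list:
    """Build one single-sample benchmark_runner.py command per capability."""
    commands = []
    for capability, domains in KEY_DOMAINS.items():
        argv = ["python", "benchmark_runner.py",
                "--capability_id", str(capability)]
        if domains:
            argv += ["--domain", domains[0]]  # first key domain only
        argv += ["--max-samples-per-domain", "1", "--provider", provider]
        commands.append(shlex.join(argv))
    return commands
```

Printing the returned list gives four commands that can be pasted into a shell once the containers are running.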
# Smoke test — 1 sample, 1 domain
python benchmark_runner.py --capability_id 1 --domain california_schools --max-samples-per-domain 1 --provider openai
# Full run across all capabilities
python benchmark_runner.py --capability_id 1 2 3 4 --provider openai --model gpt-4o
# Run capabilities in parallel
python benchmark_runner.py --capability_id 2 4 --parallel
# Use Anthropic instead
python benchmark_runner.py --capability_id 2 --provider anthropic --model claude-sonnet-4-5-20250929
# Enable top-k tool shortlisting
python benchmark_runner.py --capability_id 2 --top-k-tools 10
# Custom output directory
python benchmark_runner.py --capability_id 2 --output my_results/

Copy examples/quick_start_benchmark/run_benchmark.py and replace the agent placeholder:
async with stdio_client(params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        tools = (await session.list_tools()).tools
        # Replace this block with your agent
        answer = await your_agent(query=item.query, tools=tools, session=session)

Your agent receives:
- item.query: the natural language question
- tools: list of MCP tool definitions
- session: live MCP session; call session.call_tool(name, args) to invoke tools
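To make the three inputs concrete, here is a deliberately trivial `your_agent` that could stand in for the placeholder while wiring things up. It is a sketch, not the repo's agent: it picks the first tool whose name is mentioned in the query and calls it with empty arguments, which real tools may reject. `FakeSession` is a hypothetical stand-in for offline testing, and tool definitions are assumed to expose a `.name` attribute.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, Sequence


async def your_agent(query: str, tools: Sequence[Any], session: Any) -> str:
    """Toy agent: call the first tool whose name is mentioned in the query."""
    text = query.lower()
    for tool in tools:
        # "get_teams" matches a query containing "get teams".
        if tool.name.lower().replace("_", " ") in text:
            result = await session.call_tool(tool.name, {})
            return str(result)
    return "no matching tool"


class FakeSession:
    """Hypothetical stand-in for a live MCP session (offline testing only)."""

    def __init__(self) -> None:
        self.calls = []

    async def call_tool(self, name: str, args: dict) -> str:
        self.calls.append((name, args))
        return f"called {name}"


def demo() -> str:
    """Run the toy agent against the fake session."""
    tools = [SimpleNamespace(name="get_teams")]
    return asyncio.run(your_agent("Please get teams", tools, FakeSession()))
```

A real agent would replace the name-matching heuristic with an LLM call that chooses a tool and fills its arguments, but the calling convention stays the same.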
| Option | How |
|---|---|
| Online (easiest) | Paste the Mermaid block into mermaid.live |
| VS Code | Install Markdown Preview Mermaid Support, then open the preview (Cmd+Shift+V on macOS, Ctrl+Shift+V on Windows/Linux) |
| GitHub | Push this file — GitHub renders Mermaid natively in .md files |
| JetBrains IDEs | Install the Mermaid plugin |