This document describes the server-side architecture of the benchmark — one section per task, each covering its container, the services running inside it, the MCP layer the benchmark client connects to, and a full end-to-end diagram.
+------------------------------------------+
| LLM Provider API |
| OpenAI | Anthropic | Ollama | LiteLLM |
+-------------------+----------------------+
| HTTPS (chat completions)
= = = = = = = = = = = = = = = = = = = = + = = = = = = = HOST = = = = = = = = = = = = =
|
+─────────────────────────+──────────────────────────+
| benchmark_runner.py |
| |
| · Reads domain questions from data/ |
| · Runs a ReAct agent loop (LangGraph) |
| · Calls LLM API above for reasoning |
| · Calls MCP tools below for data access |
| · Scores answers → output/{task}/{domain}.json |
+─────────────────────────+──────────────────────────+
|
docker exec -i, MCP stdio
CAPABILITY_ID=N, MCP_DOMAIN=<domain>
|
= = = = = = = = = = = = = = = = = = = = + = = = CONTAINERS = = = = = = = = = = = = = =
|
+──────────────────────────────────+──────────────────────────────────────+
| image: m3_environ |
| |
| +──────────────+ +──────────────+ +──────────────+ +─────────────────+ |
| | cap_1_bi_apis | | cap_2_dashboard | | cap_3_multihop | | cap_4_multiturn | |
| | Sel / Slot | | M3 REST | | BPO + M3 | | M3 REST + | |
| | MCP server | | MCP server | | REST router | | Retriever | |
| +──────┬───────+ +──────┬───────+ +──────┬───────+ +────┬───────┬────+ |
| | | | | | |
| +────────────────+────────────────+--------------+ | |
| | |
| mcp_dispatch.py (os.execv → per-task MCP server) | |
| | | |
| FastAPI :8000 — M3 REST API (all tasks) FastAPI :8001 | |
| | Retriever API | |
| | (capability_4 only) | |
| +────────────+──────────────────+ +────────────────────+| |
| | SQLite /app/db/ | | ChromaDB + |
| | 60+ domain databases | | 62 collections |
| | (capabilities 1, 2, 3, 4) | | (capability_4 only) |
| +───────────────────────────────+ +─────────────────────────────+
+─────────────────────────────────────────────────────────────────────────+
All tasks share a single Docker image (m3_environ) built from
docker/Dockerfile.unified. Four named containers
are started from this image, one per task. The image bundles every server
component; which pieces are active depends on the container's entrypoint and
the docker exec command used to start the MCP server.
The benchmark runner never opens a network socket to a container. Instead it runs:
docker exec -i -e CAPABILITY_ID=<N> -e MCP_DOMAIN=<domain> <container> python /app/mcp_dispatch.py
That spawns a short-lived MCP server process inside the container. The process
speaks the MCP stdio protocol back over
stdin/stdout. The benchmark client (benchmark/mcp_client.py) wraps this as
an MCP ClientSession and calls list_tools() / call_tool() as normal.
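As a rough illustration of the command shown above, the sketch below assembles the same argument vector in Python. The helper name `build_docker_exec_argv` is hypothetical and is not taken from benchmark/mcp_client.py; only the flags, environment variables, and script path come from this document.

```python
from typing import List


def build_docker_exec_argv(container: str, capability_id: int, domain: str) -> List[str]:
    """Return the argv for the documented `docker exec` invocation.

    Hypothetical helper: the real client may assemble the command differently.
    """
    return [
        "docker", "exec", "-i",                    # -i keeps stdin open for MCP stdio
        "-e", f"CAPABILITY_ID={capability_id}",    # selects the per-task server
        "-e", f"MCP_DOMAIN={domain}",              # scopes tools to one domain
        container,
        "python", "/app/mcp_dispatch.py",
    ]


argv = build_docker_exec_argv("capability_2_dashboard_apis_m3_environ", 2, "address")
print(" ".join(argv))
```

A list of arguments (rather than a single shell string) avoids quoting problems and can be handed to a stdio-based MCP client transport as a command plus its arguments.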
All tasks share a single MCP entrypoint: /app/mcp_dispatch.py. It
reads the CAPABILITY_ID environment variable and calls os.execv() to replace
itself with the correct server process. There is no proxy layer; the stdio
pipe connects directly to the target server after exec.
| CAPABILITY_ID | Exec target |
|---|---|
| 1 | python -m apis.m3.python_tools.mcp |
| 2 | python /app/m3-rest/mcp_server.py |
| 3 | python /app/environment/bpo/mcp/bpo_router.py |
| 4 | python /app/retrievers/capability_4_mcp_server.py |
The original server scripts remain intact and are still tested directly by
docker/smoke_test.sh. The dispatcher is just a thin routing shim.
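A minimal sketch of such a routing shim, assuming the CAPABILITY_ID values and exec targets listed in this section. The names `EXEC_TARGETS`, `resolve_argv`, and `dispatch` are illustrative, and the error handling for unknown IDs is an assumption rather than the behaviour of the real docker/mcp_dispatch.py.

```python
import os
import sys
from typing import Dict, List, Mapping

# Exec targets as documented in this section; the dict itself is illustrative.
EXEC_TARGETS: Dict[str, List[str]] = {
    "1": ["-m", "apis.m3.python_tools.mcp"],
    "2": ["/app/m3-rest/mcp_server.py"],
    "3": ["/app/environment/bpo/mcp/bpo_router.py"],
    "4": ["/app/retrievers/capability_4_mcp_server.py"],
}


def resolve_argv(env: Mapping[str, str]) -> List[str]:
    """Map CAPABILITY_ID to the interpreter argv of the target server."""
    capability_id = env.get("CAPABILITY_ID", "")
    if capability_id not in EXEC_TARGETS:
        # Assumed behaviour: fail fast on an unknown or missing ID.
        raise SystemExit(f"unknown CAPABILITY_ID: {capability_id!r}")
    return [sys.executable] + EXEC_TARGETS[capability_id]


def dispatch() -> None:
    """Replace the current process with the target server (never returns)."""
    argv = resolve_argv(os.environ)
    os.execv(argv[0], argv)
```

Because os.execv() replaces the process image instead of spawning a child, the stdin/stdout pipe opened by docker exec is inherited by the target server unchanged, which is why no proxy layer is needed.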
When a container first starts,
docker/entrypoint-unified.sh launches the
long-lived FastAPI background services and waits for them to become healthy
before declaring the container ready:
| Port | Service | Health check |
|---|---|---|
| 8000 | M3 REST FastAPI | GET /openapi.json |
| 8001 | Retriever FastAPI | GET /health (skipped if no chroma_data/) |
Both services stay up indefinitely and handle all subsequent docker exec
invocations for that container's lifetime.
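The wait-for-healthy step can be pictured as a simple polling loop. The real entrypoint is a shell script, so the Python below only illustrates the logic; the function names, the default timeout and interval, and the localhost URLs in the usage note are assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not healthy yet.
        return False


def wait_until_healthy(probe: Callable[[], bool],
                       timeout: float = 300.0,
                       interval: float = 1.0) -> bool:
    """Call `probe` repeatedly until it succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Under these assumptions, a readiness check for the M3 REST service would look like `wait_until_healthy(lambda: http_ok("http://localhost:8000/openapi.json"))`, and the Retriever check would use `/health` on port 8001.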
The agent must identify the correct tool and fill its parameter slots from a natural-language query. A single MCP entry point exposes either a generic slot-filling toolset or a specialised selection toolset (with dynamically generated column getters), switching between them at runtime based on which "tool universe" the agent selects for a given query.
| What | Detail |
|---|---|
| FastAPI services | M3 REST on port 8000 (started but not used by MCP) |
| MCP entry point | python -m apis.m3.python_tools.mcp |
| Entry CLI | apis/m3/python_tools/mcp/cli.py |
| Server factory | apis/m3/python_tools/mcp/mcp_server.py — create_server() |
| Active server class | RouterMCPServer (selected via MCP_SERVER_TYPE=router) |
cli.py reads environment variables and calls create_server(), which
instantiates a RouterMCPServer when MCP_SERVER_TYPE=router.
At startup the router:
- Reads mcp_tool_universe_id_mapping.yaml and filters the universe list to entries whose domain field matches MCP_DOMAIN.
- Initialises both SlotFillingTools and SelectionTools in memory.
- Loads the first universe's data from SQLite (/app/db/{domain}/{domain}.sqlite) into active_data.
- Defaults to the slot-filling toolset as the active tool set.
The agent interacts with the server through two mechanisms:
- list_tools(): returns whichever toolset is currently active.
- call_tool("get_data", tool_universe_id=<uid>): switches active_data to the requested universe; if the universe's YAML entry contains server_type: selection, the router also switches the active toolset to SelectionTools (regenerating column-level getter functions from the new schema). A subsequent list_tools() then returns the selection toolset.
No HTTP calls are made during tool execution — all data access is direct Python reads from the pre-loaded SQLite database.
| Toolset | Active when | Tools exposed |
|---|---|---|
| Slot-filling | universe server_type absent or slot_filling | filter_data, sort_data, aggregate_data, transform_data, retrieve_data, Calculator, concatenate_data, select_unique_values, peek_fcn, get_data |
| Selection | universe server_type: selection | 8 × select_data_* filters, 2 × sort_data_*, 8 × compute_data_* aggregates, truncate, 3 × transform_data_to_*, Calculator, concatenate_data, select_unique_values, dynamically generated get_{column} getters, get_data |
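The toolset switching described in this section can be sketched as a small state machine. `RouterSketch`, its constructor arguments, and the abbreviated tool lists are hypothetical stand-ins; the real RouterMCPServer also loads data from SQLite, which is omitted here.

```python
from typing import Dict, List

# Abbreviated tool lists; the full sets are documented in this section.
SLOT_FILLING_TOOLS = ["filter_data", "sort_data", "get_data"]
SELECTION_TOOLS = ["truncate", "Calculator", "get_data"]


class RouterSketch:
    """Illustrative model of the active-toolset switch (not the real class)."""

    def __init__(self, universes: Dict[str, dict]) -> None:
        # universes maps a universe id to {"server_type": ..., "columns": [...]}.
        self.universes = universes
        self.active_toolset = "slot_filling"   # documented default
        self.getters: List[str] = []

    def list_tools(self) -> List[str]:
        """Return the tool names of whichever toolset is active."""
        if self.active_toolset == "selection":
            return SELECTION_TOOLS + self.getters
        return list(SLOT_FILLING_TOOLS)

    def call_tool(self, name: str, **kwargs) -> str:
        if name != "get_data":
            return f"called {name} on {self.active_toolset}"
        universe = self.universes[kwargs["tool_universe_id"]]
        if universe.get("server_type") == "selection":
            # Regenerate column getters from the new universe's schema.
            self.getters = [f"get_{col}" for col in universe["columns"]]
            self.active_toolset = "selection"
        else:
            self.getters = []
            self.active_toolset = "slot_filling"
        return "ok"


router = RouterSketch({
    "u1": {"columns": ["city"]},
    "u2": {"server_type": "selection", "columns": ["city", "zip"]},
})
before = router.list_tools()
router.call_tool("get_data", tool_universe_id="u2")
after = router.list_tools()
print(before, after)
```

The point of the sketch is that list_tools() is stateful: a client has to call it again after get_data, because the tool list may have changed.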
Benchmark Client (host)
│
│ docker exec -i
│ -e CAPABILITY_ID=1
│ -e MCP_DOMAIN=<domain>
│ -e MCP_SERVER_TYPE=router
│ -e MCP_DB_ROOT=/app/db
│ capability_1_bi_apis_m3_environ
│ python /app/mcp_dispatch.py
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ host / container boundary ─ ─ ─ ─
▼
┌────────────────────────────────────────────────────────────┐
│ capability_1_bi_apis_m3_environ │
│ │
│ mcp_dispatch.py (CAPABILITY_ID=1) │
│ └─ os.execv → python -m apis.m3.python_tools.mcp │
│ │
│ cli.py → create_server() → RouterMCPServer (stdio) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ RouterMCPServer │ │
│ │ │ │
│ │ startup: │ │
│ │ • load mcp_tool_universe_id_mapping.yaml │ │
│ │ • filter universes to MCP_DOMAIN │ │
│ │ • init SlotFillingTools + SelectionTools │ │
│ │ • load active_data from SQLite (first universe) │ │
│ │ • active toolset = slot_filling │ │
│ │ │ │
│ │ list_tools() │ │
│ │ → current active toolset tools │ │
│ │ │ │
│ │ call_tool("get_data", tool_universe_id=<uid>) │ │
│ │ → reload active_data from SQLite │ │
│ │ → if universe server_type == "selection": │ │
│ │ regenerate get_{col} getters from schema │ │
│ │ switch active toolset → SelectionTools │ │
│ │ → else: active toolset → SlotFillingTools │ │
│ │ │ │
│ │ call_tool(<other>, ...) │ │
│ │ → direct Python call on active toolset │ │
│ └─────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌───────────┴──────────────┐ │
│ ▼ ▼ │
│ mcp_tool_universe_id_mapping.yaml SQLite │
│ /app/apis/configs/ /app/db/{domain}/ │
│ (universe IDs, init_args, {domain}.sqlite │
│ server_type per query) │
└────────────────────────────────────────────────────────────┘
The agent answers questions by calling SQL-backed REST endpoints for a single domain. Tools are auto-discovered from the FastAPI OpenAPI spec so the tool list reflects the live schema without any manual registration.
| What | Detail |
|---|---|
| FastAPI services | M3 REST on port 8000 |
| MCP entry point | python /app/m3-rest/mcp_server.py |
| MCP source | apis/m3/rest/mcp_server.py |
FastAPIMCPServer fetches /openapi.json from the M3 REST FastAPI on startup,
filters routes to those under /v1/{MCP_DOMAIN}/, and converts each route to
an MCP Tool. On call_tool, it reconstructs the HTTP request (path params,
query params, request body) and proxies it to FastAPI, which executes the SQL
query against the domain's SQLite database.
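The route-to-tool conversion can be approximated as a filter over the `paths` object of an OpenAPI document. This is a sketch only: `tools_from_openapi` and the returned dict shape are hypothetical, and using operationId as the tool name is assumed here for illustration.

```python
from typing import Dict, List

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def tools_from_openapi(spec: dict, domain: str) -> List[Dict[str, str]]:
    """Keep only operations under /v1/{domain}/ and describe each as a tool."""
    prefix = f"/v1/{domain}/"
    tools = []
    for path, operations in spec.get("paths", {}).items():
        if not path.startswith(prefix):
            continue
        for method, op in operations.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            tools.append({
                "name": op.get("operationId", f"{method}_{path}"),
                "method": method.upper(),
                "path": path,
                "description": op.get("summary", ""),
            })
    return tools


# Hand-written stand-in for the document served at /openapi.json.
spec = {
    "paths": {
        "/v1/address/streets": {"get": {"operationId": "get_address_streets"}},
        "/v1/airline/flights": {"get": {"operationId": "get_airline_flights"}},
        "/v1/address/streets/filter": {
            "post": {"operationId": "filter_address_streets"},
            "parameters": [],
        },
    }
}
tools = tools_from_openapi(spec, "address")
print([t["name"] for t in tools])
```

Deriving the tool list from the live spec means a new REST route becomes a tool on the next server start without any registration step.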
Benchmark Client (host)
│
│ docker exec -i -e CAPABILITY_ID=2 -e MCP_DOMAIN=address
│ capability_2_dashboard_apis_m3_environ
│ python /app/mcp_dispatch.py
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ host / container boundary ─ ─ ─ ─
▼
┌──────────────────────────────────────────────────┐
│ capability_2_dashboard_apis_m3_environ │
│ │
│ mcp_dispatch.py (CAPABILITY_ID=2) │
│ └─ os.execv → python /app/m3-rest/mcp_server.py│
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ FastAPIMCPServer (stdio) │ │
│ │ │ │
│ │ • list_tools() → GET :8000/openapi.json │ │
│ │ • call_tool() → HTTP → FastAPI │ │
│ └──────────────────┬─────────────────────────┘ │
│ │ HTTP :8000 │
│ ▼ │
│ M3 REST FastAPI (uvicorn, port 8000) │
│ /v1/{domain}/* routes (≈40 per domain) │
│ │ │
│ ▼ │
│ SQLite /app/db/{domain}/*.sqlite │
└──────────────────────────────────────────────────┘
GET /v1/{domain}/{resource} → SQL SELECT all
GET /v1/{domain}/{resource}/{id} → SQL SELECT by PK
POST /v1/{domain}/{resource}/filter → SQL WHERE clause
... (~40 routes per domain, 45+ domains)
Source: apis/m3/rest/app.py, per-domain routers in
apis/m3/rest/server/.
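For the call_tool side of Task 2, rebuilding an HTTP request from tool arguments could look roughly like the following. The `build_request` helper, the `body` argument convention, the `streets` resource, and the localhost base URL are all assumptions made for illustration.

```python
import re
from typing import Any, Dict, Optional, Tuple
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumed in-container address of M3 REST


def build_request(method: str, path_template: str,
                  args: Dict[str, Any]) -> Tuple[str, str, Optional[dict]]:
    """Split tool arguments into path params, a request body, and query params."""
    remaining = dict(args)

    def fill(match):
        # Consume the argument named by the {placeholder} and URL-quote it.
        return quote(str(remaining.pop(match.group(1))), safe="")

    path = re.sub(r"\{(\w+)\}", fill, path_template)
    # Assumed convention: a "body" argument carries the JSON payload for POST.
    body = remaining.pop("body", None) if method == "POST" else None
    query = urlencode(remaining)  # whatever is left becomes the query string
    url = BASE_URL + path + (f"?{query}" if query else "")
    return method, url, body


by_pk = build_request("GET", "/v1/address/streets/{id}", {"id": 7})
filtered = build_request("POST", "/v1/address/streets/filter",
                         {"body": {"city": "Oslo"}, "limit": 5})
print(by_pk)
print(filtered)
```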
The benchmark tests routing across two heterogeneous tool sets: the BPO
(Business Process Outsourcing) tools for the bpo domain, and the standard
M3 REST SQL tools for all other domains. A single MCP entry point handles
both by exec-replacing itself with the correct server at startup.
| What | Detail |
|---|---|
| FastAPI services | M3 REST on port 8000 |
| MCP entry point | python /app/environment/bpo/mcp/bpo_router.py |
| Router source | environment/bpo/mcp/bpo_router.py |
| BPO server source | apis/bpo/mcp/server.py |
| M3 server source | apis/m3/rest/mcp_server.py |
bpo_router.py is a six-line shim. It reads MCP_DOMAIN and calls
os.execv to replace itself with the target server process — there is no
proxy layer. The MCP client's stdio pipe connects directly to the chosen server.
    target = BPO_SERVER if domain == "bpo" else M3_REST_SERVER
    os.execv(sys.executable, [sys.executable, target])

- BPO server (bpo domain): uses the FastMCP framework with @mcp.tool() decorators. Tools are statically defined and call in-process Python functions against BPO data; no FastAPI is involved.
- M3 REST server (all other domains): same FastAPIMCPServer as Task 2, wrapping the M3 REST FastAPI on port 8000.
Benchmark Client (host)
│
│ docker exec -i -e CAPABILITY_ID=3 -e MCP_DOMAIN=bpo (or MCP_DOMAIN=airline)
│ capability_3_multihop_reasoning_m3_environ
│ python /app/mcp_dispatch.py
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ host / container boundary ─ ─ ─ ─
▼
┌──────────────────────────────────────────────────────────────┐
│ capability_3_multihop_reasoning_m3_environ │
│ │
│ mcp_dispatch.py (CAPABILITY_ID=3) │
│ └─ os.execv → bpo_router.py ──┬── MCP_DOMAIN=bpo ───► │
│ │ │
│ ┌──────────┴──────────────────────┐ │
│ │ BPO FastMCP server (stdio) │ │
│ │ @mcp.tool() — 30+ static tools │ │
│ │ → in-process BPO Python APIs │ │
│ └─────────────────────────────────┘ │
│ │ │
│ └── MCP_DOMAIN=<other> ► │
│ │
│ ┌─────────────────────────────────┐ │
│ │ FastAPIMCPServer (stdio) │ │
│ │ → GET :8000/openapi.json │ │
│ │ → HTTP /v1/{domain}/... │ │
│ └──────────────┬──────────────────┘ │
│ │ HTTP :8000 │
│ ▼ │
│ M3 REST FastAPI (port 8000) │
│ │ │
│ ▼ │
│ SQLite /app/db/{domain}/*.sqlite │
└──────────────────────────────────────────────────────────────┘
The agent has access to both structured SQL tools (M3 REST) and unstructured semantic search (ChromaDB retriever) in a single session. Tool filtering is asymmetric by design:
- M3 REST tools are scoped to the primary domain only (e.g. /v1/address/*).
- Retriever tools are exposed for the primary domain plus its negative (distractor) domains looked up from domain_negatives.json. For address this means query_address, query_olympics, query_card_games, query_legislator, and query_craftbeer. The agent must retrieve from the correct collection despite having access to confusable alternatives.
| What | Detail |
|---|---|
| FastAPI services | M3 REST on port 8000 + Retriever on port 8001 |
| MCP entry point | python /app/retrievers/capability_4_mcp_server.py |
| MCP source | environment/retrievers/capability_4_mcp_server.py |
| Negatives config | apis/retrievers/domain_negatives.json |
Capability4CombinedMCPServer fetches /openapi.json from both FastAPI servers
at startup, builds a merged tools_cache, and stores the originating
backend_url in each tool's _metadata. On call_tool it looks up the tool,
reads _metadata["backend_url"], and routes the HTTP request to the correct
service.
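A minimal sketch of the merged cache and per-tool routing, assuming each tool is a plain dict with a name and a path. `merge_tools`, `route`, and the localhost URLs are hypothetical; only the `_metadata["backend_url"]` lookup mirrors the description above.

```python
from typing import Dict, Iterable

M3_REST_URL = "http://localhost:8000"    # assumed in-container address
RETRIEVER_URL = "http://localhost:8001"  # assumed in-container address


def merge_tools(m3_tools: Iterable[dict],
                retriever_tools: Iterable[dict]) -> Dict[str, dict]:
    """Build one cache keyed by tool name, tagging each tool with its backend."""
    cache: Dict[str, dict] = {}
    for backend_url, tools in ((M3_REST_URL, m3_tools), (RETRIEVER_URL, retriever_tools)):
        for tool in tools:
            cache[tool["name"]] = {**tool, "_metadata": {"backend_url": backend_url}}
    return cache


def route(cache: Dict[str, dict], name: str) -> str:
    """Return the full URL a call to `name` would be sent to."""
    tool = cache[name]
    return tool["_metadata"]["backend_url"] + tool["path"]


cache = merge_tools(
    [{"name": "get_address_streets", "path": "/v1/address/streets"}],
    [{"name": "query_address", "path": "/address/query"}],
)
print(route(cache, "get_address_streets"))
print(route(cache, "query_address"))
```

Storing the backend on each tool keeps call_tool free of name-pattern checks: the lookup works the same way whichever service the tool came from.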
The negatives for the primary domain are resolved once in main() via
load_retriever_domains(), which reads domain_negatives.json and expands the
primary domain list before the server is constructed.
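The expansion step might look like the sketch below. `expand_retriever_domains` is a hypothetical stand-in for load_retriever_domains(), whose real signature is not shown in this document, and the JSON layout (domain name mapped to a list of negatives) is an assumption.

```python
import json
import os
import tempfile
from typing import List


def expand_retriever_domains(primary: str, negatives_path: str) -> List[str]:
    """Return the primary domain followed by its distractor domains."""
    with open(negatives_path, encoding="utf-8") as fh:
        negatives = json.load(fh)
    # Skip the primary if the file happens to list it among its own negatives.
    return [primary] + [d for d in negatives.get(primary, []) if d != primary]


# Demo with a temporary file standing in for domain_negatives.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "domain_negatives.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"address": ["olympics", "card_games", "legislator", "craftbeer"]}, fh)
    domains = expand_retriever_domains("address", path)

tool_names = [f"query_{d}" for d in domains]
print(tool_names)
```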
| Backend | Filter | Tools exposed | Tool name pattern |
|---|---|---|---|
| M3 REST :8000 | primary domain only | ~40 | operationId from OpenAPI (e.g. get_address_streets) |
| Retriever :8001 | primary + negatives | ~5 | query_{domain} (e.g. query_address, query_olympics) |
Benchmark Client (host)
│
│ docker exec -i -e CAPABILITY_ID=4 -e MCP_DOMAIN=address
│ capability_4_multiturn_m3_environ
│ python /app/mcp_dispatch.py
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ host / container boundary ─ ─ ─ ─
▼
┌────────────────────────────────────────────────────────────────────┐
│ capability_4_multiturn_m3_environ │
│ │
│ mcp_dispatch.py (CAPABILITY_ID=4) │
│ └─ os.execv → python /app/retrievers/capability_4_mcp_server.py │
│ │
│ domain_negatives.json["address"] │
│ → [address, olympics, card_games, legislator, craftbeer] │
│ │ │
│ ┌───────────────────────▼────────────────────────────────────┐ │
│ │ Capability4CombinedMCPServer (stdio) │ │
│ │ │ │
│ │ list_tools() │ │
│ │ ├─ GET :8000/openapi.json │ │
│ │ │ filter: /v1/address/* → M3 REST tools (≈40) │ │
│ │ └─ GET :8001/openapi.json │ │
│ │ filter: address + negatives → query_address, │ │
│ │ query_olympics, query_card_games, ... (5) │ │
│ │ │ │
│ │ call_tool(name, args) [routes via _metadata.backend_url]│ │
│ │ ├─ M3 REST tool → HTTP → :8000/v1/address/... │ │
│ │ └─ query_* → HTTP → :8001/{domain}/query │ │
│ └────────────┬──────────────────────────┬──────────────────┘ │
│ │ HTTP :8000 │ HTTP :8001 │
│ ▼ ▼ │
│ M3 REST FastAPI Retriever FastAPI │
│ (port 8000) (port 8001) │
│ │ │ │
│ ▼ ▼ │
│ SQLite ChromaDB collections │
│ /app/db/{domain}/ (one per domain) │
│ *.sqlite /app/retrievers/chroma_data/ │
└────────────────────────────────────────────────────────────────────┘
POST /{domain}/query body: { query: str, n_results: int }
→ ranked document chunks (ChromaDB semantic search)
Source: apis/retrievers/server.py. Embeddings
model: ibm-granite/granite-embedding-english-r2 (pre-downloaded into the
image; takes up to 5 min to warm up on first container start).
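To make the request shape concrete, the sketch below builds (but does not send) a query request with the standard library. The localhost base URL, the helper name, and the default n_results value are assumptions; only the path and body fields come from the endpoint description above.

```python
import json
import urllib.request

RETRIEVER_URL = "http://localhost:8001"  # assumed in-container address


def build_query_request(domain: str, query: str,
                        n_results: int = 5) -> urllib.request.Request:
    """Build a POST /{domain}/query request with a JSON body."""
    payload = json.dumps({"query": query, "n_results": n_results}).encode("utf-8")
    return urllib.request.Request(
        url=f"{RETRIEVER_URL}/{domain}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request("address", "streets near the harbour", n_results=3)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen() inside the container would perform the search; the response format is not specified here, so the sketch stops before that step.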
| File | Role |
|---|---|
| docker/Dockerfile.unified | Single image for all tasks |
| docker/entrypoint-unified.sh | Starts FastAPI services; waits for health |
| docker/mcp_dispatch.py | Single MCP entrypoint; reads CAPABILITY_ID, os.execv()s into the right server |
| benchmark/mcp_connection_config.yaml | Maps task IDs → containers + MCP commands |
| benchmark/mcp_client.py | Builds docker exec command; opens MCP ClientSession |
| apis/m3/rest/app.py | M3 REST FastAPI; 45+ domain routers, port 8000 |
| apis/m3/rest/mcp_server.py | OpenAPI→MCP wrapper for M3 REST (Tasks 2 & 3) |
| apis/m3/python_tools/mcp/cli.py | CLI entry point; reads env vars, calls create_server() (Task 1) |
| apis/m3/python_tools/mcp/mcp_server.py | RouterMCPServer / SlotFillingMCPServer / SelectionMCPServer + create_server() factory (Task 1) |
| apis/m3/python_tools/mcp/config.py | MCPServerConfig dataclass; resolves MCP_DOMAIN → SQLite path (Task 1) |
| apis/configs/mcp_tool_universe_id_mapping.yaml | Tool universe registry: universe IDs, init args, server_type per query (Task 1) |
| apis/bpo/mcp/server.py | BPO FastMCP server with @mcp.tool() decorators (Task 3) |
| environment/bpo/mcp/bpo_router.py | os.execv router → BPO or M3 REST (Task 3) |
| apis/retrievers/server.py | Retriever FastAPI; ChromaDB, port 8001 |
| environment/retrievers/capability_4_mcp_server.py | Combined MCP server; merges M3 REST + Retriever (Task 4) |
| apis/retrievers/domain_negatives.json | Maps each domain to its distractor domains for retriever tool expansion (Task 4) |