@@ -1,12 +1,30 @@
-# Agent Environment
+# MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
 
-A complete setup with ~45 pre-configured Model Context Protocol (MCP) servers for AI agents, plus an agent completion service and evaluation scripts for testing.
+MCP-Atlas is a comprehensive benchmark for evaluating AI models' tool-use capabilities across 45 Model Context Protocol (MCP) servers. It provides a standardized environment for running agent completions and evaluating performance with LLM-as-judge methodology.
 
-Some MCP servers don't require API keys, but many others require you to get your own API keys. See `env.template` for where to get keys.
+- Paper: [LINK TO PAPER - TODO]
+- Leaderboard: [https://scale.com/leaderboard/mcp_atlas](https://scale.com/leaderboard/mcp_atlas)
+- Dataset: [LINK TO HUGGINGFACE/DATASET - TODO]
+
+## What is MCP-Atlas?
+MCP-Atlas evaluates how well AI agents can use tools to complete real-world tasks. The benchmark includes:
+
+- 45 MCP servers spanning categories like search, code execution, databases, APIs, and productivity tools
+  - 25 don't require any setup, 15 require you to get API keys, and 5 require API keys and data setup (detailed below).
+- 500 evaluation prompts with ground-truth expected tool calls and answers
+- LLM-as-judge evaluation producing pass rate, coverage rate, and detailed diagnostics
+- Dockerized environment ensuring reproducible results across different machines
+
+
 
 ## Quick Start
 
-This project depends on these CLI tools: [docker](https://www.docker.com/products/docker-desktop/), [uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods), and python3.
+This project depends on these CLI tools: [docker](https://www.docker.com/products/docker-desktop/), [uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods), [jq](https://jqlang.org/download/), and Python 3.10+.
+
+```bash
+git clone git@github.com:scaleapi/mcp-atlas.git
+cd mcp-atlas
+```
 
 ### 1. Configure environment
 
@@ -38,7 +56,7 @@ make build && make run-docker
 
 This starts the agent-environment service on port 1984 (it takes a minute or more to initialize). Before continuing, wait until you see the log line "Uvicorn running on http://0.0.0.0:1984".
 
-By default, 25 servers that don't require API keys are enabled. Servers requiring API keys are auto-enabled only if you've set their keys in `.env`. To see the enabled mcp servers and confirm they're online: `curl -s http://localhost:1984/enabled-servers | jq -c`
+By default, [25 servers](services/agent-environment/src/agent_environment/mcp_client.py#L23) that don't require API keys are enabled. Servers requiring API keys are auto-enabled only if you've set their keys in `.env`. To see the enabled MCP servers and confirm they're online: `curl -s http://localhost:1984/enabled-servers | jq -c`
 
 Optional: to check what tools are available, run this curl script: `./services/agent-environment/dev_scripts/debug_and_concurrency_tests/curl_scripts/mcp__list_tools.sh | jq > list_tools.json ; open list_tools.json`
 
@@ -71,7 +89,7 @@ curl -X POST http://localhost:3000/v2/mcp_eval/run_agent \
 cd services/mcp_eval
 ```
 
-Run the script with a small sample of 10 tasks. This will use the specified input CSV file. It should be solvable with only the 25 MCP servers that don't require any API keys (enabled by default). For details on servers, see `env.template` and `mcp_server_template.json`.
+Run the script with a small sample of 10 tasks. This will use the specified input CSV file. It should be solvable with only the 25 MCP servers that don't require any API keys (enabled by default). For details on servers, see `env.template` and [`mcp_server_template.json`](services/agent-environment/src/agent_environment/mcp_server_template.json).
 
 ```bash
 uv run python mcp_completion_script.py \
@@ -133,7 +151,7 @@ Approximately 18% of evaluation tasks work with the 25 default servers. To run m
 - **MongoDB** - Restore `data_exports/mongo_dump_video_game_store-UNZIP-FIRST.zip` (486KB) using `mongorestore`
 - **Slack** - Import `data_exports/slack_mcp_eval_export_add100days.zip` (43KB) at your workspace's import page
 
-**See `data_exports/README.md` for detailed setup instructions for each service.** Without this sample data, these MCP servers will still function but may return empty results when evaluation tasks reference specific data.
+**See [`data_exports/README.md`](data_exports/README.md) for detailed setup instructions for each service.** Without this sample data, tasks that use these servers will return erroneous results because they cannot find the expected data.
 
 Note: Some services are paid and require billing setup.
 
@@ -167,7 +185,7 @@ uv run mcp_evals_scores.py \
 
 ## What's Included
 
-- **45+ MCP servers** including calculator, Wikipedia, filesystem, Git, weather, GitHub, and more
+- **45 MCP servers** including calculator, Wikipedia, filesystem, Git, weather, GitHub, and more
 - **Agent completion service** for running multi-turn LLM conversations with tool use
 - **Docker containerization** for consistent MCP server environments
 - **HTTP APIs** for tool calling and listing available tools
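The headline metrics mentioned above (pass rate and coverage rate) reduce to simple ratios over per-task judge results. A hedged Python sketch; the record field names (`passed`, `expected_tools`, `called_tools`) are illustrative placeholders, not the benchmark's actual output schema — see `mcp_evals_scores.py` for the real scoring logic:

```python
def score_run(records):
    # `records` is a list of per-task judge results; the dict keys below are
    # hypothetical, chosen only to illustrate the two ratios.
    total = len(records)
    if total == 0:
        raise ValueError("no records to score")
    # Pass rate: fraction of tasks the judge marked as passed.
    pass_rate = sum(bool(r["passed"]) for r in records) / total
    # Coverage rate (one plausible definition): fraction of tasks whose
    # expected tool calls were all actually made by the agent.
    covered = sum(
        set(r["expected_tools"]) <= set(r["called_tools"]) for r in records
    )
    return {"pass_rate": pass_rate, "coverage_rate": covered / total}
```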