Commit da71ef3

improve readme
1 parent b193b70 commit da71ef3

6 files changed: 68 additions and 171 deletions

README.md

Lines changed: 26 additions & 8 deletions
@@ -1,12 +1,30 @@
-# Agent Environment
+# MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
 
-A complete setup with ~45 pre-configured Model Context Protocol (MCP) servers for AI agents, plus an agent completion service and evaluation scripts for testing.
+MCP-Atlas is a comprehensive benchmark for evaluating AI models' tool-use capabilities across 45 Model Context Protocol (MCP) servers. It provides a standardized environment for running agent completions and evaluating performance with LLM-as-judge methodology.
 
-Some MCP servers don't require API keys, but many others require you to get your own API keys. See `env.template` for where to get keys.
+- Paper: [LINK TO PAPER - TODO]
+- Leaderboard: [https://scale.com/leaderboard/mcp_atlas](https://scale.com/leaderboard/mcp_atlas)
+- Dataset: [LINK TO HUGGINGFACE/DATASET - TODO]
+
+## What is MCP-Atlas?
+MCP-Atlas evaluates how well AI agents can use tools to complete real-world tasks. The benchmark includes:
+
+- 45 MCP servers spanning categories like search, code execution, databases, APIs, and productivity tools
+- 25 don't require any setup, 15 require you to get API keys, and 5 require API keys and data setup (detailed below).
+- 500 evaluation prompts with ground-truth expected tool calls and answers
+- LLM-as-judge evaluation producing pass rate, coverage rate, and detailed diagnostics
+- Dockerized environment ensuring reproducible results across different machines
+
+![MCP-Atlas Architecture](assets/architecture-diagram.png)
 
 ## Quick Start
 
-This project depends on these CLI tools: [docker](https://www.docker.com/products/docker-desktop/), [uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods), and python3.
+This project depends on these CLI tools: [docker](https://www.docker.com/products/docker-desktop/), [uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods), [jq](https://jqlang.org/download/), and python 3.10+.
+
+```bash
+git clone git@github.com:scaleapi/mcp-atlas.git
+cd mcp-atlas
+```
 
 ### 1. Configure environment

@@ -38,7 +56,7 @@ make build && make run-docker
 
 This starts the agent-environment service on port 1984 (takes 1+ minute to initialize). Before continuing, please wait for this to finish, you'll see log "Uvicorn running on http://0.0.0.0:1984".
 
-By default, 25 servers that don't require API keys are enabled. Servers requiring API keys are auto-enabled only if you've set their keys in `.env`. To see the enabled mcp servers and confirm they're online: `curl -s http://localhost:1984/enabled-servers | jq -c`
+By default, [25 servers](services/agent-environment/src/agent_environment/mcp_client.py#L23) that don't require API keys are enabled. Servers requiring API keys are auto-enabled only if you've set their keys in `.env`. To see the enabled mcp servers and confirm they're online: `curl -s http://localhost:1984/enabled-servers | jq -c`
 
 Optional: to check what tools are available, you can use this CURL script `./services/agent-environment/dev_scripts/debug_and_concurrency_tests/curl_scripts/mcp__list_tools.sh | jq > list_tools.json ; open list_tools.json`

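The readiness check described in the hunk above (wait for the service on port 1984, then query `/enabled-servers`) could also be scripted instead of watching the logs. The following is a minimal sketch, assuming only what the README states: that `http://localhost:1984/enabled-servers` answers with JSON once the service is up. The helper names `http_get` and `wait_for_enabled_servers`, their parameters, and the default wait times are hypothetical and not part of the repository. The shape of the JSON response is not documented in this diff, so the sketch returns whatever `json.loads` produces.

```python
import json
import time
import urllib.request

# URL taken from the README; every name below is illustrative only.
ENABLED_SERVERS_URL = "http://localhost:1984/enabled-servers"


def http_get(url, timeout=5.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def wait_for_enabled_servers(url=ENABLED_SERVERS_URL, fetch=http_get,
                             max_wait=180.0, interval=5.0, sleep=time.sleep):
    """Poll `url` until it answers with valid JSON, then return the parsed body.

    Raises TimeoutError if no valid answer arrives within `max_wait` seconds.
    """
    waited = 0.0
    while True:
        try:
            return json.loads(fetch(url))
        except (OSError, ValueError):
            # OSError covers refused connections while the service boots;
            # ValueError covers a response body that is not valid JSON.
            if waited >= max_wait:
                raise TimeoutError(f"{url} not ready after {max_wait:.0f}s")
            sleep(interval)
            waited += interval
```

Calling `wait_for_enabled_servers()` with no arguments polls the default URL every 5 seconds for up to 3 minutes. The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running service.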
@@ -71,7 +89,7 @@ curl -X POST http://localhost:3000/v2/mcp_eval/run_agent \
 cd services/mcp_eval
 ```
 
-Run the script with a small sample of 10 tasks. This will use the specified input CSV file. It should be solvable with only the 25 MCP servers that don't require any API keys (enabled by default). For details on servers, see `env.template` and `mcp_server_template.json`.
+Run the script with a small sample of 10 tasks. This will use the specified input CSV file. It should be solvable with only the 25 MCP servers that don't require any API keys (enabled by default). For details on servers, see `env.template` and [`mcp_server_template.json`](services/agent-environment/src/agent_environment/mcp_server_template.json).
 
 ```bash
 uv run python mcp_completion_script.py \
@@ -133,7 +151,7 @@ Approximately 18% of evaluation tasks work with the 25 default servers. To run m
 - **MongoDB** - Restore `data_exports/mongo_dump_video_game_store-UNZIP-FIRST.zip` (486KB) using `mongorestore`
 - **Slack** - Import `data_exports/slack_mcp_eval_export_add100days.zip` (43KB) at your workspace's import page
 
-**See `data_exports/README.md` for detailed setup instructions for each service.** Without this sample data, these MCP servers will still function but may return empty results when evaluation tasks reference specific data.
+**See [`data_exports/README.md`](data_exports/README.md) for detailed setup instructions for each service.** Without this sample data, tasks that use these servers will return erroneous results because they cannot find the expected data.
 
 Note: Some services are paid and require billing setup.

@@ -167,7 +185,7 @@ uv run mcp_evals_scores.py \
 
 ## What's Included
 
-- **45+ MCP servers** including calculator, Wikipedia, filesystem, Git, weather, GitHub, and more
+- **45 MCP servers** including calculator, Wikipedia, filesystem, Git, weather, GitHub, and more
 - **Agent completion service** for running multi-turn LLM conversations with tool use
 - **Docker containerization** for consistent MCP server environments
 - **HTTP APIs** for tool calling and listing available tools

assets/architecture-diagram.png

359 KB

data_exports/README.md

Lines changed: 2 additions & 3 deletions
@@ -1,4 +1,4 @@
-# Data Exports for MCP Server Testing (for agent-environment)
+# Data Exports for MCP Server Testing
 
 This directory contains sample data that can be uploaded to various online services to create test environments that match the state used in existing test prompts and evaluations.

@@ -20,8 +20,7 @@ To reproduce test results or run evaluations against known data states, you'll n
 | Google Calendar | `calendar_mcp_eval_export.zip` | Sample calendar events (unzip as .ics) (8KB) |
 | Notion | `notion_mcp_eval_export.zip` | Sample pages and databases (13MB) |
 | MongoDB | `mongo_dump_video_game_store.zip` | Sample video game store database (unzip as folder) (486KB) |
-| Slack | `slack_mcp_eval_export_add100days.zip` | Sample workspace data (27KB) events timestamped for early Oct 2025 |
-| Slack | `slack_mcp_eval_export.zip` | Sample workspace data (27KB) events from late June 2025 (past 90 day window for free slack) |
+| Slack | `slack_mcp_eval_export.zip` | Sample workspace data (27KB) events timestamped for early Dec 2025 (slack free accounts hide messages older than 90 days) |
 
 ## Setup

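The Slack row in the hunk above ties the export's timestamps to the 90-day message limit on free Slack accounts, so an import only helps while its messages are still inside that window. The date arithmetic can be sketched as below. The function names are hypothetical, the example date is a placeholder standing in for "early Dec 2025" (the real timestamps live in the export itself), and treating day 90 as still visible is an assumption about the boundary.

```python
from datetime import date, timedelta

# 90-day limit as stated in the table row for the Slack export.
FREE_PLAN_WINDOW = timedelta(days=90)


def visible_until(message_date, window=FREE_PLAN_WINDOW):
    """Last day on which a message from `message_date` is still inside the window."""
    return message_date + window


def is_visible(message_date, today, window=FREE_PLAN_WINDOW):
    """True if the message is no older than `window` on `today` (boundary assumed inclusive)."""
    return today - message_date <= window


# Placeholder date, not read from the export.
example_message = date(2025, 12, 1)
print(visible_until(example_message))
print(is_visible(example_message, date(2026, 2, 1)))
print(is_visible(example_message, date(2026, 4, 1)))
```

Running the same check against the run date before an evaluation gives an early warning that the Slack sample data may need re-importing with shifted timestamps.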
13.5 KB
Binary file not shown.
-42.5 KB
Binary file not shown.

services/mcp_eval/sample_tasks.csv

Lines changed: 40 additions & 160 deletions
Large diffs are not rendered by default.
