MCP-Atlas is a comprehensive benchmark for evaluating AI models' tool-use capabilities across 36 Model Context Protocol (MCP) servers. It provides a standardized environment for running agent completions and evaluating performance with LLM-as-judge methodology.
- Paper: [https://static.scale.com/uploads/674f4cc7a74e35bcaae1c29a/MCP_Atlas.pdf](https://static.scale.com/uploads/674f4cc7a74e35bcaae1c29a/MCP_Atlas.pdf) (or a [local copy](assets/MCP_Atlas.pdf))
- `scored_gpt51.csv` - Coverage scores for each task. On macOS, the Numbers app opens CSV files with multi-line rows more reliably than most spreadsheet tools.
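Multi-line rows are ordinary quoted CSV, so any standards-compliant parser reads them correctly; the trouble only arises in tools that split on raw newlines. A minimal sketch (the column names here are hypothetical, not the file's actual schema):

```python
import csv
import io

# Hypothetical miniature of a scored CSV: the rationale column contains
# an embedded newline, quoted per the CSV convention. Naive line-based
# splitting would see two rows; a real CSV parser sees one.
raw = 'task_id,coverage_score,rationale\n'
raw += '1,0.8,"Step 1 met.\nStep 2 missed."\n'

rows = list(csv.DictReader(io.StringIO(raw)))
print(len(rows))              # one logical row despite two physical lines
print(rows[0]["rationale"])   # newline preserved inside the field
```

The same behavior applies when reading from a file: open it with `newline=''` and let the `csv` module handle quoting.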
### 9. Evaluate other models
To benchmark other models, repeat step 8 with a different `--model` and `--output`.
If you change `LLM_API_KEY`, you'll also need to restart `make run-mcp-completion`.
See [LiteLLM's supported models](https://docs.litellm.ai/docs/providers) for the full list of available providers and model names. For self-hosted models, change `LLM_BASE_URL`.
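As a sketch, pointing the completion service at a self-hosted, OpenAI-compatible endpoint might look like the following `.env` entries. The URL and key values are placeholders, not project defaults; model names follow LiteLLM's `provider/model` convention.

```env
# Placeholder values for a self-hosted endpoint (adjust to your deployment)
LLM_BASE_URL=http://localhost:8000/v1
LLM_API_KEY=replace-with-your-key
```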
```env
# System prompt for the model (only used if USE_SYSTEM_PROMPT_IN_COMPLETION=true)
SYSTEM_PROMPT="Role: You are a factual, tool-aware assistant connected to a variety of tools. Use the available tools to answer the user query. Do not ask the user for clarification; fully complete the task using the information provided in the prompt."
```