2025-10-28: We released our paper and project page! 🎉
📄 Read the Paper | 🌐 Visit the Project Page
OSWorld-MCP is a comprehensive and fair benchmark for evaluating computer-use agents in real-world scenarios.
It jointly measures Model Context Protocol (MCP) tool invocation capabilities, graphical user interface (GUI) operation skills, and decision-making performance.
Designed as an extension of OSWorld, it significantly improves realism, balance, and comparability in evaluation.
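To make the hybrid action space concrete, the sketch below contrasts the two kinds of actions an agent chooses between at each step. The action encoding and tool name are illustrative assumptions, not the benchmark's actual schema:

```python
# Illustrative only: the two action types an OSWorld-MCP agent decides
# between at every step. Field names and the tool name are hypothetical.

# A low-level GUI operation (OSWorld uses pyautogui-style commands)
gui_action = {
    "type": "gui",
    "command": "pyautogui.click(x=512, y=384)",
}

# An MCP tool invocation that achieves the same goal in one call
mcp_action = {
    "type": "mcp_tool_call",
    "tool": "calc_set_cell_value",             # hypothetical Calc tool
    "arguments": {"cell": "B2", "value": "42"},
}
```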
Key Features & Findings
- 158 validated MCP tools spanning 7 common applications (LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, and OS utilities); 25 of them are distractor tools for robustness testing (see the example tool schema after this list)
- 250 tool-beneficial tasks → 69% of benchmark tasks benefit from MCP tools
- Multi-round tool invocation is supported, posing realistic decision-making challenges
- MCP tools boost model accuracy and efficiency; e.g., OpenAI o3: 8.3% → 20.4% accuracy (15 steps)
- The highest observed Tool Invocation Rate (TIR) is 36.3% (Claude-4-Sonnet, 50 steps), indicating ample room for improvement
- MCP tools consistently improve agent metrics
- Higher tool invocation rates correlate with higher accuracy
- Combining multiple tools introduces significant challenges
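MCP tools are described to the agent as JSON schemas, following the Model Context Protocol. The `name`, `description`, and `inputSchema` fields below are part of the MCP spec, but the concrete tool shown is an invented illustration, not one of the 158 released tools:

```python
# Hypothetical MCP tool definition in standard MCP format.
# The tool name, description, and parameters are made up for illustration.
example_tool = {
    "name": "writer_replace_text",  # hypothetical LibreOffice Writer tool
    "description": "Replace all occurrences of a string in the open document.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "search":  {"type": "string", "description": "Text to find"},
            "replace": {"type": "string", "description": "Replacement text"},
        },
        "required": ["search", "replace"],
    },
}
```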
Architecture Overview

Figure: OSWorld-MCP evaluation framework integrating GUI actions and MCP tool invocations.
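The framework boils down to an observe-act loop: at each step the agent sees the current screen and emits either a GUI action or an MCP tool call. Below is a minimal sketch of that loop, assuming hypothetical `env`/`agent` interfaces; the actual implementation lives in run_multienv_e2e.py:

```python
# Minimal sketch of the OSWorld-MCP evaluation loop. The env/agent
# interfaces are assumed for illustration; see run_multienv_e2e.py
# for the real implementation.
def evaluate_task(env, agent, max_steps: int = 15) -> bool:
    obs = env.reset()                      # initial screenshot + task instruction
    for _ in range(max_steps):
        action = agent.act(obs)            # GUI action or MCP tool call
        if action["type"] == "mcp_tool_call":
            result = env.call_mcp_tool(action["tool"], action["arguments"])
        else:
            result = env.execute_gui(action["command"])  # e.g. a pyautogui string
        obs = env.observe()                # new screenshot after the action
        if agent.is_done(obs, result):
            break
    return env.evaluate()                  # task-specific success check
```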
```bash
# Clone the OSWorld base repo
git clone https://github.com/xlang-ai/OSWorld.git

# Clone OSWorld-MCP
git clone https://github.com/X-PLUG/OSWorld-MCP.git
```

Integrate the OSWorld-MCP files into OSWorld to enable MCP support:
- Copy the MCP files into `/home` inside the Docker container:

```
/home/
├── mcp_server/
└── osworld_mcp_client.py
```
- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install Node.js
- Launch the MCP server:

```bash
cd mcp_server
bash debug_server.sh
```

A successful launch opens the local MCP debug UI in your browser.
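Once the server is up, you can sanity-check it from Python. Here is a hedged sketch using the official `mcp` SDK over SSE; the endpoint URL and transport are assumptions about how debug_server.sh exposes the server, and the benchmark's own client is osworld_mcp_client.py:

```python
# Sketch: list the tools exposed by a running MCP server using the
# official `mcp` Python SDK. The SSE URL is an assumption; adapt it
# to however your server is actually exposed.
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

async def main():
    async with sse_client("http://localhost:8000/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())
```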
Example: Evaluate Claude 4 Sonnet (15 steps):
```bash
python run_multienv_e2e.py \
    --api_url <your_api_url> \
    --api_key <your_api_key> \
    --model 'claude-sonnet-4-20250514-thinking' \
    --test_all_meta_path 'evaluation_examples/test_all.json' \
    --num_envs 1 \
    --action_space mcp \
    --max_steps 15 \
    --max_trajectory_length 15
```

Evaluation Metrics
- Task Accuracy (Acc): percentage of tasks completed successfully.
- Tool Invocation Rate (TIR): rate of correct decisions about whether or not to invoke a tool.
- Average Completion Steps (ACS): average number of actions per completed task.
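As a worked illustration of how these metrics aggregate, here is a minimal sketch computing Acc, TIR, and ACS over a toy set of trajectories. The log format is invented for illustration and is not the benchmark's actual scorer:

```python
# Toy computation of Acc, TIR, and ACS over hypothetical trajectories.
# Each record holds task success, per-step tool-decision correctness,
# and the number of steps taken. The format is illustrative only.
trajectories = [
    {"success": True,  "tool_decisions": [True, True, False], "steps": 3},
    {"success": False, "tool_decisions": [True, False],       "steps": 2},
    {"success": True,  "tool_decisions": [True, True],        "steps": 2},
]

# Acc: percentage of tasks completed successfully
acc = 100 * sum(t["success"] for t in trajectories) / len(trajectories)

# TIR: rate of correct tool-use decisions across all decision points
all_decisions = [d for t in trajectories for d in t["tool_decisions"]]
tir = 100 * sum(all_decisions) / len(all_decisions)

# ACS: average number of actions per completed task
completed = [t for t in trajectories if t["success"]]
acs = sum(t["steps"] for t in completed) / len(completed)

print(f"Acc: {acc:.1f}%  TIR: {tir:.1f}%  ACS: {acs:.1f}")
```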
🔗 Live Leaderboard: osworld-mcp.github.io
Max Steps: 15
| Model / Agent | Acc (%) | TIR (%) | ACS |
|---|---|---|---|
| Agent-S2.5 | 42.1 | 30.0 | 10.0 |
| Claude-4-Sonnet | 35.3 | 30.0 | 10.4 |
| Seed1.5-VL | 32.0 | 25.1 | 10.2 |
| Qwen3-VL | 31.3 | 24.5 | 10.5 |
| Gemini-2.5-Pro | 20.5 | 16.8 | 11.4 |
| OpenAI o3 | 20.4 | 16.7 | 11.6 |
| Qwen2.5-VL | 15.8 | 13.1 | 13.5 |
Max Steps: 50
| Model / Agent | Acc (%) | TIR (%) | ACS |
|---|---|---|---|
| Agent-S2.5 | 49.5 | 35.3 | 17.0 |
| Claude-4-Sonnet | 43.3 | 36.6 | 20.1 |
| Qwen3-VL | 39.1 | 29.5 | 21.1 |
| Seed1.5-VL | 38.4 | 29.0 | 23.0 |
| Gemini-2.5-Pro | 27.2 | 21.5 | 29.7 |
| OpenAI o3 | 25.2 | 21.0 | 32.1 |
| Qwen2.5-VL | 14.8 | 10.9 | 37.2 |
Citation

```bibtex
@article{jia2025osworldmcp,
  title={OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents},
  author={Jia, Hongrui and Liao, Jitong and Zhang, Xi and Xu, Haiyang and Xie, Tianbao and Jiang, Chaoya and Yan, Ming and Liu, Si and Ye, Wei and Huang, Fei},
  year={2025},
  journal={arXiv preprint arXiv:2510.24563}
}
```