
OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

🔔 Updates

2025-10-28: We released our paper and project page! 🎉

📄 Read the Paper  |  🌐 Visit the Project Page


📑 Overview & Key Highlights

OSWorld-MCP is a comprehensive and fair benchmark for evaluating computer-use agents in real-world scenarios.
It jointly measures Model Context Protocol (MCP) tool invocation capabilities, graphical user interface (GUI) operation skills, and decision-making performance.
Designed as an extension of OSWorld, it significantly improves realism, balance, and comparability in evaluation.

Key Features & Findings

  • 158 validated MCP tools spanning 7 common applications (LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, OS utilities), including 25 distractor tools for robustness testing
  • 250 tool-beneficial tasks: 69% of benchmark tasks benefit from MCP tools
  • Multi-round tool invocation is supported, posing realistic decision-making challenges
  • MCP tools boost model accuracy and efficiency (e.g., OpenAI o3: 8.3% → 20.4% at 15 steps)
  • The highest observed Tool Invocation Rate (TIR) is 36.3% (Claude-4-Sonnet, 50 steps), indicating ample room for improvement
  • Higher tool invocation rates correlate with higher accuracy
  • Combining multiple tools remains a significant challenge

Architecture Overview

Figure: OSWorld-MCP evaluation framework integrating GUI actions and MCP tool invocations.


⚙️ Installation & Usage

1️⃣ Preparation: Code Setup

```shell
# Clone the OSWorld base repo
git clone https://github.com/xlang-ai/OSWorld.git

# Clone OSWorld-MCP
git clone https://github.com/X-PLUG/OSWorld-MCP.git
```

Integrate OSWorld-MCP files into OSWorld to enable MCP support.
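The integration step above amounts to overlaying the OSWorld-MCP checkout onto the OSWorld tree. A minimal sketch of that merge is below; the function, its name, and the checkout paths are illustrative assumptions, not part of either repository:

```python
import shutil
from pathlib import Path

def merge_repo(src: str, dst: str) -> list[str]:
    """Copy every file from an OSWorld-MCP checkout into an OSWorld
    checkout, overwriting on conflict. Returns the copied relative paths.
    (Illustrative helper; adjust paths/filters to your local layout.)"""
    copied = []
    src_root, dst_root = Path(src), Path(dst)
    for f in sorted(src_root.rglob("*")):
        if f.is_file() and ".git" not in f.parts:
            rel = f.relative_to(src_root)
            target = dst_root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)
            copied.append(str(rel))
    return copied

# Example (placeholder paths for your local clones):
# merge_repo("OSWorld-MCP", "OSWorld")
```

Review the copied file list before committing, since an overlay copy silently overwrites files that exist in both trees.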


2️⃣ Preparation: Docker Environment

  1. Copy the MCP files into `/home` inside the Docker container:

     ```
     /home/
     └── mcp_server/
         └── osworld_mcp_client.py
     ```

  2. Install dependencies:

     ```shell
     pip install -r requirements.txt
     ```

  3. Install Node.js.

  4. Launch the MCP server:

     ```shell
     cd mcp_server
     bash debug_server.sh
     ```

A successful launch opens the local MCP debug UI in your browser.
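When scripting this launch, it can help to poll the debug UI until the server answers instead of sleeping a fixed interval. A small sketch follows; the URL and port are placeholders, since the actual address depends on your `debug_server.sh` configuration:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 30.0) -> bool:
    """Poll `url` until any HTTP response arrives or `timeout` seconds pass.
    A 4xx/5xx reply still counts as "up" — the process is listening."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=2)
            return True
        except urllib.error.HTTPError:
            return True  # server responded, just with an error status
        except (urllib.error.URLError, OSError):
            time.sleep(0.5)  # not listening yet; retry
    return False

# Placeholder address — substitute whatever debug_server.sh prints:
# wait_for_server("http://127.0.0.1:6274", timeout=30)
```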


3️⃣ Running Evaluation

Example: Evaluate Claude 4 Sonnet (15 steps):

```shell
python run_multienv_e2e.py \
    --api_url <your_api_url> \
    --api_key <your_api_key> \
    --model 'claude-sonnet-4-20250514-thinking' \
    --test_all_meta_path 'evaluation_examples/test_all.json' \
    --num_envs 1 \
    --action_space mcp \
    --max_steps 15 \
    --max_trajectory_length 15
```

📐 Key Metrics

  1. Task Accuracy (Acc): percentage of tasks successfully completed.
  2. Tool Invocation Rate (TIR): percentage of steps at which the agent correctly decides whether or not to invoke a tool.
  3. Average Completion Steps (ACS): average number of actions per successfully completed task.

📊 Leaderboard (Sorted by Accuracy)

🔗 Live Leaderboard: osworld-mcp.github.io

Max Steps: 15

| Model / Agent | Acc | TIR | ACS |
|---|---:|---:|---:|
| Agent-S2.5 | 42.1 | 30.0 | 10.0 |
| Claude-4-Sonnet | 35.3 | 30.0 | 10.4 |
| Seed1.5-VL | 32.0 | 25.1 | 10.2 |
| Qwen3-VL | 31.3 | 24.5 | 10.5 |
| Gemini-2.5-Pro | 20.5 | 16.8 | 11.4 |
| OpenAI o3 | 20.4 | 16.7 | 11.6 |
| Qwen2.5-VL | 15.8 | 13.1 | 13.5 |

Max Steps: 50

| Model / Agent | Acc | TIR | ACS |
|---|---:|---:|---:|
| Agent-S2.5 | 49.5 | 35.3 | 17.0 |
| Claude-4-Sonnet | 43.3 | 36.6 | 20.1 |
| Qwen3-VL | 39.1 | 29.5 | 21.1 |
| Seed1.5-VL | 38.4 | 29.0 | 23.0 |
| Gemini-2.5-Pro | 27.2 | 21.5 | 29.7 |
| OpenAI o3 | 25.2 | 21.0 | 32.1 |
| Qwen2.5-VL | 14.8 | 10.9 | 37.2 |

📚 Citation

```bibtex
@article{jia2025osworldmcp,
  title={OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents},
  author={Jia, Hongrui and Liao, Jitong and Zhang, Xi and Xu, Haiyang and Xie, Tianbao and Jiang, Chaoya and Yan, Ming and Liu, Si and Ye, Wei and Huang, Fei},
  year={2025},
  journal={arXiv preprint arXiv:2510.24563}
}
```
