Commit 28ec4a3

zhangyingerjellytuhahaha
authored and committed
Open source deepplanning benchmark, leaderboard, and qwen agent document
1 parent 9fe749d commit 28ec4a3

137 files changed

Lines changed: 31438 additions & 623 deletions

.github/workflows/deploy-docs.yml

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+name: Deploy to GitHub Pages
+
+on:
+  push:
+    branches:
+      - main  # or the name of your main branch
+    paths:
+      - 'qwen-agent-docs/website/**'
+  workflow_dispatch:  # allow manual triggering
+
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+# Prevent concurrent deployments
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        working-directory: qwen-agent-docs/website
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '18'
+          cache: 'npm'
+          cache-dependency-path: 'qwen-agent-docs/website/package-lock.json'
+
+      - name: Install dependencies
+        run: npm ci
+
+      - name: Build website
+        run: npm run build
+        env:
+          NODE_ENV: production
+
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: qwen-agent-docs/website/out
+
+  deploy:
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    runs-on: ubuntu-latest
+    needs: build
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
+

.gitignore

Lines changed: 21 additions & 0 deletions
@@ -43,3 +43,24 @@ test/*
 tests/env.sh
 examples/data/*
 test.db
+
+benchmark/deepplanning/travelplanning/database/database_en/
+benchmark/deepplanning/travelplanning/database/database_zh/
+benchmark/deepplanning/travelplanning/.env
+__pycache__/
+benchmark/deepplanning/travelplanning/CUSTOM_AGENT.md
+benchmark/deepplanning/travelplanning/MODEL_CONFIG.md
+
+
+
+# Website (Next.js/Node.js)
+qwen-agent-docs/website/node_modules/
+qwen-agent-docs/website/package-lock.json
+qwen-agent-docs/website/.next/
+qwen-agent-docs/website/out/
+qwen-agent-docs/website/.env*
+qwen-agent-docs/website/.temp-source-repo/
+qwen-agent-docs/website/.source-docs/
+qwen-agent-docs/website/last-sync.json
+qwen-agent-docs/website/_pagefind/
+qwen-agent-docs/website/*.tsbuildinfo

README.md

Lines changed: 2 additions & 2 deletions
@@ -22,10 +22,10 @@ limitations under the License.
 <br>

 <p align="center">
-💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a>&nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://qwenlm.github.io/">Blog</a> &nbsp&nbsp | &nbsp&nbsp📖 <a href="https://qwen.readthedocs.io/">Documentation</a>
+💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a>&nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://qwenlm.github.io/">Blog</a> &nbsp&nbsp | &nbsp&nbsp📖 <a href="https://qwenlm.github.io/Qwen-Agent/en/">Documentation</a>

 <br>
-💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp&nbsp
+📊 <a href="https://qwenlm.github.io/Qwen-Agent/en/benchmarks/deepplanning/">Benchmark</a>&nbsp&nbsp | &nbsp&nbsp💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp&nbsp
 </p>

README_CN.md

Lines changed: 3 additions & 3 deletions
@@ -22,10 +22,10 @@ limitations under the License.
 <br>

 <p align="center">
-💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a>&nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://qwenlm.github.io/">Blog</a> &nbsp&nbsp | &nbsp&nbsp📖 <a href="https://qwen.readthedocs.io/">Documentation</a>
+💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a>&nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://qwenlm.github.io/">Blog</a> &nbsp&nbsp | &nbsp&nbsp📖 <a href="https://qwenlm.github.io/Qwen-Agent/en/">Documentation</a>

 <br>
-💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp&nbsp
+📊 <a href="https://qwenlm.github.io/Qwen-Agent/en/benchmarks/deepplanning/">Benchmark</a>&nbsp&nbsp | &nbsp&nbsp💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp&nbsp
 </p>


 Qwen-Agent is a development framework. Developers can build agent applications on top of it, making full use of the instruction following, tool use, planning, and memory capabilities of the Qwen (通义千问) models. The project also provides example applications such as a browser assistant, a code interpreter, and a custom assistant.
@@ -177,7 +177,7 @@ WebUI(bot).run() # bot is the agent defined in the above code, we do not repeat

 # FAQ
 ## How do I use the code interpreter tool?
-We provide a code interpreter implementation based on a local Docker container. You can enable the built-in `code interpreter` tool for your agent so that it can write code on its own for the scenario at hand, execute it safely in an isolated sandbox, and return the results.
+We provide a code interpreter implementation based on a local Docker container. You can enable the built-in `code interpreter` tool for your agent so that it can write code on its own for the scenario at hand, execute it safely in an isolated sandbox, and return the results.
 ⚠️ **Note**: Before using this tool, make sure Docker is installed and running on your local operating system. The time needed to build the container image for the first time depends on your network conditions. For Docker installation and configuration, see the [official documentation](https://docs.docker.com/desktop/)

 ## How do I use MCP?

benchmark/deepplanning/README.md

Lines changed: 253 additions & 0 deletions
@@ -0,0 +1,253 @@
# DeepPlanning Benchmark

A comprehensive benchmark for evaluating AI agents' planning capabilities across multiple domains.

## 📋 Overview

This benchmark evaluates AI agents on complex planning tasks across two domains:

- **Travel Planning**: Evaluates agents on travel itinerary planning tasks
- **Shopping Planning**: Evaluates agents on e-commerce shopping tasks

**Flexible Execution:**
- **Unified Run (Recommended)**: Run both domains together using the unified orchestrator. This documentation focuses on the unified workflow so that you can reproduce the experimental results reported in our paper.
- **Independent Run**: Each domain can also be run on its own. For domain-specific details, refer to the respective documentation:
  - [`travelplanning/readme.md`](travelplanning/readme.md) - Travel domain details
  - [`shoppingplanning/README.md`](shoppingplanning/README.md) - Shopping domain details

## 🚀 Quick Start

### Step 1: Install Dependencies

```bash
# Create and activate conda environment
conda create -n deepplanning python=3.10 -y
conda activate deepplanning
pip install -r requirements.txt
```

### Step 2: Download Data Files
Download the required data files from the [HuggingFace Dataset](https://huggingface.co/datasets/Qwen/DeepPlanning) and place them at the following paths in the project:

**Shopping Planning:**
- `shoppingplanning/database_zip/database_level1.tar.gz` - Level 1 shopping database
- `shoppingplanning/database_zip/database_level2.tar.gz` - Level 2 shopping database
- `shoppingplanning/database_zip/database_level3.tar.gz` - Level 3 shopping database

**Travel Planning:**
- `travelplanning/database/database_zh.zip` - Chinese database
- `travelplanning/database/database_en.zip` - English database

### Step 3: Extract Database Files

After downloading, extract the compressed databases:

```bash
# Extract shopping databases into shoppingplanning/
cd shoppingplanning/database_zip
tar -xzf database_level1.tar.gz -C ..
tar -xzf database_level2.tar.gz -C ..
tar -xzf database_level3.tar.gz -C ..
cd ../..

# Extract travel databases in place
cd travelplanning/database
unzip database_zh.zip  # Chinese database (flights, hotels, restaurants, attractions)
unzip database_en.zip  # English database
cd ../..
```
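
Where `tar` or `unzip` are not available, the same extraction can be approximated with Python's standard library. The following is a minimal sketch, assuming it is run against the benchmark root and that the archives sit at the paths listed in Step 2; the helper name `extract_databases` is hypothetical and not part of the repository.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_databases(root: Path) -> list[str]:
    """Extract whichever benchmark archives exist under root; return their file names."""
    extracted = []

    # Shopping archives unpack into shoppingplanning/, mirroring `tar -xzf ... -C ..`.
    shopping_zip_dir = root / "shoppingplanning" / "database_zip"
    for level in (1, 2, 3):
        archive = shopping_zip_dir / f"database_level{level}.tar.gz"
        if archive.is_file():
            with tarfile.open(archive, "r:gz") as tar:
                tar.extractall(shopping_zip_dir.parent)
            extracted.append(archive.name)

    # Travel archives unpack in place, mirroring `unzip` run inside travelplanning/database.
    travel_dir = root / "travelplanning" / "database"
    for lang in ("zh", "en"):
        archive = travel_dir / f"database_{lang}.zip"
        if archive.is_file():
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(travel_dir)
            extracted.append(archive.name)

    return extracted
```

Calling `extract_databases(Path("."))` from the benchmark root returns the names of the archives it found, so a missing download shows up as a shorter list rather than an error.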

### Step 4: Configure Models

Edit `models_config.json` in the project root to add your model configurations:

```json
{
  "models": {
    "qwen-plus": {
      "model_name": "qwen-plus",
      "model_type": "openai",
      "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
      "api_key_env": "DASHSCOPE_API_KEY",
      "temperature": 0.0
    },
    "gpt-4o-2024-11-20": {
      "model_name": "gpt-4o-2024-11-20",
      "model_type": "openai",
      "base_url": "https://api.openai.com/v1",
      "api_key_env": "OPENAI_API_KEY",
      "temperature": 0.0
    }
  }
}
```
**Important Note about `qwen-plus`:**
- The `qwen-plus` configuration is **required** because the travel domain uses it by default in the conversion stage (`travelplanning/evaluation/convert_report.py`) to parse and format agent-generated travel plans.
- To use a different model for conversion, modify the `conversion_model` variable in `travelplanning/evaluation/convert_report.py`.
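
The `api_key_env` field names an environment variable rather than holding the key itself. The sketch below illustrates that indirection under the assumption that a loader looks the variable up at run time; how the benchmark actually reads `models_config.json` is not shown in this README, and `resolve_model` is a hypothetical helper.

```python
import json
import os


def resolve_model(config: dict, name: str, env=os.environ) -> dict:
    """Return the named model entry plus its API key read from the environment."""
    try:
        entry = config["models"][name]
    except KeyError:
        raise KeyError(f"model {name!r} is not defined under 'models'") from None
    key_var = entry["api_key_env"]
    api_key = env.get(key_var)
    if not api_key:
        raise RuntimeError(f"environment variable {key_var} is not set")
    # Return a copy so the loaded config never holds the secret itself.
    return {**entry, "api_key": api_key}


# Illustrative config with the same shape as the README example.
example = json.loads("""
{
  "models": {
    "qwen-plus": {
      "model_name": "qwen-plus",
      "model_type": "openai",
      "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
      "api_key_env": "DASHSCOPE_API_KEY",
      "temperature": 0.0
    }
  }
}
""")

# A placeholder environment is passed in so the sketch runs without real credentials.
resolved = resolve_model(example, "qwen-plus", env={"DASHSCOPE_API_KEY": "sk-placeholder"})
print(resolved["base_url"])
```

Keeping only the variable name in the JSON file means the configuration can be committed safely, while the key stays in `.env` or the shell environment.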

### Step 5: Set API Keys

Create a `.env` file in the project root, using `.env.example` as a template:

```bash
cp .env.example .env
# Edit .env and add your API keys
```

### Step 6: Run the Unified Benchmark

Edit `run_all.sh` to configure your run:

```bash
# Configuration in run_all.sh
DOMAINS="travel shopping"            # Domains to run
BENCHMARK_MODEL="qwen-plus"          # Default model for all domains

# Shopping domain configuration
SHOPPING_MODEL="${BENCHMARK_MODEL}"  # Model(s) for shopping
SHOPPING_LEVELS="1 2 3"              # Levels to run
SHOPPING_WORKERS=50                  # Parallel workers
SHOPPING_MAX_LLM_CALLS=400           # Max LLM calls per sample

# Travel domain configuration
TRAVEL_MODEL="${BENCHMARK_MODEL}"    # Model(s) for travel
TRAVEL_LANGUAGE=""                   # Language (zh/en/empty for both)
TRAVEL_WORKERS=50                    # Parallel workers
TRAVEL_MAX_LLM_CALLS=400             # Max LLM calls per sample
TRAVEL_START_FROM="inference"        # Start point: inference, conversion, evaluation
TRAVEL_OUTPUT_DIR=""                 # Output directory (optional)
TRAVEL_VERBOSE="false"               # Verbose output
TRAVEL_DEBUG="false"                 # Debug mode
```

Then run:

```bash
bash run_all.sh
```

**What it does:**
1. Runs each model on all specified domains sequentially
2. For the **Travel domain**: runs both language versions (Chinese and English) when `TRAVEL_LANGUAGE` is left empty
3. For the **Shopping domain**: runs the difficulty levels listed in `SHOPPING_LEVELS` (1 → 2 → 3 in the configuration above)
4. Generates per-domain statistics in domain-specific result folders
5. Aggregates results across domains and calculates overall scores
6. Saves aggregated results in `aggregated_results/{model}_aggregated.json`

## 📊 Understanding Results

### Result File Locations

**Travel Domain:**
- Evaluation results: `travelplanning/results/{model}_{language}/evaluation/evaluation_summary.json`
- Converted plans: `travelplanning/results/{model}_{language}/converted_plans/`
- Trajectories: `travelplanning/results/{model}_{language}/trajectories/`

**Shopping Domain:**
- Per-level results: `shoppingplanning/result_report/summary_report_{model}_{level}_{timestamp}.json`
- Overall statistics: `shoppingplanning/result_report/{model}_statistics.json`
- Inference outputs: `shoppingplanning/database_infered/`

**Aggregated Results (Both Domains):**
- Cross-domain aggregation: `aggregated_results/{model}_aggregated.json`

**For detailed domain-specific metrics and result interpretation:**
- **Shopping Domain**: See [Shopping Results Documentation](shoppingplanning/README.md#step-7-view-results) for a detailed explanation of `match_rate`, `weighted_average_case_score`, and per-level statistics.
- **Travel Domain**: See [Travel Results Documentation](travelplanning/readme.md#step-7-view-results) for a detailed explanation of `composite_score`, `case_acc`, `commonsense_score`, and `personalized_score`.

### Aggregated Results Format

After running all benchmarks, view the aggregated results:

```bash
# Replace qwen-plus with the name of the model you ran
cat aggregated_results/qwen-plus_aggregated.json
```

**Example Output:**
```json
{
  "model_name": "qwen-plus",
  "aggregation_time": "2026-01-05T15:30:00.000000",
  "domains": {
    "shopping": {
      "total_cases": 120,
      "successful_cases": 17,
      "successful_rate": 0.1417,
      "match_rate": 0.6209,
      "weighted_average_case_score": 0.1417,
      "valid": true,
      "levels_completed": [1, 2, 3]
    },
    "travel": {
      "total_cases": 240,
      "successful_cases": 238,
      "successful_rate": 0.9917,
      "composite_score": 0.2813,
      "case_acc": 0.0,
      "commonsense_score": 0.4292,
      "personalized_score": 0.1333,
      "valid": true,
      "languages_completed": ["zh", "en"],
      "language_details": {
        "zh": {
          "composite_score": 0.2813,
          "case_acc": 0.0,
          "commonsense_score": 0.4292,
          "personalized_score": 0.1333
        },
        "en": {
          "composite_score": 0.2850,
          "case_acc": 0.0,
          "commonsense_score": 0.4300,
          "personalized_score": 0.1350
        }
      }
    }
  },
  "overall": {
    "total_cases": 360,
    "successful_cases": 255,
    "successful_rate": 0.5667,
    "valid": true,
    "domains_completed": ["shopping", "travel"],
    "num_domains": 2,
    "shopping_match_rate": 0.6209,
    "shopping_weighted_average_case_score": 0.1417,
    "travel_composite_score": 0.2813,
    "travel_case_acc": 0.0,
    "travel_commonsense_score": 0.4292,
    "travel_personalized_score": 0.1333,
    "avg_acc": 0.0708
  }
}
```
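
When reading the `overall` block, note that in this example the overall `successful_rate` of 0.5667 equals the unweighted mean of the two per-domain rates, not the pooled ratio 255/360 (about 0.7083). The sketch below recomputes both from the counts in the example; it is an observation about these sample numbers, and whether the aggregation script always averages this way is an assumption. The helper `domain_rate` is hypothetical.

```python
def domain_rate(domain: dict) -> float:
    """Recompute successful_rate from raw counts, rounded to four decimals."""
    return round(domain["successful_cases"] / domain["total_cases"], 4)


# Counts copied from the example output.
domains = {
    "shopping": {"total_cases": 120, "successful_cases": 17},
    "travel": {"total_cases": 240, "successful_cases": 238},
}

rates = {name: domain_rate(d) for name, d in domains.items()}

# Mean of the per-domain rates: each domain counts equally.
macro = round(sum(rates.values()) / len(rates), 4)

# Pooled rate: every case counts equally, so the larger domain dominates.
pooled = round(
    sum(d["successful_cases"] for d in domains.values())
    / sum(d["total_cases"] for d in domains.values()),
    4,
)

print(rates)
print(macro, pooled)
```

The distinction matters when domains differ in size: the travel domain has twice as many cases as shopping here, so the pooled figure sits much closer to the travel rate.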

**Key Metrics Overview:**

**Shopping Domain:**
- **`match_rate`** ⭐: Percentage of expected items correctly matched (main paper metric)
- **`weighted_average_case_score`** ⭐: Average case completion score (main paper metric)

**Travel Domain:**
- **`composite_score`** ⭐: Weighted combination of commonsense and personalized scores (main paper metric)
- **`case_acc`** ⭐: Percentage of cases passing all constraints (main paper metric)
- `commonsense_score`: Score for commonsense constraint satisfaction
- `personalized_score`: Score for personalized requirement satisfaction

**Cross-Domain:**
- **`avg_acc`** ⭐: Average of shopping `weighted_average_case_score` and travel `case_acc` - **Primary cross-domain metric**
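
As a worked instance of the cross-domain metric, the sketch below averages the two per-domain values from the example output. It assumes a plain unweighted mean of the two numbers, as the description above states; the function name `avg_acc` is only illustrative and is not taken from the repository.

```python
def avg_acc(shopping_case_score: float, travel_case_acc: float) -> float:
    """Unweighted mean of the shopping case score and the travel case accuracy."""
    return (shopping_case_score + travel_case_acc) / 2


# Values from the example output: shopping weighted_average_case_score and travel case_acc.
value = avg_acc(0.1417, 0.0)
print(value)
```

With these inputs the mean is 0.07085, which the example output reports as `0.0708`.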

---

## 📄 License

Please refer to individual domain directories for license information.
