| 1 | +# DeepPlanning Benchmark |
| 2 | + |
| 3 | +A comprehensive benchmark for evaluating AI agents' planning capabilities across multiple domains. |
| 4 | + |
| 5 | +## 📋 Overview |
| 6 | + |
| 7 | +This benchmark evaluates AI agents on complex planning tasks across two domains: |
| 8 | + |
| 9 | +- **Travel Planning**: Evaluate agents on travel itinerary planning tasks |
| 10 | +- **Shopping Planning**: Evaluate agents on e-commerce shopping tasks |
| 11 | + |
| 12 | +**Flexible Execution:** |
| 13 | +- **Unified Run (Recommended)**: You can run both domains together using the unified orchestrator. This documentation focuses on this unified workflow to help you reproduce the experimental results reported in our paper. |
| 14 | +- **Independent Run**: Each domain can also be run independently. For domain-specific details, please refer to their respective documentation: |
| 15 | + - [`travelplanning/readme.md`](travelplanning/readme.md) - Travel domain details |
| 16 | + - [`shoppingplanning/README.md`](shoppingplanning/README.md) - Shopping domain details |
| 17 | + |
| 18 | +## 🚀 Quick Start |
| 19 | + |
| 20 | +### Step 1: Install Dependencies |
| 21 | + |
| 22 | +```bash |
| 23 | +# Create and activate conda environment |
| 24 | +conda create -n deepplanning python=3.10 -y |
| 25 | +conda activate deepplanning |
| 26 | +pip install -r requirements.txt |
| 27 | +``` |
| 28 | + |
| 29 | +### Step 2: Download Data Files |
| 30 | +Download the required data files from the [Hugging Face dataset](https://huggingface.co/datasets/Qwen/DeepPlanning) and place them at the following paths in the project: |
| 31 | + |
| 32 | +**Shopping Planning:** |
| 33 | +- `shoppingplanning/database_zip/database_level1.tar.gz` - Level 1 shopping database |
| 34 | +- `shoppingplanning/database_zip/database_level2.tar.gz` - Level 2 shopping database |
| 35 | +- `shoppingplanning/database_zip/database_level3.tar.gz` - Level 3 shopping database |
| 36 | + |
| 37 | +**Travel Planning:** |
| 38 | +- `travelplanning/database/database_zh.zip` - Chinese database |
| 39 | +- `travelplanning/database/database_en.zip` - English database |
| 40 | + |
| 41 | + |
| 42 | +- In `shoppingplanning/database_zip/`: put `database_level1.tar.gz`, `database_level2.tar.gz`, and `database_level3.tar.gz`. |
| 43 | +- In `travelplanning/database/`: put `database_zh.zip` and `database_en.zip`. |
| 44 | + |
| 45 | + |
| 46 | +### Step 3: Extract Database Files |
| 47 | + |
| 48 | +After downloading, extract the compressed databases: |
| 49 | + |
| 50 | +```bash |
| 51 | +# Extract shopping databases |
| 52 | +cd shoppingplanning/database_zip |
| 53 | +tar -xzf database_level1.tar.gz -C .. |
| 54 | +tar -xzf database_level2.tar.gz -C .. |
| 55 | +tar -xzf database_level3.tar.gz -C .. |
| 56 | +cd ../.. |
| 57 | + |
| 58 | +# Extract travel databases |
| 59 | +cd travelplanning/database |
| 60 | +unzip database_zh.zip # Chinese database (flights, hotels, restaurants, attractions) |
| 61 | +unzip database_en.zip # English database |
| 62 | +cd ../.. |
| 63 | +``` |
| 64 | + |
| 65 | +### Step 4: Configure Models |
| 66 | + |
| 67 | +Edit `models_config.json` in the project root to add your model configurations: |
| 68 | + |
| 69 | +```json |
| 70 | +{ |
| 71 | + "models": { |
| 72 | + "qwen-plus": { |
| 73 | + "model_name": "qwen-plus", |
| 74 | + "model_type": "openai", |
| 75 | + "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1", |
| 76 | + "api_key_env": "DASHSCOPE_API_KEY", |
| 77 | + "temperature": 0.0 |
| 78 | + }, |
| 79 | + "gpt-4o-2024-11-20": { |
| 80 | + "model_name": "gpt-4o-2024-11-20", |
| 81 | + "model_type": "openai", |
| 82 | + "base_url": "https://api.openai.com/v1", |
| 83 | + "api_key_env": "OPENAI_API_KEY", |
| 84 | + "temperature": 0.0 |
| 85 | + } |
| 86 | + } |
| 87 | +} |
| 88 | +``` |
| 89 | +**Important Note about `qwen-plus`:** |
| 90 | +- The `qwen-plus` configuration is **required** because the travel domain's conversion stage (`travelplanning/evaluation/convert_report.py`) uses it by default to parse and format agent-generated travel plans. |
| 91 | +- To use a different model for conversion, change the `conversion_model` variable in `travelplanning/evaluation/convert_report.py`. |
| 92 | + |
| 93 | +### Step 5: Set API Keys |
| 94 | + |
| 95 | +Create a `.env` file in the project root (use `.env.example` as a template): |
| 96 | + |
| 97 | +```bash |
| 98 | +cp .env.example .env |
| 99 | +# Edit .env and add your API keys |
| 100 | +``` |
| 101 | + |
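Before launching a long run, it can help to confirm that the configuration from Steps 4 and 5 is consistent. The following is a minimal sketch, not part of the benchmark: the function name `find_config_problems` is hypothetical, and it assumes only the `models_config.json` layout shown in Step 4 (a top-level `models` object whose entries carry an `api_key_env` field). It checks a mapping of environment variables that you pass in, so values that exist only in `.env` must be exported or loaded first.

```python
from typing import Mapping


def find_config_problems(config: dict, env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems found in a models_config.json-style dict.

    Hypothetical helper: it assumes the layout shown in Step 4 and is not
    part of the benchmark code.
    """
    problems = []
    models = config.get("models", {})

    # The travel conversion stage uses qwen-plus by default (see Step 4).
    if "qwen-plus" not in models:
        problems.append("missing required model entry: qwen-plus")

    for name, entry in models.items():
        key_var = entry.get("api_key_env")
        if not key_var:
            problems.append(f"{name}: no api_key_env field")
        elif not env.get(key_var):
            # An unset or empty variable would leave the model without a key.
            problems.append(f"{name}: environment variable {key_var} is not set")

    return problems
```

A typical use would be to load `models_config.json` with `json.load`, call `find_config_problems(config, os.environ)`, and print each returned message before running `run_all.sh`.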
| 102 | +### Step 6: Run the Unified Benchmark |
| 103 | + |
| 104 | +Edit `run_all.sh` to configure your run: |
| 105 | + |
| 106 | +```bash |
| 107 | +# Configuration in run_all.sh |
| 108 | +DOMAINS="travel shopping" # Domains to run |
| 109 | +BENCHMARK_MODEL="qwen-plus" # Default model for all domains |
| 110 | + |
| 111 | +# Shopping domain configuration |
| 112 | +SHOPPING_MODEL="${BENCHMARK_MODEL}" # Model(s) for shopping |
| 113 | +SHOPPING_LEVELS="1 2 3" # Levels to run |
| 114 | +SHOPPING_WORKERS=50 # Parallel workers |
| 115 | +SHOPPING_MAX_LLM_CALLS=400 # Max LLM calls per sample |
| 116 | + |
| 117 | +# Travel domain configuration |
| 118 | +TRAVEL_MODEL="${BENCHMARK_MODEL}" # Model(s) for travel |
| 119 | +TRAVEL_LANGUAGE="" # Language (zh/en/empty for both) |
| 120 | +TRAVEL_WORKERS=50 # Parallel workers |
| 121 | +TRAVEL_MAX_LLM_CALLS=400 # Max LLM calls per sample |
| 122 | +TRAVEL_START_FROM="inference" # Start point: inference, conversion, evaluation |
| 123 | +TRAVEL_OUTPUT_DIR="" # Output directory (optional) |
| 124 | +TRAVEL_VERBOSE="false" # Verbose output |
| 125 | +TRAVEL_DEBUG="false" # Debug mode |
| 126 | +``` |
| 127 | + |
| 128 | +Then run: |
| 129 | + |
| 130 | +```bash |
| 131 | +bash run_all.sh |
| 132 | +``` |
| 133 | + |
| 134 | +**What it does:** |
| 135 | +1. Runs each model on all specified domains sequentially |
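| 136 | +2. For the **Travel domain**: runs both language versions (Chinese and English) when `TRAVEL_LANGUAGE` is empty, or only the specified one otherwise |
| 137 | +3. For the **Shopping domain**: runs the difficulty levels listed in `SHOPPING_LEVELS` in order (1 → 2 → 3 in the configuration above) |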
| 138 | +4. Generates per-domain statistics in domain-specific result folders |
| 139 | +5. Aggregates results across domains and calculates overall scores |
| 140 | +6. Saves aggregated results in `aggregated_results/{model}_aggregated.json` |
| 141 | + |
| 142 | +## 📊 Understanding Results |
| 143 | + |
| 144 | +### Result File Locations |
| 145 | + |
| 146 | +**Travel Domain:** |
| 147 | +- Evaluation results: `travelplanning/results/{model}_{language}/evaluation/evaluation_summary.json` |
| 148 | +- Converted plans: `travelplanning/results/{model}_{language}/converted_plans/` |
| 149 | +- Trajectories: `travelplanning/results/{model}_{language}/trajectories/` |
| 150 | + |
| 151 | +**Shopping Domain:** |
| 152 | +- Per-level results: `shoppingplanning/result_report/summary_report_{model}_{level}_{timestamp}.json` |
| 153 | +- Overall statistics: `shoppingplanning/result_report/{model}_statistics.json` |
| 154 | +- Inference outputs: `shoppingplanning/database_infered/` |
| 155 | + |
| 156 | + |
| 157 | + |
| 158 | +**Aggregated Results (Both Domains):** |
| 159 | +- Cross-domain aggregation: `aggregated_results/{model}_aggregated.json` |
| 160 | + |
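The path templates above can be turned into a quick check that a run produced its summary files. This is a rough sketch under stated assumptions: `expected_result_files` is a hypothetical helper, `{model}` is assumed to be the model key used in `models_config.json`, and the timestamped per-level shopping reports are left out because their timestamps are not known in advance.

```python
from pathlib import Path


def expected_result_files(model: str, languages=("zh", "en")) -> dict:
    """Map a short label to each fixed-name result file listed above.

    Hypothetical helper: paths are relative to the project root and follow
    the templates in this section.
    """
    files = {
        f"travel_{lang}": Path("travelplanning/results")
        / f"{model}_{lang}"
        / "evaluation"
        / "evaluation_summary.json"
        for lang in languages
    }
    files["shopping_statistics"] = (
        Path("shoppingplanning/result_report") / f"{model}_statistics.json"
    )
    files["aggregated"] = Path("aggregated_results") / f"{model}_aggregated.json"
    return files


def missing_result_files(model: str, root: Path = Path(".")) -> list:
    """Return the labels whose files do not exist under the given project root."""
    return [
        label
        for label, rel_path in expected_result_files(model).items()
        if not (root / rel_path).exists()
    ]
```

Calling `missing_result_files("qwen-plus")` from the project root after a run returns an empty list when every fixed-name summary file is present.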
| 161 | +**For detailed domain-specific metrics and result interpretation:** |
| 162 | +- **Shopping Domain**: See [Shopping Results Documentation](shoppingplanning/README.md#step-7-view-results) for detailed explanation of match_rate, weighted_average_case_score, and per-level statistics |
| 163 | +- **Travel Domain**: See [Travel Results Documentation](travelplanning/readme.md#step-7-view-results) for detailed explanation of composite_score, case_acc, commonsense_score, and personalized_score |
| 164 | + |
| 165 | +### Aggregated Results Format |
| 166 | + |
| 167 | +After running all benchmarks, view the aggregated results: |
| 168 | + |
| 169 | +```bash |
| 170 | +cat aggregated_results/{MODEL}_aggregated.json |
| 171 | +``` |
| 172 | + |
| 173 | +**Example Output:** |
| 174 | +```json |
| 175 | +{ |
| 176 | + "model_name": "qwen-plus", |
| 177 | + "aggregation_time": "2026-01-05T15:30:00.000000", |
| 178 | + "domains": { |
| 179 | + "shopping": { |
| 180 | + "total_cases": 120, |
| 181 | + "successful_cases": 17, |
| 182 | + "successful_rate": 0.1417, |
| 183 | + "match_rate": 0.6209, |
| 184 | + "weighted_average_case_score": 0.1417, |
| 185 | + "valid": true, |
| 186 | + "levels_completed": [1, 2, 3] |
| 187 | + }, |
| 188 | + "travel": { |
| 189 | + "total_cases": 240, |
| 190 | + "successful_cases": 238, |
| 191 | + "successful_rate": 0.9917, |
| 192 | + "composite_score": 0.2813, |
| 193 | + "case_acc": 0.0, |
| 194 | + "commonsense_score": 0.4292, |
| 195 | + "personalized_score": 0.1333, |
| 196 | + "valid": true, |
| 197 | + "languages_completed": ["zh", "en"], |
| 198 | + "language_details": { |
| 199 | + "zh": { |
| 200 | + "composite_score": 0.2813, |
| 201 | + "case_acc": 0.0, |
| 202 | + "commonsense_score": 0.4292, |
| 203 | + "personalized_score": 0.1333 |
| 204 | + }, |
| 205 | + "en": { |
| 206 | + "composite_score": 0.2850, |
| 207 | + "case_acc": 0.0, |
| 208 | + "commonsense_score": 0.4300, |
| 209 | + "personalized_score": 0.1350 |
| 210 | + } |
| 211 | + } |
| 212 | + } |
| 213 | + }, |
| 214 | + "overall": { |
| 215 | + "total_cases": 360, |
| 216 | + "successful_cases": 255, |
| 217 | + "successful_rate": 0.5667, |
| 218 | + "valid": true, |
| 219 | + "domains_completed": ["shopping", "travel"], |
| 220 | + "num_domains": 2, |
| 221 | + "shopping_match_rate": 0.6209, |
| 222 | + "shopping_weighted_average_case_score": 0.1417, |
| 223 | + "travel_composite_score": 0.2813, |
| 224 | + "travel_case_acc": 0.0, |
| 225 | + "travel_commonsense_score": 0.4292, |
| 226 | + "travel_personalized_score": 0.1333, |
| 227 | + "avg_acc": 0.0708 |
| 228 | + } |
| 229 | +} |
| 230 | +``` |
| 231 | + |
| 232 | +**Key Metrics Overview:** |
| 233 | + |
| 234 | +**Shopping Domain:** |
| 235 | +- **`match_rate`** ⭐: Percentage of expected items correctly matched (main paper metric) |
| 236 | +- **`weighted_average_case_score`** ⭐: Average case completion score (main paper metric) |
| 237 | + |
| 238 | +**Travel Domain:** |
| 239 | +- **`composite_score`** ⭐: Weighted combination of commonsense and personalized scores (main paper metric) |
| 240 | +- **`case_acc`** ⭐: Percentage of cases passing all constraints (main paper metric) |
| 241 | +- `commonsense_score`: Score for commonsense constraint satisfaction |
| 242 | +- `personalized_score`: Score for personalized requirement satisfaction |
| 243 | + |
| 244 | +**Cross-Domain:** |
| 245 | +- **`avg_acc`** ⭐: Average of shopping `weighted_average_case_score` and travel `case_acc` - **Primary cross-domain metric** |
| 246 | + |
| 247 | +--- |
| 248 | + |
| 249 | + |
| 250 | +## 📄 License |
| 251 | + |
| 252 | +Please refer to individual domain directories for license information. |
| 253 | + |