Video procedure planning, i.e., planning a sequence of action steps given the video frames of the start and goal states, is an essential ability for embodied AI. Recent works utilize Large Language Models (LLMs) to generate enriched action step description texts to guide action step decoding. Despite introducing LLMs, these methods decode the action steps into a closed set of one-hot vectors, limiting the model's capability to generalize to new steps or tasks. Additionally, fixed action step descriptions based on world-level commonsense may contain noise for specific samples of visual states. In this paper, we propose PlanLLM, a cross-modal joint learning framework with LLMs for video procedure planning. We propose an LLM-Enhanced Planning module that fully exploits the generalization ability of LLMs to produce free-form planning output and to enhance action step decoding. We also propose a Mutual Information Maximization module to connect the world-level commonsense of step descriptions with the sample-specific information of visual states, enabling LLMs to employ their reasoning ability to generate step sequences. With the assistance of LLMs, our method can handle both closed-set and open-vocabulary procedure planning tasks. PlanLLM achieves superior performance on three benchmarks, demonstrating the effectiveness of our designs.
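The Mutual Information Maximization idea can be pictured as a contrastive (InfoNCE-style) objective between step-description embeddings and visual-state embeddings. The sketch below is only an illustration under that assumption, not the paper's exact formulation, and all tensor names are placeholders.

```python
# Illustrative InfoNCE-style sketch of mutual information maximization between
# step-description embeddings and visual-state embeddings (not the paper's exact loss).
import torch
import torch.nn.functional as F

def info_nce(step_desc_emb: torch.Tensor, visual_state_emb: torch.Tensor, tau: float = 0.07):
    """Both inputs: (batch, dim), where row i of each tensor forms a matched pair."""
    a = F.normalize(step_desc_emb, dim=-1)
    b = F.normalize(visual_state_emb, dim=-1)
    logits = a @ b.t() / tau                              # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```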
Either create the environment manually:
git clone https://github.com/idejie/PlanLLM.git
cd PlanLLM
export PYTHONPATH=<YOUR_PROJECT_PATH>
conda create -n PlanLLM python==3.12
conda activate PlanLLM
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
conda install tensorboardX pandas ftfy regex
pip install loguru timm peft opencv_python fvcore transformers imageio wandb sentencepiece einops scipy
MAX_JOBS=16 pip install flash-attn --no-build-isolation  # building flash-attn from source can take a long time
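Optionally, run a quick sanity check on the environment (this snippet is not part of the repo):

```python
# Quick environment check (not part of the repo): confirm the GPU is visible
# and that flash-attn was built and imports correctly.
import torch
import flash_attn

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```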
Note: (1) put the downloaded pretrained weights into the `pretrained` directory; (2) you can use the mirror https://hf-mirror.com/ for Hugging Face downloads if you are in China.
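For example, one way to route Hugging Face downloads through the mirror from Python (the model id and target directory below are placeholders, not necessarily the weights PlanLLM uses):

```python
# Placeholder example: fetch pretrained weights via hf-mirror.com into ./pretrained.
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmsys/vicuna-7b-v1.5",          # placeholder model id
    local_dir="pretrained/vicuna-7b-v1.5",   # goes into the `pretrained` directory
)
```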
Download pre-extracted HowTo100M features
# CrossTask
bash scripts/dataset/download_crosstask.sh
# COIN
bash scripts/dataset/download_coin.sh
# NIV
bash scripts/dataset/download_niv.sh
Alternatively, you can download the features from SCHEMA (both the `data` and `dataset` directories).
The descriptions of actions and states are already provided in this repo. The raw descriptions are saved as .json files in the `data` folder. The state and action description features extracted by the CLIP language encoder are saved in the `data/state_description_features` and `data/action_description_features` folders, respectively.
If you want to customize the prompts and generate new descriptions, please follow the steps below:
- Modify line 9 of `tools/generate_descriptors.py`: set the variable `openai_key` to your OpenAI API key.
- Modify the prompt starting from line 25 of `tools/generate_descriptors.py`.
- Install the OpenAI package and generate the description files:

  pip install openai
  python tools/generate_descriptors.py --dataset [DATASET]

  Note: replace `[DATASET]` with a specific dataset: `crosstask`, `coin`, or `niv`. (Same for the following steps.)
- Extract the description features (see the sketch after this list):

  python tools/extract_description_feature.py --dataset [DATASET]
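For reference, the feature extraction step amounts to encoding the description texts with a CLIP text encoder. The rough sketch below is not the repo's exact script; the CLIP variant, example texts, and output path are assumptions.

```python
# Rough sketch of CLIP text-feature extraction for descriptions (details are assumptions).
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch16").eval()

descriptions = ["pour the milk into the bowl", "whisk the eggs"]  # placeholder description texts
with torch.no_grad():
    inputs = tokenizer(descriptions, padding=True, truncation=True, return_tensors="pt")
    features = text_encoder(**inputs).text_embeds   # (num_descriptions, embed_dim)

torch.save(features, "data/action_description_features/example.pt")  # placeholder output path
```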
- Modify the variables `dataset` (one of `niv`, `crosstask`, `coin`) and `max_traj_len` (time horizon T = 3 or 4) in `scripts/planllm_vicuna/config_qformer.py`; see the example excerpt after this list.
- Run the training script:

  bash scripts/planllm_vicuna/run_qformer.py
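For reference, a hypothetical excerpt of `scripts/planllm_vicuna/config_qformer.py` after editing (only `dataset` and `max_traj_len` are named in this README; the surrounding layout and values are assumptions):

```python
# Hypothetical excerpt of scripts/planllm_vicuna/config_qformer.py after editing.
dataset = "crosstask"   # one of: "niv", "crosstask", "coin"
max_traj_len = 3        # time horizon T (3 or 4)
```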
- Modify the variables `dataset` (one of `niv`, `crosstask`, `coin`) and `max_traj_len` (time horizon T = 3 or 4) in `scripts/planllm_vicuna/config_llm.py`.
- Modify the variable `vit_blip_model_path` in `scripts/planllm_vicuna/config_llm.py` to your checkpoint path; see the example excerpt after the checkpoints table below.
- Run the training script:

  bash scripts/planllm_vicuna/run_llm.py

| Dataset | Success Rate (%) | Checkpoints |
|---|---|---|
| niv (T=3) | 30.63 | BaiduNetdisk |
| niv (T=4) | 24.81 | BaiduNetdisk |
| crosstask (T=3) | 40.01 | BaiduNetdisk |
| crosstask (T=4) | 25.61 | BaiduNetdisk |
| coin (T=3) | 33.22 | BaiduNetdisk |
| coin (T=4) | 25.13 | BaiduNetdisk |
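For reference, a hypothetical excerpt of `scripts/planllm_vicuna/config_llm.py` after the edits above (only `dataset`, `max_traj_len`, and `vit_blip_model_path` are named in this README; the layout, values, and checkpoint path are assumptions):

```python
# Hypothetical excerpt of scripts/planllm_vicuna/config_llm.py after editing.
dataset = "crosstask"          # one of: "niv", "crosstask", "coin"
max_traj_len = 3               # time horizon T (3 or 4)
vit_blip_model_path = "path/to/your/stage1_qformer_checkpoint.pth"  # placeholder checkpoint path
```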
Please consider citing our paper if it helps your research.
@inproceedings{PlanLLM,
  title={PlanLLM: Video Procedure Planning with Refinable Large Language Models},
  author={Yang, Dejie and Zhao, Zijing and Liu, Yang},
  year={2025},
  booktitle={AAAI},
}