WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

arXiv · Home Page · YouTube

Dujun Nie1,*, Xianda Guo2,*, Yiqun Duan3, Ruijun Zhang1, Long Chen1,4,5,†

1 Institute of Automation, Chinese Academy of Sciences; 2 School of Computer Science, Wuhan University; 3 School of Computer Science, University of Technology Sydney; 4 IAIR, Xi'an Jiaotong University; 5 Waytous
Email: [email protected], [email protected], [email protected]

This repository is the official implementation of WMNav, a novel World Model-based Object Goal Navigation framework powered by Vision-Language Models.

(Figure: Overview of our navigation framework)

This project is based on VLMnav. Our method comprises three components:
  1. Novel architecture - Introducing a new direction for object goal navigation using a world model consisting of VLMs and novel modules.
  2. Memory strategy - Designing an innovative memory strategy over predicted environmental states: an online Curiosity Value Map quantitatively stores the likelihood of the target's presence in each scenario predicted by the world model (see the sketch after this list).
  3. Efficiency - Proposing a subtask decomposition with feedback and a two-stage action proposer strategy to enhance the reliability of VLM reasoning outcomes and improve exploration efficiency.
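
As a purely illustrative sketch of the memory strategy above (not the repository's actual implementation; the class name, grid resolution, and blending rule are all assumptions), the Curiosity Value Map can be pictured as a top-down grid whose cells cache the world model's predicted likelihood that the target is present:

# Illustrative sketch only -- not WMNav's code. A minimal grid-based curiosity map
# that blends new world-model predictions with previously stored values.
import numpy as np

class CuriosityValueMap:
    def __init__(self, size=(200, 200), init_value=0.5):
        # Each cell holds the current estimated likelihood that the goal object is there.
        self.values = np.full(size, init_value, dtype=np.float32)

    def update(self, cells, predicted_likelihood, blend=0.7):
        # Blend the newest prediction with the stored value so that earlier
        # predictions are attenuated rather than overwritten outright.
        for r, c in cells:
            self.values[r, c] = blend * predicted_likelihood + (1.0 - blend) * self.values[r, c]

    def most_promising(self):
        # Grid cell currently most likely to contain the target.
        return np.unravel_index(np.argmax(self.values), self.values.shape)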

🔥 News

  • Mar. 14th, 2025: The code of WMNav is available! ☕️
  • Mar. 4th, 2025: We released our paper on arXiv.


🚀 Get Started

⚙ Installation and Setup

  1. Clone this repo.
    git clone https://github.com/B0B8K1ng/WMNavigation WMNav
    cd WMNav
    
  2. Create the conda environment and install all dependencies.
    conda create -n wmnav python=3.9 cmake=3.14.0
    conda activate wmnav
    conda install habitat-sim=0.3.1 withbullet headless -c conda-forge -c aihabitat
    
    pip install -e .
    
    pip install -r requirements.txt
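
After these steps, a quick optional sanity check is to confirm that the simulator imports inside the new environment (this one-liner is just a convenience, not part of the official setup):

python -c "import habitat_sim; print('habitat-sim imported successfully')"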
    

🛢 Prepare Dataset

This project is based on the Habitat simulator; the HM3D and MP3D datasets are available here. Our code requires all of the above data to be placed in a data folder with the layout shown below. Move the downloaded HM3D v0.1, HM3D v0.2, and MP3D folders into the following structure:

├── <DATASET_ROOT>
│  ├── hm3d_v0.1/
│  │  ├── val/
│  │  │  ├── 00800-TEEsavR23oF/
│  │  │  │  ├── TEEsavR23oF.navmesh
│  │  │  │  ├── TEEsavR23oF.glb
│  │  ├── hm3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_hm3d_v0.1/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 4ok3usBNeis.json.gz
│  │  │  ├── val.json.gz
│  ├── hm3d_v0.2/
│  │  ├── val/
│  │  │  ├── 00800-TEEsavR23oF/
│  │  │  │  ├── TEEsavR23oF.basis.navmesh
│  │  │  │  ├── TEEsavR23oF.basis.glb
│  │  ├── hm3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_hm3d_v0.2/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 4ok3usBNeis.json.gz
│  │  │  ├── val.json.gz
│  ├── mp3d/
│  │  ├── 17DRP5sb8fy/
│  │  │  ├── 17DRP5sb8fy.glb
│  │  │  ├── 17DRP5sb8fy.house
│  │  │  ├── 17DRP5sb8fy.navmesh
│  │  │  ├── 17DRP5sb8fy_semantic.ply
│  │  ├── mp3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_mp3d/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 2azQ1b91cZZ.json.gz
│  │  │  ├── val.json.gz

The variable DATASET_ROOT can be set in the .env file.
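
For example, a minimal .env entry might look like the following (the path below is a placeholder for wherever you placed the data folder):

DATASET_ROOT=/path/to/data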

🚩 API Key

To use the Gemini VLMs, paste a base URL and an API key into the .env file as the variables GEMINI_BASE_URL and GEMINI_API_KEY. You can also try other VLMs by modifying api.py (which uses the OpenAI library).
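
As a generic illustration of that pattern (this is not the contents of api.py; the model name is only an example, and it assumes the .env variables have been loaded into the environment, e.g. via python-dotenv), an OpenAI-compatible client can be pointed at such an endpoint like this:

# Illustrative only -- a generic OpenAI-compatible call, not WMNav's api.py.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["GEMINI_BASE_URL"],  # endpoint from .env
    api_key=os.environ["GEMINI_API_KEY"],    # key from .env
)

response = client.chat.completions.create(
    model="gemini-1.5-flash",  # example model name; substitute the VLM you want to try
    messages=[{"role": "user", "content": "Describe the objects visible in this scene."}],
)
print(response.choices[0].message.content)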

🎮 Demo

Run the following command to visualize the result of an episode:

python scripts/main.py

The resulting GIF visualizations are saved in the logs/ directory.


📊 Evaluation

To evaluate WMNav at scale (HM3D v0.1 contains 1000 episodes, HM3D v0.2 contains 2000, and MP3D contains 2195), we use a framework for parallel evaluation. The file parallel.sh contains a script that distributes K instances over N GPUs and has each instance run M episodes. Note that each episode consumes ~320MB of GPU memory. A local Flask server is initialized to handle data aggregation, and the aggregated results are logged to wandb. Make sure you are logged in via wandb login.

This implementation requires tmux to be installed. Please install it via your package manager:

  • Ubuntu/Debian: sudo apt install tmux
  • macOS (with Homebrew): brew install tmux
# parallel.sh
ROOT_DIR=PROJECT_DIR
CONDA_PATH="<user>/miniconda3/etc/profile.d/conda.sh"
NUM_GPU=5
INSTANCES=50
NUM_EPISODES_PER_INSTANCE=20  # 20 for HM3D v0.2, 40 for HM3D v0.1, 44 for MP3D 
MAX_STEPS_PER_EPISODE=40
TASK="ObjectNav"
DATASET="hm3d_v0.2"  # Dataset [hm3d_v0.1, hm3d_v0.2, mp3d]
CFG="WMNav"  # Name of config file
NAME="Evaluation"
PROJECT_NAME="WMNav"
VENV_NAME="wmnav" # Name of the conda environment
GPU_LIST=(3 4 5 6 7) # List of GPU IDs to use
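
With those variables set, a typical launch (assuming parallel.sh is invoked from the project root and, as the tmux requirement implies, starts one tmux session per instance) is:

bash parallel.sh
tmux ls   # list the evaluation sessions that were started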

Results are saved in the logs/ directory.

🔨 Customize Experiments

To run your own configuration, please refer to the YAML file detailing the configuration variables:

task: ObjectNav
agent_cls: WMNavAgent # agent class
env_cls: WMNavEnv # env class

agent_cfg:
  navigability_mode: 'depth_sensor' 
  context_history: 0
  explore_bias: 4 
  max_action_dist: 1.7
  min_action_dist: 0.5
  clip_frac: 0.66 # clip action distance to avoid getting too close to obstacles
  stopping_action_dist: 1.5 # length of actions after the agent calls stop
  default_action: 0.2 # how far forward to move if the VLM's chosen action is invalid
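
If you keep a customized copy of this configuration, a quick way to sanity-check it before launching is to load it with PyYAML (the file path below is a placeholder, not a file shipped with the repository):

# Load and inspect a customized copy of the config; the path is a placeholder.
import yaml

with open("config/my_experiment.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["agent_cls"], cfg["env_cls"])
print(cfg["agent_cfg"]["max_action_dist"], cfg["agent_cfg"]["min_action_dist"])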

💡 If you want to design your own model (by implementing your own CustomAgent and CustomEnv) or try the ablation experiments detailed in the paper, please refer to custom_agent.py and custom_env.py.

🙇 Acknowledgement

This work is built on many amazing research works and open-source projects. Thanks a lot to all the authors for sharing!

📝 Citation

If you find our work useful in your research, please consider giving a star ⭐ and citing the following paper 📝.

@article{nie2025wmnav,
  title={WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation},
  author={Nie, Dujun and Guo, Xianda and Duan, Yiqun and Zhang, Ruijun and Chen, Long},
  journal={arXiv preprint arXiv:2503.02247},
  year={2025}
}

🤗 Contact

For feedback, questions, or press inquiries please contact [email protected].
