1 Institute of Automation, Chinese Academy of Sciences, 2 School of Computer Science, Wuhan University, 3 School of Computer Science, University of Technology Sydney, 4 IAIR, Xi'an Jiaotong University, 5 Waytous
Email: [email protected], [email protected], [email protected]
Overview of our navigation framework
This project is based on VLMnav. Our method comprises three components:
- Novel architecture: Introducing a new direction for object goal navigation using a world model consisting of VLMs and novel modules.
- Memory strategy: Designing an innovative memory strategy of predicted environmental states, which employs an online Curiosity Value Map to quantitatively store the likelihood of the target's presence in the scenarios predicted by the world model.
- Efficiency: Proposing a subtask decomposition with feedback and a two-stage action proposer strategy to enhance the reliability of VLM reasoning outcomes and improve exploration efficiency.
- Mar. 14th, 2025: The code of WMNav is available!
- ☕️ Mar. 4th, 2025: We released our paper on arXiv.
- Clone this repo:
```bash
git clone https://github.com/B0B8K1ng/WMNavigation
cd WMNav
```
- Create the conda environment and install all dependencies:
```bash
conda create -n wmnav python=3.9 cmake=3.14.0
conda activate wmnav
conda install habitat-sim=0.3.1 withbullet headless -c conda-forge -c aihabitat
pip install -e .
pip install -r requirements.txt
```
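As an optional sanity check (this assumes the wmnav environment is activated), verify that habitat-sim imports cleanly:
```bash
python -c "import habitat_sim; print('habitat-sim OK')"
```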
This project is based on the Habitat simulator, and the HM3D and MP3D datasets are available here. Our code requires all of the above data to be placed in a data folder with the following structure. Move the downloaded HM3D v0.1, HM3D v0.2, and MP3D folders into the following layout:
```
├── <DATASET_ROOT>
│ ├── hm3d_v0.1/
│ │ ├── val/
│ │ │ ├── 00800-TEEsavR23oF/
│ │ │ │ ├── TEEsavR23oF.navmesh
│ │ │ │ ├── TEEsavR23oF.glb
│ │ ├── hm3d_annotated_basis.scene_dataset_config.json
│ ├── objectnav_hm3d_v0.1/
│ │ ├── val/
│ │ │ ├── content/
│ │ │ │ ├── 4ok3usBNeis.json.gz
│ │ │ ├── val.json.gz
│ ├── hm3d_v0.2/
│ │ ├── val/
│ │ │ ├── 00800-TEEsavR23oF/
│ │ │ │ ├── TEEsavR23oF.basis.navmesh
│ │ │ │ ├── TEEsavR23oF.basis.glb
│ │ ├── hm3d_annotated_basis.scene_dataset_config.json
│ ├── objectnav_hm3d_v0.2/
│ │ ├── val/
│ │ │ ├── content/
│ │ │ │ ├── 4ok3usBNeis.json.gz
│ │ │ ├── val.json.gz
│ ├── mp3d/
│ │ ├── 17DRP5sb8fy/
│ │ │ ├── 17DRP5sb8fy.glb
│ │ │ ├── 17DRP5sb8fy.house
│ │ │ ├── 17DRP5sb8fy.navmesh
│ │ │ ├── 17DRP5sb8fy_semantic.ply
│ │ ├── mp3d_annotated_basis.scene_dataset_config.json
│ ├── objectnav_mp3d/
│ │ ├── val/
│ │ │ ├── content/
│ │ │ │ ├── 2azQ1b91cZZ.json.gz
│ │ │ ├── val.json.gz
```
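For example (a sketch only; the source paths below are placeholders for wherever you downloaded and extracted the data), the folders can be symlinked into place rather than copied:
```bash
export DATASET_ROOT=/path/to/data   # same value as DATASET_ROOT in .env
mkdir -p "$DATASET_ROOT"
ln -s /path/to/downloads/hm3d_v0.1 "$DATASET_ROOT/hm3d_v0.1"
ln -s /path/to/downloads/objectnav_hm3d_v0.1 "$DATASET_ROOT/objectnav_hm3d_v0.1"
ln -s /path/to/downloads/hm3d_v0.2 "$DATASET_ROOT/hm3d_v0.2"
ln -s /path/to/downloads/objectnav_hm3d_v0.2 "$DATASET_ROOT/objectnav_hm3d_v0.2"
ln -s /path/to/downloads/mp3d "$DATASET_ROOT/mp3d"
ln -s /path/to/downloads/objectnav_mp3d "$DATASET_ROOT/objectnav_mp3d"
```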
The variable DATASET_ROOT can be set in the .env file.
To use the Gemini VLMs, paste a base URL and an API key into the .env file as the variables GEMINI_BASE_URL and GEMINI_API_KEY. You can also try other VLMs by modifying api.py (which uses the OpenAI client libraries).
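For reference, a minimal .env might look like the following (the values shown are placeholders, not real paths or credentials):
```bash
# .env
DATASET_ROOT=/path/to/data
GEMINI_BASE_URL=https://your-gemini-endpoint/v1
GEMINI_API_KEY=your_api_key_here
```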
Run the following command to visualize the result of an episode:
```bash
python scripts/main.py
```
After the run, GIFs of the episode should be saved in the logs/ directory.
To evaluate WMNav at scale (the validation splits contain 2000 episodes for HM3D v0.1, 1000 episodes for HM3D v0.2, and 2195 episodes for MP3D), we use a framework for parallel evaluation. The script parallel.sh distributes K instances over N GPUs and runs M episodes in each instance. Note that each episode consumes ~320MB of GPU memory. A local Flask server is initialized to handle data aggregation, and the aggregated results are logged to wandb. Make sure you are logged in with `wandb login`.
This implementation requires tmux to be installed. Please install it via your package manager:
- Ubuntu/Debian: `sudo apt install tmux`
- macOS (with Homebrew): `brew install tmux`
```bash
# parallel.sh
ROOT_DIR=PROJECT_DIR
CONDA_PATH="<user>/miniconda3/etc/profile.d/conda.sh"
NUM_GPU=5
INSTANCES=50
NUM_EPISODES_PER_INSTANCE=20 # 20 for HM3D v0.2, 40 for HM3D v0.1, 44 for MP3D
MAX_STEPS_PER_EPISODE=40
TASK="ObjectNav"
DATASET="hm3d_v0.2" # Dataset [hm3d_v0.1, hm3d_v0.2, mp3d]
CFG="WMNav" # Name of config file
NAME="Evaluation"
PROJECT_NAME="WMNav"
VENV_NAME="wmnav" # Name of the conda environment
GPU_LIST=(3 4 5 6 7) # List of GPU IDs to use
```
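After setting these variables for your machine (paths, dataset, and GPU list), a typical launch, assuming the script is run directly from the repository root, looks like:
```bash
wandb login       # one-time login so the aggregated results can be pushed to wandb
bash parallel.sh  # spawns the tmux-managed evaluation instances on the listed GPUs
```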
Results are saved in the logs/ directory.
To run your own configuration, please refer to the YAML config file, which details the configuration variables:
```yaml
task: ObjectNav
agent_cls: WMNavAgent # agent class
env_cls: WMNavEnv # env class
agent_cfg:
  navigability_mode: 'depth_sensor'
  context_history: 0
  explore_bias: 4
  max_action_dist: 1.7
  min_action_dist: 0.5
  clip_frac: 0.66 # clip action distance to avoid getting too close to obstacles
  stopping_action_dist: 1.5 # length of actions after the agent calls stop
  default_action: 0.2 # how far forward to move if the VLM's chosen action is invalid
```
💡 If you want to design your own model (by implementing your own CustomAgent and CustomEnv) or try the ablation experiments detailed in the paper, please refer to custom_agent.py and custom_env.py.
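As a rough sketch of how a custom setup plugs into the configuration above (the agent_cfg values are simply copied from the default config; whether your agent needs additional fields depends on your design), a custom run would point agent_cls and env_cls at your own classes:
```yaml
task: ObjectNav
agent_cls: CustomAgent # your subclass defined in custom_agent.py
env_cls: CustomEnv # your subclass defined in custom_env.py
agent_cfg:
  navigability_mode: 'depth_sensor'
  max_action_dist: 1.7
  min_action_dist: 0.5
```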
This work builds on many amazing research works and open-source projects; thanks to all the authors for sharing!
If you find our work useful in your research, please consider giving a star ⭐ and citing the following paper 📝.
```bibtex
@article{nie2025wmnav,
  title={WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation},
  author={Nie, Dujun and Guo, Xianda and Duan, Yiqun and Zhang, Ruijun and Chen, Long},
  journal={arXiv preprint arXiv:2503.02247},
  year={2025}
}
```
For feedback, questions, or press inquiries, please contact [email protected].