IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

"What I cannot create, I do not understand." ——Richard Feynman

arXiv | Project Page | Video

IR3D Logo

IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
Parker Liu1,*, Chenxin Li1,*, Zhengxin Li2, Yipeng Wu2, Wuyang Li3, Zhiqin Yang1, Zhenyuan Zhang4, Yunlong Lin5, Sirui Han4, Brandon Y. Feng6
1CUHK, 2TJU, 3EPFL, 4HKUST, 5XMU, 6MIT
NeurIPS 2025 Datasets and Benchmarks Track

🌟 Motivation & Useful Findings

  1. Inspired by Richard Feynman's aphorism (see the header), we propose a new perspective for evaluating VLMs' spatial visual understanding via a pretext task: how well can they "recreate this scene"?
  2. We find that the goal of scene reconstruction drives VLMs to spontaneously estimate key attributes (object ID, localization, color, material, object relations, etc.) in an inverse-rendering fashion, which is critical for understanding what they see.
  3. VLMs show surprising potential for human-like reflection during this "recreation" game: when fed their recreated scenes, they compare them with the originals and update their understanding of the scene (the key attributes they estimate). We expect this multi-round feedback iteration to unlock further gains for existing VLMs in both understanding and generation; a minimal sketch of the loop follows this list.
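
A minimal, runnable sketch of this recreate-compare-refine loop, assuming placeholder functions for the VLM estimation, Blender re-rendering, and reflection steps (none of these names are the repo's actual API):

# Hypothetical sketch of the multi-round feedback loop (point 3 above).
# Every function body is a placeholder, not the repo's actual API.

def estimate_scene_params(image):
    """Ask the VLM for object IDs, positions, colors, materials, relations."""
    return {"objects": []}  # placeholder estimate

def render_scene(params):
    """Re-render the estimated scene, e.g. via Blender."""
    return None  # placeholder rendering

def reflect(image, rendering, params):
    """Show the VLM its recreation next to the original and let it revise."""
    return params  # placeholder revision

def recreate_with_reflection(image, rounds=3):
    params = estimate_scene_params(image)  # initial inverse-rendering pass
    for _ in range(rounds):                # multi-round feedback iteration
        rendering = render_scene(params)
        params = reflect(image, rendering, params)
    return params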

🎨 Pipeline Overview

Pipeline

🛠️ Environment setup

(1) Create Environment:

conda create --name ir3d python=3.10
conda activate ir3d

(2) Install vLLM:

pip install vllm

(3) Install Blender (on Linux):

sudo snap install blender --classic

(4) Install SAM (Segment Anything):

pip install git+https://github.com/facebookresearch/segment-anything.git
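
Optionally, run a quick sanity check (a sketch, assuming the steps above succeeded) to confirm the key packages import and Blender is on the PATH:

# check_env.py -- hedged sanity check for the setup above.
import shutil

import vllm
import segment_anything  # noqa: F401  (import check only)

print("vllm version:", vllm.__version__)
print("segment_anything imported OK")
print("blender on PATH:", shutil.which("blender"))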

📚 Dataset setup

Download our processed data: IR3D-bench-data.

Inverse Rendering

Task prompt

The prompts for inverse rendering and for GPT-4o scoring are in prompts/vlm_estimate_params.txt and prompts/gpt4o_as_evaluator.txt, respectively.
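
For illustration, here is a hedged sketch of how the estimation prompt might be sent to a vision-capable model through the OpenAI Python client; the repo's actual invocation lives in main_api.py, and the image path below is a placeholder:

# Hypothetical usage of prompts/vlm_estimate_params.txt with the OpenAI client;
# the repo's actual call is implemented in main_api.py.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = open("prompts/vlm_estimate_params.txt").read()
with open("/path/to/images/scene_000.png", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # the estimated scene parameters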

Open-source Models

Set model-name to one of the model types defined in main_vllm.py to select the desired model.

python main_vllm.py --model-type "model-name"
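
Below is a sketch of what main_vllm.py presumably wraps, using vLLM's offline inference API; the model name and prompt template are illustrative assumptions, not the script's actual choices:

# Hypothetical vLLM offline inference for an open-source VLM; main_vllm.py
# may differ. Model name and prompt template are illustrative.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # example open-source VLM

prompt = ("USER: <image>\n"
          + open("prompts/vlm_estimate_params.txt").read()
          + "\nASSISTANT:")
image = Image.open("/path/to/images/scene_000.png")  # placeholder path

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=1024),
)
print(outputs[0].outputs[0].text)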

Latest Proprietary Models

Set model_name as needed, e.g., "gpt-4o", "grok-3", etc.

python main_api.py \
    --image_dir /path/to/images \
    --result_dir /output/path \
    --prompt_path prompts/vlm_estimate_params.txt \
    --model_name "model-name"

Eval

Run the evaluation script (GPT-4o serves as the scorer; see prompts/gpt4o_as_evaluator.txt):

bash cal_metric.sh "/output/path" "/path/to/images" "GPI_ID"

🎈 Acknowledgement

Thanks to the following fantastic repos: SAM, vLLM, the CLEVR dataset, and Blender.

📒 Citation

If you find our work helpful, please consider citing:

@article{liu2025ir3d,
  title={IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering},
  author={Liu, Parker and Li, Chenxin and Li, Zhengxin and Wu, Yipeng and Li, Wuyang and Yang, Zhiqin and Zhang, Zhenyuan and Lin, Yunlong and Han, Sirui and Feng, Brandon Y},
  journal={arXiv preprint arXiv:2506.23329},
  year={2025}
}
