"What I cannot create, I do not understand." ——Richard Feynman
IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
Parker Liu1,*, Chenxin Li1,*, Zhengxin Li2, Yipeng Wu2, Wuyang Li3, Zhiqin Yang1, Zhenyuan Zhang4, Yunlong Lin5, Sirui Han4, Brandon Y. Feng6
1CUHK, 2TJU, 3EPFL, 4HKUST, 5XMU, 6MIT
NeurIPS DB 2025
- Inspired by Richard Feynman's aphorism (see the header), we propose a new perspective for evaluating VLMs' spatial visual understanding via a pretext task: how well can they "recreate this scene"?
- We find that the goal of scene reconstruction drives VLMs to spontaneously estimate key attributes (object identity, localization, color, material, object relations, etc.) in an inverse-rendering fashion, which is critical for understanding what they see.
- VLMs show surprising potential for human-like reflection during this "recreation" game: when fed their recreated scenes, they compare them with the originals and update their understanding of the scene (the key attributes they estimate). We expect this multi-round feedback iteration to unlock further gains in both the understanding and generation performance of existing VLMs.
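The multi-round reflection loop described above can be sketched as follows. This is a minimal illustrative skeleton, not the benchmark implementation: `estimate_params`, `render_scene`, and `compare` are placeholders for the VLM parameter estimation, the Blender render, and the visual comparison, respectively.

```python
# Sketch of the recreate -> compare -> refine loop. All function bodies are
# placeholders; the real pipeline would call a VLM and a Blender renderer.

def estimate_params(image, feedback=None):
    # Placeholder: a VLM would return object IDs, positions, colors, materials.
    params = {"objects": [{"id": 0, "color": "red", "pos": (0, 0)}]}
    if feedback is not None:
        params["refined"] = True  # the VLM updates its estimate from feedback
    return params

def render_scene(params):
    # Placeholder: Blender would render the estimated parameters into an image.
    return f"render({len(params['objects'])} objects)"

def compare(original, recreated):
    # Placeholder: a VLM (or a metric) describes the mismatch between images.
    return "feedback: adjust positions"

def recreate_with_reflection(original_image, rounds=3):
    """Iterate estimation, rendering, and comparison for a few rounds."""
    feedback = None
    for _ in range(rounds):
        params = estimate_params(original_image, feedback)
        recreated = render_scene(params)
        feedback = compare(original_image, recreated)
    return params

final = recreate_with_reflection("scene.png")
```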
(1) Create environment:
```shell
conda create --name ir3d python=3.10
conda activate ir3d
```
(2) Install vLLM:
```shell
pip install vllm
```
(3) Install Blender on Linux:
```shell
snap install blender --classic
```
(4) Install SAM:
```shell
pip install git+https://github.com/facebookresearch/segment-anything.git
```
Download our processed data: IR3D-bench-data.
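As an optional sanity check after the steps above, the snippet below reports whether the two Python packages and the Blender binary are visible in the current environment (the module and binary names are the ones installed above):

```python
import importlib.util
import shutil

def check_dependencies():
    """Return a dict mapping each dependency to whether it was found."""
    status = {
        module: importlib.util.find_spec(module) is not None
        for module in ("vllm", "segment_anything")
    }
    # Blender is a CLI tool, so look for it on PATH instead of importing it.
    status["blender"] = shutil.which("blender") is not None
    return status

print(check_dependencies())
```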
The prompts for inverse rendering and for the GPT-4o evaluator score are in `prompts/vlm_estimate_params.txt` and `prompts/gpt4o_as_evaluator.txt`.
For open-source models, modify the model name as defined in `main_vllm.py` to use the required model:
```shell
python main_vllm.py --model-type "model-name"
```
For API models, change the model name as needed, e.g. "gpt-4o" or "grok-3":
```shell
python main_api.py \
    --image_dir /path/to/images \
    --result_dir /output/path \
    --prompt_path prompts/vlm_estimate_params.txt \
    --model_name "model-name"
```
Compute the metrics:
```shell
bash cal_metric.sh "/output/path" "/path/to/images" "GPI_ID"
```
Thanks to the following fantastic repos: SAM, vLLM, the CLEVR dataset, and Blender.
If you find our work helpful, please consider citing:
@article{liu2025ir3d,
title={IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering},
author={Liu, Parker and Li, Chenxin and Li, Zhengxin and Wu, Yipeng and Li, Wuyang and Yang, Zhiqin and Zhang, Zhenyuan and Lin, Yunlong and Han, Sirui and Feng, Brandon Y},
journal={arXiv preprint arXiv:2506.23329},
year={2025}
}
