The-Martyr/Awesome-Multimodal-Reasoning


Awesome-Multimodal-Reasoning

This is a repository for organizing papers related to Multimodal Reasoning in Multimodal Large Language Models (Image, Video).

As the visual and audio capabilities and the RL-powered reasoning capabilities of multimodal large language models (MLLMs/LVLMs/LSLMs) develop, researchers have high hopes for the multimodal reasoning capabilities of these models.

This repo also selects papers on visual generation (image generation/video generation) with RL/CoT.

⭐ If you find this list useful, please star it!

Paper List (Updating...)

Survey

(8 May 2025) Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models arXiv

(30 Apr 2025) Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models arXiv

(4 Apr 2025) Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning arXiv

(18 Mar 2025) Aligning Multimodal LLM with Human Preference: A Survey arXiv

(16 Mar 2025) Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey arXiv

Image Reasoning

(29 Oct 2025) PairUni: Pairwise Training for Unified Multimodal Language Models arXiv

(27 Oct 2025) VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation arXiv

(23 Oct 2025) Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning arXiv

(23 Oct 2025) Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation arXiv

(18 Oct 2025) RL makes MLLMs see better than SFT arXiv

(16 Oct 2025) MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning arXiv

(15 Oct 2025) Generative Universal Verifier as Multimodal Meta-Reasoner arXiv

(14 Oct 2025) HoneyBee: Data Recipes for Vision-Language Reasoners arXiv

(14 Oct 2025) DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search arXiv

(10 Oct 2025) Unleashing Perception-Time Scaling to Multimodal Reasoning Models arXiv

(10 Oct 2025) Spotlight on Token Perception for Multimodal Reinforcement Learning arXiv

(10 Oct 2025) Tiny-R1V: Lightweight Multimodal Unified Reasoning Model via Model Merging arXiv

(13 Oct 2025) CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images arXiv

(9 Oct 2025) ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping arXiv

(9 Oct 2025) SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models arXiv

(7 Oct 2025) Context Matters: Learning Global Semantics via Object-Centric Representation arXiv

(6 Oct 2025) Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment arXiv

(3 Oct 2025) Efficient Test-Time Scaling for Small Vision-Language Models arXiv

(27 Sep 2025) Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning arXiv

(29 Sep 2025) Latent Visual Reasoning arXiv

(29 Sep 2025) GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning arXiv

(28 Sep 2025) Poivre: Self-Refining Visual Pointing with Reinforcement Learning arXiv

(29 Sep 2025) VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding arXiv

(29 Sep 2025) Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks arXiv

(25 Sep 2025) MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources arXiv

(12 Sep 2025) LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA arXiv

(9 Sep 2025) Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search arXiv

(28 Aug 2025) R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning arXiv

(27 Aug 2025) Self-Rewarding Vision-Language Model via Reasoning Decomposition arXiv

(18 Aug 2025) M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following arXiv

(18 Aug 2025) Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation arXiv

(18 Aug 2025) Ovis2.5 Technical Report arXiv

(18 Aug 2025) MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models arXiv

(8 Aug 2025) SIFThinker: Spatially-Aware Image Focus for Visual Reasoning arXiv

(7 Aug 2025) Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision arXiv

(7 Aug 2025) StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models arXiv

(5 Aug 2025) Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions arXiv

(30 Jul 2025) MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention arXiv

(28 Jul 2025) Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback arXiv

(24 Jul 2025) MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning arXiv

(24 Jul 2025) SafeWork-R1: Coevolving Safety and Intelligence under the AI-45 Law arXiv

(22 Jul 2025) C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning arXiv

(22 Jul 2025) Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning arXiv

(11 Jul 2025) M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning arXiv

(3 Jul 2025) Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation arXiv

(1 Jul 2025) GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning arXiv

(20 Jun 2025) GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning arXiv

(16 Jun 2025) Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning arXiv

(11 Jun 2025) ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs arXiv

(5 Jun 2025) Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning arXiv

(5 Jun 2025) Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos arXiv

(5 Jun 2025) MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning arXiv

(16 May 2025) Visual Planning: Let's Think Only with Images arXiv

(15 May 2025) MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning arXiv

(13 May 2025) OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning arXiv

(12 May 2025) Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning arXiv

(8 May 2025) Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging arXiv

(8 May 2025) SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models arXiv

(6 May 2025) X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains arXiv

(6 May 2025) Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning arXiv

(6 May 2025) ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant arXiv

(5 May 2025) R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning arXiv

(28 Apr 2025) SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning arXiv

(25 Apr 2025) Fast-Slow Thinking for Large Vision-Language Model Reasoning arXiv

(25 Apr 2025) Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization arXiv

(25 Apr 2025) Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning arXiv

(21 Apr 2025) A Call for New Recipes to Enhance Spatial Reasoning in MLLMs arXiv

(20 Apr 2025) Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension arXiv

(12 Apr 2025) VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search arXiv

(10 Apr 2025) VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model arXiv

(10 Apr 2025) SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement arXiv

(10 Apr 2025) Perception-R1: Pioneering Perception Policy with Reinforcement Learning arXiv

(10 Apr 2025) Kimi-VL Technical Report arXiv

(8 Apr 2025) On the Suitability of Reinforcement Fine-Tuning to Visual Tasks arXiv

(8 Apr 2025) Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought arXiv

(1 Apr 2025) Improved Visual-Spatial Reasoning via R1-Zero-Like Training arXiv

(17 Mar 2025) R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization arXiv

(13 Mar 2025) VisualPRM: An Effective Process Reward Model for Multimodal Reasoning arXiv

(9 Mar 2025) Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models arXiv

(7 Mar 2025) R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning arXiv

(7 Mar 2025) Unified Reward Model for Multimodal Understanding and Generation arXiv

(7 Mar 2025) R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model arXiv

(3 Mar 2025) Visual-RFT: Visual Reinforcement Fine-Tuning arXiv

(4 Feb 2025) Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking arXiv

(3 Jan 2025) Virgo: A Preliminary Exploration on Reproducing o1-like MLLM arXiv

(13 Jan 2025) Imagine while Reasoning in Space: Multimodal Visualization-of-Thought arXiv

(10 Jan 2025) LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs arXiv

(9 Jan 2025) Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark arXiv

(30 Dec 2024) Slow Perception: Let's Perceive Geometric Figures Step-by-step arXiv

(19 Dec 2024) Progressive Multimodal Reasoning via Active Retrieval arXiv

(29 Nov 2024) Interleaved-Modal Chain-of-Thought arXiv

(15 Nov 2024) Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination arXiv

(15 Nov 2024) LLaVA-CoT: Let Vision Language Models Reason Step-by-Step arXiv

(30 Oct 2024) Vision-Language Models Can Self-Improve Reasoning via Reflection arXiv

(23 Oct 2024) R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models arXiv

(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning arXiv

(11 Oct 2024) M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought arXiv

(6 Oct 2024) MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration arXiv

(4 Oct 2024) Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning arXiv

(29 Sep 2024) CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought arXiv

(13 Jun 2024) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models arXiv

(28 Dec 2023) Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos arXiv

(14 Dec 2023) Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models arXiv

(27 Nov 2023) Compositional Chain-of-Thought Prompting for Large Multimodal Models arXiv

(15 Nov 2023) The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task arXiv

(3 May 2023) Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings arXiv

(16 Apr 2023) Chain of Thought Prompt Tuning in Vision Language Models arXiv

(2 Feb 2023) Multimodal Chain-of-Thought Reasoning in Language Models arXiv

Video

(23 Oct 2025) Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence arXiv

(9 Oct 2025) SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models arXiv

(6 Oct 2025) Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models arXiv

(5 Oct 2025) Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning arXiv

(29 Sep 2025) FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting arXiv

(29 Sep 2025) LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning arXiv

(28 Sep 2025) FrameMind: Frame-Interleaved Chain-of-Thought for Video Reasoning via Reinforcement Learning arXiv

(12 Jun 2025) CogStream: Context-guided Streaming Video Question Answering arXiv

(6 Jun 2025) VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning arXiv

(27 Mar 2025) Video-R1: Reinforcing Video Reasoning in MLLMs arXiv

(17 Feb 2025) video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model arXiv

(10 Feb 2025) CoS: Chain-of-Shot Prompting for Long Video Understanding arXiv

(8 Jan 2025) Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs arXiv

(3 Dec 2024) VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation arXiv

(2 Dec 2024) Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation arXiv

(29 Nov 2024) STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training arXiv

(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning arXiv

(12 Oct 2024) Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning arXiv

(27 Sep 2024) Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks arXiv

(28 Aug 2024) Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation arXiv

(24 May 2024) Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models arXiv

(7 May 2024) Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition arXiv code

(8 Oct 2024) Temporal Reasoning Transfer from Text to Video arXiv

DLLM

(9 Oct 2025) Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization arXiv

(9 Oct 2025) Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization arXiv

Audio

(23 Oct 2025) Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards arXiv

(10 Oct 2025) Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models arXiv

(8 Oct 2025) Can Speech LLMs Think while Listening? arXiv

(5 Oct 2025) Principled and Tractable RL for Reasoning with Diffusion Language Models arXiv

(22 Jul 2025) Step-Audio 2 Technical Report arXiv

(14 Mar 2025) Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering arXiv

Image/Video Generation

(24 Oct 2025) Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation arXiv

(15 Oct 2025) Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation arXiv

(9 Oct 2025) Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing arXiv

(9 Oct 2025) Reinforcing Diffusion Models by Direct Group Preference Optimization arXiv

(9 Oct 2025) Real-Time Motion-Controllable Autoregressive Video Diffusion arXiv

(29 Sep 2025) STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation arXiv

(28 Aug 2025) Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance arXiv

(28 Aug 2025) OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning arXiv

(28 Aug 2025) Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning arXiv

(27 Aug 2025) CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning arXiv

(9 Aug 2025) AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning arXiv

(28 Jul 2025) Multimodal LLMs as Customized Reward Models for Text-to-Image Generation arXiv

(20 Jun 2025) RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought arXiv

(17 Jun 2025) SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks arXiv

(16 May 2025) Towards Self-Improvement of Diffusion Models via Group Preference Optimization arXiv

(16 May 2025) Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models arXiv

(15 May 2025) Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models arXiv

(12 May 2025) DanceGRPO: Unleashing GRPO on Visual Generation arXiv

(8 May 2025) Flow-GRPO: Training Flow Matching Models via Online RL arXiv

(1 May 2025) T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT arXiv

(22 Apr 2025) From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning arXiv

(22 Apr 2025) Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning arXiv

(26 Mar 2025) MMGen: Unified Multi-modal Image Generation and Understanding in One Go arXiv

(13 Mar 2025) GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing arXiv

(3 Mar 2025) MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation arXiv

(23 Jan 2025) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step arXiv

Bench/Dataset

(15 Oct 2025) Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models arXiv

(14 Oct 2025) Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning arXiv

(10 Oct 2025) BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception arXiv

(10 Oct 2025) SpaceVista: All-Scale Visual Spatial Reasoning from mm to km arXiv

(9 Sep 2025) Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images arXiv

(27 Aug 2025) 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis arXiv

(8 Aug 2025) MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models arXiv

(8 Aug 2025) InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic? arXiv

(22 Jul 2025) ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering arXiv

(22 Jul 2025) Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning arXiv

(12 Jun 2025) VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos arXiv

(12 Jun 2025) MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning arXiv

(6 Jun 2025) PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts arXiv

(5 Jun 2025) VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos arXiv

(5 Jun 2025) MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark arXiv

(15 May 2025) StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation arXiv

(13 May 2025) VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models arXiv

(1 May 2025) MINERVA: Evaluating Complex Video Reasoning arXiv

(30 Apr 2025) GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling arXiv

(21 Apr 2025) IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs arXiv

(21 Apr 2025) VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models arXiv

(17 Apr 2025) Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark arXiv

(16 Apr 2025) FLIP Reasoning Challenge arXiv

(14 Apr 2025) VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge arXiv

(8 Apr 2025) ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering arXiv

(8 Apr 2025) V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models arXiv

(8 Apr 2025) MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models arXiv

(4 Apr 2025) Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme arXiv

(15 Feb 2025) SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding arXiv

(14 Feb 2025) MM-RLHF: The Next Step Forward in Multimodal LLM Alignment arXiv

(13 Feb 2025) MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency arXiv

(18 Dec 2024) Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces arXiv

(22 Nov 2024) VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection arXiv code

(18 Oct 2024) MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps arXiv

(7 Jul 2024) VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool arXiv

(20 Jun 2024) MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding arXiv

(12 Jun 2024) LVBench: An Extreme Long Video Understanding Benchmark arXiv

(24 Apr 2024) Cantor: Inspiring Multimodal Chain-of-Thought of MLLM arXiv

(16 Apr 2024) OpenEQA: Embodied Question Answering in the Era of Foundation Models arXiv

(17 Aug 2023) EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding arXiv

(23 May 2023) Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought arXiv

(18 May 2021) NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions arXiv

Latent

(29 Sep 2025) Latent Visual Reasoning arXiv

(12 Feb 2025) Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning arXiv

(7 Feb 2025) Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach arXiv

(9 Dec 2024) Training Large Language Models to Reason in a Continuous Latent Space arXiv

Open Source Project

https://github.com/Hui-design/Open-LLaVA-Video-R1

https://github.com/SkyworkAI/Skywork-R1V

https://huggingface.co/papers/2503.05379

https://github.com/Osilly/Vision-R1

https://github.com/ModalMinds/MM-EUREKA

https://github.com/OpenRLHF/OpenRLHF-M

https://github.com/Fancy-MLLM/R1-Onevision

https://github.com/om-ai-lab/VLM-R1

https://github.com/EvolvingLMMs-Lab/open-r1-multimodal

https://github.com/Deep-Agent/R1-V

https://github.com/TideDra/lmm-r1

https://github.com/tulerfeng/Video-R1

https://github.com/Wang-Xiaodong1899/Open-R1-Video