This repository provides the implementation used in Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking (EACL 2026 Oral Presentation).

Figure: Reasoning-chain faithfulness does not always align with final-answer correctness. (a–b) Visually unfaithful reasoning chains that nonetheless yield correct answers on perception tasks. (c) A visually faithful chain producing an incorrect answer, where the error arises from reasoning rather than perception.
We built our environment with Python 3.9.21.

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure paths and model settings: populate the required fields in `config.ini`:
  - `project_root`: Absolute path to the root directory of this codebase.
  - `hf_home`: Directory used to cache Hugging Face models and datasets.
  - `batch_size`: Batch size for vanilla prompting experiments. For self-reflection, the batch size is always 1.
  - `bedrock_judge_model`: Model ID for a Bedrock-based VLM judge. A list of supported models is available in the Amazon Bedrock documentation.
  - `anthropic_version`: Model version to use when running the VLM judge via Amazon Bedrock.
  - `huggingface_judge_model`: Model alias for an open-source VLM judge. Currently supported: `qwen` → `Qwen/Qwen2.5-VL-72B-Instruct`, `llava` → `llava-hf/llava-v1.6-34b-hf`.
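For reference, a filled-in `config.ini` might look like the sketch below. All values are placeholders (the section header, paths, and model IDs here are assumptions for illustration, not defaults shipped with the repository):

```ini
[DEFAULT]
; Absolute path to this codebase (placeholder)
project_root = /home/user/journey-before-destination
; Hugging Face cache directory (placeholder)
hf_home = /home/user/.cache/huggingface
batch_size = 8
; Example Bedrock model ID; check which models your account supports
bedrock_judge_model = anthropic.claude-3-7-sonnet-20250219-v1:0
anthropic_version = bedrock-2023-05-31
; Alias resolved to Qwen/Qwen2.5-VL-72B-Instruct
huggingface_judge_model = qwen
```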
The proposed mitigation strategy detects unfaithful sentences during generation using Claude 3.7 Sonnet, and then corrects those sentences via a self-reflection mechanism. The self-reflection algorithm is implemented in `how_to_intervene/self_reflection.py`.
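The actual algorithm lives in `how_to_intervene/self_reflection.py`; the sketch below only illustrates the general detect-then-correct loop described above (the function names, the judge's output schema, and the feedback prompt are assumptions, not the repository's interfaces):

```python
def self_reflect(question, image, generate, judge, max_rounds=3):
    """Detect-then-correct loop: flag unfaithful sentences and regenerate.

    `generate` and `judge` are caller-supplied callables standing in for
    the VLM and the Claude 3.7 Sonnet judge used in the paper.
    """
    response = generate(question, image, feedback=None)
    for _ in range(max_rounds):
        # The judge labels each sentence, e.g. {"sentence": ..., "faithful": bool}.
        verdicts = judge(question, image, response)
        unfaithful = [v["sentence"] for v in verdicts if not v["faithful"]]
        if not unfaithful:
            break  # every sentence is visually faithful; stop early
        # Ask the model to revise only the flagged sentences.
        feedback = "Revise these unfaithful sentences: " + " ".join(unfaithful)
        response = generate(question, image, feedback=feedback)
    return response
```

Because self-reflection re-generates conditioned on judge feedback, the loop processes one sample at a time, which is why the batch size for this method is fixed at 1.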
To run the mitigation strategy:

```bash
python run_mitigation.py \
    --method self-reflection \
    --model_alias thinklite-vl \
    --dataset mmeval-pro-perception \
    --response_filename responses.json
```
To run the vanilla baseline:

```bash
python run_mitigation.py \
    --method vanilla \
    --model_alias thinklite-vl \
    --dataset mmeval-pro-perception \
    --response_filename responses.json
```
Results are printed to the console. A typical output looks like:

```
Unfaithful perception steps: 0.0570902394106814 (31/543).
Illogical reasoning steps: 0.059782608695652176 (22/368).
Accuracy: 0.7413793103448276 (129/174).
Faithfulness:
{'total_sentences': 912, 'perception_sentences': 543, 'hallucinated_sentences': 31, 'reasoning_sentences': 368, 'illogical_sentences': 22}
```
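The printed rates follow directly from the counts in the faithfulness dictionary; as a quick sanity check (an illustrative script, not part of the repository):

```python
stats = {"total_sentences": 912, "perception_sentences": 543,
         "hallucinated_sentences": 31, "reasoning_sentences": 368,
         "illogical_sentences": 22}

# Unfaithful perception steps: hallucinated / perception sentences.
unfaithful_rate = stats["hallucinated_sentences"] / stats["perception_sentences"]
# Illogical reasoning steps: illogical / reasoning sentences.
illogical_rate = stats["illogical_sentences"] / stats["reasoning_sentences"]
# Accuracy: correct answers over questions evaluated.
accuracy = 129 / 174

print(f"{unfaithful_rate:.4f}")  # 0.0571
print(f"{illogical_rate:.4f}")   # 0.0598
print(f"{accuracy:.4f}")         # 0.7414
```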
The main script supports the following parameters:
- `method`: Generation strategy. Options: `vanilla`, `self-reflection`.
- `model_alias`: Vision–language model used for generation. These are simplified aliases (not Hugging Face IDs); the mapping from `model_alias` to Hugging Face identifiers is defined in `utils/model.py`. Currently supported: `openvlthinker`, `mmeureka`, `ocean-r1`, `thinklite-vl`.
- `dataset`: Evaluation dataset to run on. Options: `mmeval-pro-perception`, `hallusionbench`, `mmvp`.
- `question`: Runs single-sample inference using the provided question. If set, this overrides the `dataset` option.
- `image_paths`: Space-separated list of image file paths corresponding to `question`. This supersedes the `dataset` parameter.
- `response_filename`: Name of the JSON file used to store model outputs. All VLM generations, along with the corresponding inputs, are written to this file prior to LLM-based judging. An example of the saved JSON format is shown below:
```json
[
  {
    "query": "How many example pictures have you seen?\n \nA. 6\nB. 8\nC. 10\nD. 12",
    "gt_answer": "B",
    "model_response": "To determine how many example pictures have been shown, let's count them step by step...",
    "judge_response": "Evaluation of Model's Reasoning\n\nSentence 1: \"The first row has 3 example pictures.\"\nType: PERCEPTION\nFaithfulness: FAITHFUL\n\nSentence 2:...",
    "image_base64": "..."
  },
  {
    ...
  }
]
```
- `judge_model_category`: Judge used for scoring VLM outputs. Options: `bedrock` (default), `huggingface`. If `bedrock` is selected, the model specified by `bedrock_judge_model` in `config.ini` is used; if `huggingface` is selected, the model specified by `huggingface_judge_model` in `config.ini` is used.
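After a run, the saved response file can be inspected programmatically. The helper below is a hypothetical convenience function, not repository code; the field names match the sample JSON format shown above:

```python
import base64
import json


def load_responses(path):
    """Load a saved responses file and decode each stored image.

    Each record carries `query`, `gt_answer`, `model_response`,
    `judge_response`, and a base64-encoded `image_base64` field.
    """
    with open(path) as f:
        records = json.load(f)
    for rec in records:
        # Decode the image back to raw bytes for re-inspection.
        rec["image_bytes"] = base64.b64decode(rec["image_base64"])
    return records
```

For example, `load_responses("responses.json")` returns a list of dicts ready for error analysis alongside the judge's sentence-level labels.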
If you find our work useful, please cite our paper:

```bibtex
@inproceedings{uppaal2026journey,
  title={Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking},
  author={Uppaal, Rheeya and Htut, Phu Mon and Bai, Min and Pappas, Nikolaos and Qi, Zheng and Swamy, Sandesh},
  booktitle={The 19th Conference of the European Chapter of the Association for Computational Linguistics},
  year={2026}
}
```
This work is licensed under the terms specified in the LICENSE file.