
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

arXiv Project Webpage

Paper

This repository provides the implementation used in Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking (EACL 2026 Oral Presentation).

Figure. Reasoning-chain faithfulness does not always align with final-answer correctness. (a–b) Visually unfaithful reasoning chains that nonetheless yield correct answers on perception tasks. (c) A visually faithful chain producing an incorrect answer, where the error arises from reasoning rather than perception.

Using this Codebase

Setup

We built our environment with Python 3.9.21.

  1. Install dependencies
    pip install -r requirements.txt
  2. Configure paths and model settings
    Populate the required fields in config.ini:
    • project_root: Absolute path to the root directory of this codebase.
    • hf_home: Directory used to cache Hugging Face models and datasets.
    • batch_size: Batch size for vanilla prompting experiments. For self-reflection, the batch size is always 1.
    • bedrock_judge_model: Model ID for a Bedrock-based VLM judge. A list of supported models is available here.
    • anthropic_version: API version string used when invoking the VLM judge via Amazon Bedrock.
    • huggingface_judge_model: Model alias for an open-source VLM judge. Currently supported:
      • qwen: Qwen/Qwen2.5-VL-72B-Instruct
      • llava: llava-hf/llava-v1.6-34b-hf
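A config.ini covering the fields above might look like the following sketch. All values are illustrative placeholders (the section name, paths, and batch size are assumptions, not defaults shipped with the repository):

```ini
[DEFAULT]
; Absolute path to the root of this codebase (placeholder)
project_root = /home/user/journey-before-destination
; Hugging Face cache directory (placeholder)
hf_home = /home/user/.cache/huggingface
; Batch size for vanilla prompting; self-reflection always uses 1
batch_size = 8
; Bedrock judge settings (model ID shown as an example)
bedrock_judge_model = anthropic.claude-3-7-sonnet-20250219-v1:0
anthropic_version = bedrock-2023-05-31
; Open-source judge alias: qwen or llava
huggingface_judge_model = qwen
```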

Running the Mitigation Strategy

The proposed mitigation strategy detects unfaithful sentences during generation using Claude 3.7 Sonnet, and then corrects those sentences via a self-reflection mechanism. The implementation of the self-reflection algorithm can be found at:

how_to_intervene/self_reflection.py
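In outline, the mechanism alternates sentence-level judging and regeneration: each generated sentence is checked by the judge, and flagged sentences are rewritten before generation continues. The sketch below is a minimal illustration of that loop, not the repository's actual code; `judge` and `regenerate` are hypothetical callables standing in for the Bedrock judge and the VLM:

```python
from typing import Callable, List

def self_reflect(
    sentences: List[str],
    judge: Callable[[str], bool],                 # True -> sentence is visually faithful
    regenerate: Callable[[List[str], str], str],  # (accepted context, flagged sentence) -> revision
    max_retries: int = 3,
) -> List[str]:
    """Re-generate any sentence the judge flags as unfaithful, keeping the
    already-accepted prefix as context for each revision attempt."""
    accepted: List[str] = []
    for sent in sentences:
        attempt = sent
        for _ in range(max_retries):
            if judge(attempt):
                break  # sentence accepted as faithful
            attempt = regenerate(accepted, attempt)
        accepted.append(attempt)
    return accepted
```

Capping the number of retries per sentence keeps the loop bounded even when the judge never accepts a revision.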

Example Commands

To run the mitigation strategy:

python run_mitigation.py \
  --method self-reflection \
  --model_alias thinklite-vl \
  --dataset mmeval-pro-perception \
  --response_filename responses.json

To run the vanilla baseline:

python run_mitigation.py \
  --method vanilla \
  --model_alias thinklite-vl \
  --dataset mmeval-pro-perception \
  --response_filename responses.json

Results are printed to the console. A typical output looks like:

Unfaithful perception steps: 0.0570902394106814 (31/543).
Illogical reasoning steps: 0.059782608695652176 (22/368).
Accuracy: 0.7413793103448276 (129/174).
Faithfulness:
{'total_sentences': 912, 'perception_sentences': 543, 'hallucinated_sentences': 31, 'reasoning_sentences': 368, 'illogical_sentences': 22}
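The printed rates are simple ratios over the counts in the faithfulness dictionary, as this small check illustrates (the variable names are ours, not the script's):

```python
# Counts taken from the example output above
counts = {'total_sentences': 912, 'perception_sentences': 543,
          'hallucinated_sentences': 31, 'reasoning_sentences': 368,
          'illogical_sentences': 22}

# Unfaithful perception steps = hallucinated / perception sentences
unfaithful = counts['hallucinated_sentences'] / counts['perception_sentences']
# Illogical reasoning steps = illogical / reasoning sentences
illogical = counts['illogical_sentences'] / counts['reasoning_sentences']

print(f"Unfaithful perception steps: {unfaithful} (31/543).")
print(f"Illogical reasoning steps: {illogical} (22/368).")
```

Accuracy is reported separately as the fraction of correctly answered questions (129/174 in this run).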

Command-Line Arguments

The main script supports the following parameters:

  • method: Generation strategy. Options: vanilla, self-reflection.
  • model_alias: Vision–language model used for generation. These are simplified aliases (not Hugging Face IDs). The mapping from model_alias to Hugging Face identifiers is defined in utils/model.py. Currently supported:
    • openvlthinker
    • mmeureka
    • ocean-r1
    • thinklite-vl
  • dataset: Evaluation dataset to run on. Options: mmeval-pro-perception, hallusionbench, mmvp.
  • question: Runs single-sample inference on the provided question. If set, this overrides the dataset option.
  • image_paths: Space-separated list of image file paths corresponding to question; used together with question, it likewise supersedes the dataset parameter.
  • response_filename: Name of the JSON file used to store model outputs. All VLM generations, along with the corresponding inputs, are written to this file prior to LLM-based judging. An example of the saved JSON format is shown below:
[
    {
        "query": "How many example pictures have you seen?\n \nA. 6\nB. 8\nC. 10\nD. 12",
        "gt_answer": "B",
        "model_response": "To determine how many example pictures have been shown, let's count them step by step...",
        "judge_response": "Evaluation of Model's Reasoning\n\nSentence 1: \"The first row has 3 example pictures.\"\nType: PERCEPTION\nFaithfulness: FAITHFUL\n\nSentence 2:...",
        "image_base64": "..."
    },
    {
        ...
    }
]
  • judge_model_category: Judge used for scoring VLM outputs.
    • Options: bedrock (default), huggingface.
    • If bedrock is selected, the model specified by bedrock_judge_model in config.ini is used.
    • If huggingface is selected, the model specified by huggingface_judge_model in config.ini is used.
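For downstream analysis, the saved response file can be loaded with the standard json module. A minimal sketch, assuming only the field names shown in the example above (the per-sentence "Faithfulness: UNFAITHFUL" label in judge_response is an assumption about the judge's output format):

```python
import json

def load_responses(path: str):
    """Load a saved responses file and summarize each record."""
    with open(path) as f:
        records = json.load(f)
    summary = []
    for rec in records:
        summary.append({
            "gold": rec["gt_answer"],
            # judge_response labels each sentence (PERCEPTION/REASONING,
            # FAITHFUL/UNFAITHFUL); count the flagged sentences:
            "unfaithful": rec["judge_response"].count("Faithfulness: UNFAITHFUL"),
        })
    return summary
```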

Citation

If you find our work useful, please cite our paper:

@inproceedings{uppaal2026journey,
  title={Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking},
  author={Uppaal, Rheeya and Htut, Phu Mon and Bai, Min and Pappas, Nikolaos and Qi, Zheng and Swamy, Sandesh},
  booktitle={The 19th Conference of the European Chapter of the Association for Computational Linguistics},
  year={2026}
}

License

This work is licensed under the terms specified in the LICENSE file.
