
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

arXiv Project Webpage

Paper

This repository provides the implementation used in Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking (EACL 2026 Oral Presentation).

Figure. Reasoning-chain faithfulness does not always align with final-answer correctness. (a–b) Visually unfaithful reasoning chains that nonetheless yield correct answers on perception tasks. (c) A visually faithful chain producing an incorrect answer, where the error arises from reasoning rather than perception.

Using this Codebase

Setup

We built our environment with Python 3.9.21.

  1. Install dependencies
    pip install -r requirements.txt
  2. Configure paths and model settings
    Populate the required fields in config.ini:
    • project_root: Absolute path to the root directory of this codebase.
    • hf_home: Directory used to cache Hugging Face models and datasets.
    • batch_size: Batch size for vanilla prompting experiments. For self-reflection, the batch size is always 1.
    • bedrock_judge_model: Model ID for a Bedrock-based VLM judge. A list of supported models is available here.
    • anthropic_version: API version string used when invoking the VLM judge via Amazon Bedrock.
    • huggingface_judge_model: Model alias for an open-source VLM judge. Currently supported:
      • qwen: Qwen/Qwen2.5-VL-72B-Instruct
      • llava: llava-hf/llava-v1.6-34b-hf
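A config.ini covering the fields above might look like the following sketch. All values are illustrative placeholders (the section name, paths, and batch size are assumptions, not defaults shipped with the repository):

```ini
[DEFAULT]
; Absolute path to the root of this codebase (placeholder)
project_root = /home/user/journey-before-destination
; Hugging Face cache directory (placeholder)
hf_home = /home/user/.cache/huggingface
; Batch size for vanilla prompting; self-reflection always uses 1
batch_size = 8
; Bedrock judge settings (model ID shown as an example)
bedrock_judge_model = anthropic.claude-3-7-sonnet-20250219-v1:0
anthropic_version = bedrock-2023-05-31
; Open-source judge alias: qwen or llava
huggingface_judge_model = qwen
```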

Running the Mitigation Strategy

The proposed mitigation strategy detects unfaithful sentences during generation using Claude 3.7 Sonnet, and then corrects those sentences via a self-reflection mechanism. The implementation of the self-reflection algorithm can be found at:

how_to_intervene/self_reflection.py
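In outline, the mechanism alternates sentence-level judging and regeneration: each generated sentence is checked by the judge, and flagged sentences are rewritten before generation continues. The sketch below is a minimal illustration of that loop, not the repository's actual code; `judge` and `regenerate` are hypothetical callables standing in for the Bedrock judge and the VLM:

```python
from typing import Callable, List

def self_reflect(
    sentences: List[str],
    judge: Callable[[str], bool],                 # True -> sentence is visually faithful
    regenerate: Callable[[List[str], str], str],  # (accepted context, flagged sentence) -> revision
    max_retries: int = 3,
) -> List[str]:
    """Re-generate any sentence the judge flags as unfaithful, keeping the
    already-accepted prefix as context for each revision attempt."""
    accepted: List[str] = []
    for sent in sentences:
        attempt = sent
        for _ in range(max_retries):
            if judge(attempt):
                break  # sentence accepted as faithful
            attempt = regenerate(accepted, attempt)
        accepted.append(attempt)
    return accepted
```

Capping the number of retries per sentence keeps the loop bounded even when the judge never accepts a revision.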

Example Commands

To run the mitigation strategy:

python run_mitigation.py \
  --method self-reflection \
  --model_alias thinklite-vl \
  --dataset mmeval-pro-perception \
  --response_filename responses.json

To run the vanilla baseline:

python run_mitigation.py \
  --method vanilla \
  --model_alias thinklite-vl \
  --dataset mmeval-pro-perception \
  --response_filename responses.json

Results are printed to the console. A typical output looks like:

Unfaithful perception steps: 0.0570902394106814 (31/543).
Illogical reasoning steps: 0.059782608695652176 (22/368).
Accuracy: 0.7413793103448276 (129/174).
Faithfulness:
{'total_sentences': 912, 'perception_sentences': 543, 'hallucinated_sentences': 31, 'reasoning_sentences': 368, 'illogical_sentences': 22}
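The printed rates are simple ratios over the counts in the faithfulness dictionary, as this small check illustrates (the variable names are ours, not the script's):

```python
# Counts taken from the example output above
counts = {'total_sentences': 912, 'perception_sentences': 543,
          'hallucinated_sentences': 31, 'reasoning_sentences': 368,
          'illogical_sentences': 22}

# Unfaithful perception steps = hallucinated / perception sentences
unfaithful = counts['hallucinated_sentences'] / counts['perception_sentences']
# Illogical reasoning steps = illogical / reasoning sentences
illogical = counts['illogical_sentences'] / counts['reasoning_sentences']

print(f"Unfaithful perception steps: {unfaithful} (31/543).")
print(f"Illogical reasoning steps: {illogical} (22/368).")
```

Accuracy is reported separately as the fraction of correctly answered questions (129/174 in this run).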

Command-Line Arguments

The main script supports the following parameters:

  • method: Generation strategy. Options: vanilla, self-reflection.
  • model_alias: Vision–language model used for generation. These are simplified aliases (not Hugging Face IDs). The mapping from model_alias to Hugging Face identifiers is defined in utils/model.py. Currently supported:
    • openvlthinker
    • mmeureka
    • ocean-r1
    • thinklite-vl
  • dataset: Evaluation dataset to run on. Options: mmeval-pro-perception, hallusionbench, mmvp.
  • question: Runs single-sample inference on the provided question. If set, this overrides the dataset option.
  • image_paths: Space-separated list of image file paths corresponding to question; used together with question, it likewise supersedes the dataset parameter.
  • response_filename: Name of the JSON file used to store model outputs. All VLM generations, along with the corresponding inputs, are written to this file prior to LLM-based judging. An example of the saved JSON format is shown below:
[
    {
        "query": "How many example pictures have you seen?\n \nA. 6\nB. 8\nC. 10\nD. 12",
        "gt_answer": "B",
        "model_response": "To determine how many example pictures have been shown, let's count them step by step...",
        "judge_response": "Evaluation of Model's Reasoning\n\nSentence 1: \"The first row has 3 example pictures.\"\nType: PERCEPTION\nFaithfulness: FAITHFUL\n\nSentence 2:...",
        "image_base64": "..."
    },
    {
        ...
    }
]
  • judge_model_category: Judge used for scoring VLM outputs.
    • Options: bedrock (default), huggingface.
    • If bedrock is selected, the model specified by bedrock_judge_model in config.ini is used.
    • If huggingface is selected, the model specified by huggingface_judge_model in config.ini is used.
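For downstream analysis, the saved response file can be loaded with the standard json module. A minimal sketch, assuming only the field names shown in the example above (the per-sentence "Faithfulness: UNFAITHFUL" label in judge_response is an assumption about the judge's output format):

```python
import json

def load_responses(path: str):
    """Load a saved responses file and summarize each record."""
    with open(path) as f:
        records = json.load(f)
    summary = []
    for rec in records:
        summary.append({
            "gold": rec["gt_answer"],
            # judge_response labels each sentence (PERCEPTION/REASONING,
            # FAITHFUL/UNFAITHFUL); count the flagged sentences:
            "unfaithful": rec["judge_response"].count("Faithfulness: UNFAITHFUL"),
        })
    return summary
```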

Citation

If you find our work useful, please cite our paper:

@inproceedings{uppaal2026journey,
  title={Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking},
  author={Uppaal, Rheeya and Htut, Phu Mon and Bai, Min and Pappas, Nikolaos and Qi, Zheng and Swamy, Sandesh},
  booktitle={The 19th Conference of the European Chapter of the Association for Computational Linguistics},
  year={2026}
}

License

This work is licensed under the terms specified in the LICENSE file.
