[BUG] Low performance when evaluating Qwen3-8B on MMLU-Redux in thinking mode #1094

@StarLooo

Describe the bug

Hello,

I've recently been trying to evaluate the Qwen3-8B model on the MMLU-Redux benchmark using lighteval.
To allow switching between thinking and non-thinking modes, I made a minor modification to the lighteval source code: I added an enable_thinking argument to the PromptManager initialization, so that it is passed to tokenizer.apply_chat_template() when the internal _prepare_chat_template() function is called.
I believe this modification should not affect the MMLU-Redux evaluation itself.
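
For context, the change amounts to forwarding one extra keyword argument into the chat template call. A minimal sketch, assuming a Hugging Face style tokenizer whose chat template reads enable_thinking (as the Qwen3 template does); render_prompt is a hypothetical helper for illustration, not lighteval's actual _prepare_chat_template():

```python
from typing import Any, Dict, List


def render_prompt(
    tokenizer: Any,
    messages: List[Dict[str, str]],
    enable_thinking: bool = True,
) -> str:
    """Render chat messages to a prompt string, forwarding enable_thinking.

    Hypothetical helper: extra keyword arguments passed to
    apply_chat_template() are made available to the chat template itself.
    """
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return text rather than token ids
        add_generation_prompt=True,  # end the prompt at the assistant turn
        enable_thinking=enable_thinking,
    )
```

With a real Qwen3 tokenizer, calling this with enable_thinking=False should produce the non-thinking prompt, and leaving the default gives the thinking prompt.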

In non-thinking mode, I obtained a pass@1 of 79.7, which is very close to the 79.5 reported in the official Qwen3 technical report.
The command used was:

LIGHTEVAL_CONFIG="model_name=Qwen/Qwen3-8B,enable_thinking=false,dtype=bfloat16,max_model_length=40960,max_num_batched_tokens=32768,data_parallel_size=1,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.7,top_p:0.8,top_k:20}"
lighteval vllm \
    ${LIGHTEVAL_CONFIG} \
    mmlu_redux_2 \
    --remove-reasoning-tags

However, when I changed enable_thinking from false to true to evaluate thinking mode, I obtained a pass@1 of only 51.8, a drop of 27.9 points relative to non-thinking mode.

Other benchmarks I tested (such as IFEval and GSM8K) did not show a similar drop in thinking mode. I therefore suspect an issue in lighteval's implementation of the MMLU-Redux evaluation, possibly in how the prompt or the answer extraction interacts with the model's reasoning output.

Could you please look into this potential issue with the MMLU-Redux evaluation script?

Thank you for your time!

To Reproduce

The default enable_thinking setting for the Qwen3 tokenizer is true, so the argument can be omitted. Qwen3-8B can be evaluated on MMLU-Redux in thinking mode with the following command:

LIGHTEVAL_CONFIG="model_name=Qwen/Qwen3-8B,dtype=bfloat16,max_model_length=40960,max_num_batched_tokens=32768,data_parallel_size=1,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95,top_k:20}"
lighteval vllm \
    ${LIGHTEVAL_CONFIG} \
    mmlu_redux_2 \
    --remove-reasoning-tags
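
The --remove-reasoning-tags flag is meant to drop the model's reasoning before scoring. A rough sketch of that kind of post-processing, assuming the reasoning is delimited by <think> and </think> as in Qwen3 output; strip_reasoning is a hypothetical helper, not lighteval's implementation:

```python
import re

# Assumed tag pair: Qwen3 wraps its reasoning in <think>...</think>.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove every complete <think>...</think> block from the text."""
    return THINK_BLOCK.sub("", text)
```

Note that if generation is cut off before </think> is emitted, this pattern matches nothing and the reasoning stays in the prediction, which may be worth checking in the logs.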

Other Observation

Upon inspecting the evaluation logs, I noticed a large number of extraction failures like this one:

We did not manage to extract a prediction in the correct format. Gold: ['C'], Pred: ["\n\nTo determine the correct conclusion from the given data, we need to apply the **Doppler effect** in the context of light. The Doppler effect describes how the wavelength of light from a moving source changes relative to an observer. This change in wavelength is what allows us to infer the motion of celestial objects.\n\n---\n\n### Key Concept: The Doppler Effect\n\n- **Blueshift**: When a source of light is moving **towards** the observer, the light waves are compressed, resulting in a **shorter wavelength** (blue shift).\n- **Redshift**: When a source of light is moving **away** from the observer, the light waves are stretched, resulting in a **longer wavelength** (red shift).\n\nIn this case:\n- The **rest wavelength** of the hydrogen spectral line is **486.1 nm**.\n- The **observed wavelength** in the star's spectrum is **485.9 nm**.\n\nThis means the observed wavelength is **shorter** than the rest wavelength, indicating a **blueshift**.\n\n---\n\n### Eliminating Other Options\n\n- **Option A: The star is getting hotter.**  \n  A change in the star's temperature would affect the **entire spectrum** (e.g., the blackbody curve), not just a specific spectral line. The position of a specific line (like the hydrogen line) is not directly related to the star's temperature. So this is **not the correct conclusion**.\n\n- **Option B: The star is getting colder.**  \n  Similar to option A, a change in temperature would not cause a specific spectral line to shift in wavelength. This is also **not the correct conclusion**.\n\n- **Option D: The star is moving away from us.**  \n  A redshift (longer wavelength) would indicate the star is moving away. Since the observed wavelength is **shorter**, this is **not the case**.\n\n- **Option C: The star is moving toward us.**  \n  A **blueshift** (shorter wavelength) is the result of the star moving **towards** the observer. This is **consistent** with the observed data.\n\n---\n\n### Final Answer\n\n$$\n\\boxed{C}\n$$"]
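
In this log the model does reach the gold answer, but it reports it as \boxed{C}, so the failure looks like a format mismatch rather than a wrong prediction. As an illustration only, here is a sketch of a more lenient extractor that accepts either an "Answer: X" line or a \boxed{X} expression; the patterns and the extract_choice name are assumptions for this example and are not lighteval's extraction logic:

```python
import re
from typing import Optional

# Patterns tried in order; both are assumptions about plausible answer formats.
ANSWER_LINE = re.compile(r"answer\s*:\s*\(?([A-D])\b", re.IGNORECASE)
BOXED = re.compile(r"\\boxed\{\s*([A-D])\s*\}")


def extract_choice(prediction: str) -> Optional[str]:
    """Return the last choice letter matched by the first pattern that hits."""
    for pattern in (ANSWER_LINE, BOXED):
        matches = pattern.findall(prediction)
        if matches:
            return matches[-1].upper()
    return None
```

Applied to the tail of the prediction above, extract_choice would recover the letter inside \boxed{}, whereas a scorer that only accepts an "Answer: X" line would report an extraction failure.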

Version info

To support enable_thinking, I installed lighteval from source; the reported version is 0.13.1.dev0.
