[BUG] Get low performance when evaluate Qwen3-8B on MMLU-Redux using the thinking mode.

## Describe the bug
Hello,

I've recently been attempting to evaluate the `Qwen3-8B` model on the `MMLU-Redux` benchmark using `lighteval`.
To support the simultaneous switching between `thinking mode` and `no-thinking mode` , I made a minor modification to the `lighteval` source code: I added an `enable_thinking` argument during the`PromptManager` initialization. This allows the corresponding `enable_thinking` argument to be passed to `tokenizer.apply_chat_template()` when calling the internal `_prepare_chat_template()` function.
However, I believe this modification should be irrelevant to the evaluation of `MMLU-Redux` itself.

In the `no-thinking mode`, I achieved a **79.7** pass@1 result, which is very close to the **79.5** reported in the official Qwen3 technical report.
The command used was:
```
LIGHTEVAL_CONFIG="model_name=Qwen/Qwen3-8B,enable_thinking=false,dtype=bfloat16,max_model_length=40960,max_num_batched_tokens=32768,data_parallel_size=1,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.7,top_p:0.8,top_k:20}"
lighteval vllm \
    ${LIGHTEVAL_CONFIG} \
    mmlu_redux_2 \
    --remove-reasoning-tags
```

However, when I changed `enable_thinking` from `false` to `true` to evaluate the `thinking mode`, I only obtained a **51.8** pass@1 result. This is a massive drop in performance.

The results on several other benchmarks I've tested (such as `IFEval`, `GSM8K`, etc.) did not show a similar inverse contrast. Therefore, I suspect there might be an issue with the specific implementation of the `MMLU-Redux` evaluation within lighteval, possibly related to how the prompt or answer extraction interacts with the thought process.

Could you please look into this potential issue with the `MMLU-Redux` evaluation script?

Thank you for your time!

## To Reproduce
the default `enable_thinking` setting for Qwen3 tokenizer is true, so it can be ignored, and one can run evaluation on `MMLU-Redux` using `Qwen3-8B` `thinking mode` with the following cammand:
```
LIGHTEVAL_CONFIG="model_name=Qwen/Qwen3-8B,dtype=bfloat16,max_model_length=40960,max_num_batched_tokens=32768,data_parallel_size=1,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95,top_k:20}"
lighteval vllm \
    ${LIGHTEVAL_CONFIG} \
    mmlu_redux_2 \
    --remove-reasoning-tags
```

## Other Observation
Upon inspecting the evaluation logs, I noticed a large number of similar extraction fails: 
```
We did not manage to extract a prediction in the correct format. Gold: ['C'], Pred: ["\n\nTo determine the correct conclusion from the given data, we need to apply the **Doppler effect** in the context of light. The Doppler effect describes how the wavelength of light from a moving source changes relative to an observer. This change in wavelength is what allows us to infer the motion of celestial objects.\n\n---\n\n### Key Concept: The Doppler Effect\n\n- **Blueshift**: When a source of light is moving **towards** the observer, the light waves are compressed, resulting in a **shorter wavelength** (blue shift).\n- **Redshift**: When a source of light is moving **away** from the observer, the light waves are stretched, resulting in a **longer wavelength** (red shift).\n\nIn this case:\n- The **rest wavelength** of the hydrogen spectral line is **486.1 nm**.\n- The **observed wavelength** in the star's spectrum is **485.9 nm**.\n\nThis means the observed wavelength is **shorter** than the rest wavelength, indicating a **blueshift**.\n\n---\n\n### Eliminating Other Options\n\n- **Option A: The star is getting hotter.**  \n  A change in the star's temperature would affect the **entire spectrum** (e.g., the blackbody curve), not just a specific spectral line. The position of a specific line (like the hydrogen line) is not directly related to the star's temperature. So this is **not the correct conclusion**.\n\n- **Option B: The star is getting colder.**  \n  Similar to option A, a change in temperature would not cause a specific spectral line to shift in wavelength. This is also **not the correct conclusion**.\n\n- **Option D: The star is moving away from us.**  \n  A redshift (longer wavelength) would indicate the star is moving away. Since the observed wavelength is **shorter**, this is **not the case**.\n\n- **Option C: The star is moving toward us.**  \n  A **blueshift** (shorter wavelength) is the result of the star moving **towards** the observer. This is **consistent** with the observed data.\n\n---\n\n### Final Answer\n\n$$\n\\boxed{C}\n$$"]
```

## Version info
To support `enable_thinking` , I just install `lighteval` from source and the version info show its `0.13.1.dev0`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] Get low performance when evaluate Qwen3-8B on MMLU-Redux using the thinking mode. #1094

Describe the bug

To Reproduce

Other Observation

Version info

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] Get low performance when evaluate Qwen3-8B on MMLU-Redux using the thinking mode. #1094

Description

Describe the bug

To Reproduce

Other Observation

Version info

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions