-
Notifications
You must be signed in to change notification settings - Fork 405
Description
Describe the bug
Hello,
I've recently been attempting to evaluate the Qwen3-8B model on the MMLU-Redux benchmark using lighteval.
To support the simultaneous switching between thinking mode and no-thinking mode , I made a minor modification to the lighteval source code: I added an enable_thinking argument during thePromptManager initialization. This allows the corresponding enable_thinking argument to be passed to tokenizer.apply_chat_template() when calling the internal _prepare_chat_template() function.
However, I believe this modification should be irrelevant to the evaluation of MMLU-Redux itself.
In the no-thinking mode, I achieved a 79.7 pass@1 result, which is very close to the 79.5 reported in the official Qwen3 technical report.
The command used was:
LIGHTEVAL_CONFIG="model_name=Qwen/Qwen3-8B,enable_thinking=false,dtype=bfloat16,max_model_length=40960,max_num_batched_tokens=32768,data_parallel_size=1,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.7,top_p:0.8,top_k:20}"
lighteval vllm \
${LIGHTEVAL_CONFIG} \
mmlu_redux_2 \
--remove-reasoning-tags
However, when I changed enable_thinking from false to true to evaluate the thinking mode, I only obtained a 51.8 pass@1 result. This is a massive drop in performance.
The results on several other benchmarks I've tested (such as IFEval, GSM8K, etc.) did not show a similar inverse contrast. Therefore, I suspect there might be an issue with the specific implementation of the MMLU-Redux evaluation within lighteval, possibly related to how the prompt or answer extraction interacts with the thought process.
Could you please look into this potential issue with the MMLU-Redux evaluation script?
Thank you for your time!
To Reproduce
the default enable_thinking setting for Qwen3 tokenizer is true, so it can be ignored, and one can run evaluation on MMLU-Redux using Qwen3-8B thinking mode with the following cammand:
LIGHTEVAL_CONFIG="model_name=Qwen/Qwen3-8B,dtype=bfloat16,max_model_length=40960,max_num_batched_tokens=32768,data_parallel_size=1,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95,top_k:20}"
lighteval vllm \
${LIGHTEVAL_CONFIG} \
mmlu_redux_2 \
--remove-reasoning-tags
Other Observation
Upon inspecting the evaluation logs, I noticed a large number of similar extraction fails:
We did not manage to extract a prediction in the correct format. Gold: ['C'], Pred: ["\n\nTo determine the correct conclusion from the given data, we need to apply the **Doppler effect** in the context of light. The Doppler effect describes how the wavelength of light from a moving source changes relative to an observer. This change in wavelength is what allows us to infer the motion of celestial objects.\n\n---\n\n### Key Concept: The Doppler Effect\n\n- **Blueshift**: When a source of light is moving **towards** the observer, the light waves are compressed, resulting in a **shorter wavelength** (blue shift).\n- **Redshift**: When a source of light is moving **away** from the observer, the light waves are stretched, resulting in a **longer wavelength** (red shift).\n\nIn this case:\n- The **rest wavelength** of the hydrogen spectral line is **486.1 nm**.\n- The **observed wavelength** in the star's spectrum is **485.9 nm**.\n\nThis means the observed wavelength is **shorter** than the rest wavelength, indicating a **blueshift**.\n\n---\n\n### Eliminating Other Options\n\n- **Option A: The star is getting hotter.** \n A change in the star's temperature would affect the **entire spectrum** (e.g., the blackbody curve), not just a specific spectral line. The position of a specific line (like the hydrogen line) is not directly related to the star's temperature. So this is **not the correct conclusion**.\n\n- **Option B: The star is getting colder.** \n Similar to option A, a change in temperature would not cause a specific spectral line to shift in wavelength. This is also **not the correct conclusion**.\n\n- **Option D: The star is moving away from us.** \n A redshift (longer wavelength) would indicate the star is moving away. Since the observed wavelength is **shorter**, this is **not the case**.\n\n- **Option C: The star is moving toward us.** \n A **blueshift** (shorter wavelength) is the result of the star moving **towards** the observer. This is **consistent** with the observed data.\n\n---\n\n### Final Answer\n\n$$\n\\boxed{C}\n$$"]
Version info
To support enable_thinking , I just install lighteval from source and the version info show its 0.13.1.dev0