Description
When tools are included in the API request payload, Qwen3.5-27B sharply reduces thinking tokens — the `<think>` block contains only a few sentences. The same prompt without tools produces 45-65 seconds of deep, multi-step reasoning. This is not a gradual degradation — it is a binary on/off behavior that makes it impossible to use tool-calling and reasoning together.
This severely limits agentic use cases where the model needs to think deeply about a problem before deciding whether to call a tool or answer directly.
Analysis
The chat template's tool instruction block contains:
- "If you choose to call a function ONLY reply in the following format with NO suffix"
- "You may provide optional reasoning for your function call"
We patched these instructions to encourage reasoning ("Think carefully and reason thoroughly..."), but the behavior did not change. This confirms the suppression is not driven by the chat template instructions — it is a trained behavior embedded in the model weights. When the model sees `<tools>` in its tokenized input, it switches to a "fast tool-call mode" that bypasses extended reasoning entirely.
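The mechanism can be illustrated with a deliberately simplified stand-in for the chat template (this is not the real Qwen template — just a sketch of how the `<tools>` block enters the rendered prompt):

```python
import json

def render_prompt(user_msg, tools=None):
    # Hypothetical, heavily simplified stand-in for a Qwen-style chat
    # template; it only mimics how a <tools> block enters the prompt.
    parts = ["<|im_start|>system\nYou are a helpful assistant."]
    if tools:
        tool_json = "\n".join(json.dumps(t) for t in tools)
        parts.append(
            "<tools>\n" + tool_json + "\n</tools>\n"
            "If you choose to call a function ONLY reply in the "
            "following format with NO suffix"
        )
    parts.append("<|im_end|>\n<|im_start|>user\n" + user_msg + "<|im_end|>")
    return "\n".join(parts)

with_tools = render_prompt("hi", tools=[{"name": "web_search"}])
without_tools = render_prompt("hi")
print("<tools>" in with_tools, "<tools>" in without_tools)  # True False
```

Since patching the instruction text around this block changed nothing, the trigger appears to be the presence of the `<tools>` tokens themselves, not the wording surrounding them.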
Impact
This creates a fundamental conflict for agentic deployments:
- With tools enabled: The model can call functions but cannot reason deeply about complex questions
- Without tools enabled: The model reasons brilliantly but cannot call any functions
There is no middle ground. Users must choose between intelligence and capability.
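Until this is fixed at the model level, one possible client-side workaround (a sketch, not an official vLLM or Qwen feature) is a two-pass request: pass 1 obtains full reasoning with no tools, pass 2 re-sends the question together with that reasoning and the tool schema. The payload construction could look like:

```python
def build_two_pass_payloads(model, user_msg, tools):
    """Hypothetical two-pass workaround: reason first without tools,
    then decide on tool use with the pass-1 reasoning as context."""
    pass1 = {  # tools deliberately omitted so full reasoning is triggered
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": False,
    }

    def pass2(reasoning_text):
        # Feed the pass-1 reasoning back in, now with tools enabled.
        return {
            "model": model,
            "messages": [
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": "Prior analysis: " + reasoning_text},
                {"role": "user", "content": "Based on that analysis, answer directly or call a tool."},
            ],
            "tools": tools,
            "stream": False,
        }

    return pass1, pass2
```

This roughly doubles latency and token cost per turn, which is exactly why a model- or template-level fix is preferable.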
Expected behavior
The model should be able to reason thoroughly even when tools are available, especially when:
- The question does not require any tool calls
- The question requires deep logical reasoning before deciding whether to use a tool
- `enable_thinking: true` is explicitly set
Suggested improvement
- Allow configurable reasoning depth independently of tool presence
- Or: Only suppress extended reasoning after the model has decided to make a tool call, not preemptively when tools are merely available in the schema
- Or: Respect `thinking_budget` in `chat_template_kwargs` even when tools are present
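For the third option, vLLM already forwards `chat_template_kwargs` from the request body to the template, so a request exercising it could look like the sketch below — honoring `thinking_budget` when `tools` is also present is precisely what does not work today:

```python
# Sketch of a request payload combining tools with chat_template_kwargs.
# Desired behavior: the thinking kwargs should still take effect even
# though "tools" is present in the same payload.
payload = {
    "model": "Kbenkhaled/Qwen3.5-27B-NVFP4",
    "messages": [{"role": "user", "content": "Die Autowaschanlage ist 2 Minuten weit weg. Soll ich zu Fuß oder mit dem Auto hin?"}],
    "tools": [{"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}}}}}],
    "chat_template_kwargs": {"enable_thinking": True, "thinking_budget": 4096},
}
```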
Comparison data
| Setup | Reasoning duration | Reasoning tokens | Correct answer |
|---|---|---|---|
| No tools, with system prompt | 47-66s | ~3000-5000 | ✅ Yes |
| With tools (even a single dummy tool) | 2-4s | 300-500 | ⚠️ Correct but shallow |
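The gap in the table is wide enough that the suppression can be flagged automatically, e.g. in a regression test against the server. A minimal sketch using the rough token ranges measured above (thresholds are illustrative, taken from this report's numbers):

```python
def reasoning_mode(reasoning_tokens, suppressed_max=500, full_min=3000):
    """Classify a completion by reasoning-token count, using the rough
    ranges from the comparison table (illustrative thresholds)."""
    if reasoning_tokens <= suppressed_max:
        return "suppressed"
    if reasoning_tokens >= full_min:
        return "full"
    return "intermediate"

print(reasoning_mode(400))   # suppressed (with-tools range)
print(reasoning_mode(4000))  # full (no-tools range)
```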
Reproduction
Test prompt
Die Autowaschanlage ist 2 Minuten weit weg. Soll ich zu Fuß oder mit dem Auto hin?
(Translation: "The car wash is 2 minutes away. Should I walk or drive?")
This is a logic puzzle — the correct answer requires reasoning that the car must physically be at the car wash to be washed, therefore you must drive.
Without tools (full reasoning works)
```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer test" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kbenkhaled/Qwen3.5-27B-NVFP4",
    "messages": [{"role": "user", "content": "Die Autowaschanlage ist 2 Minuten weit weg. Soll ich zu Fuß oder mit dem Auto hin?"}],
    "stream": false,
    "max_tokens": 8192
  }'
```
Result: reasoning_content contains ~2000-4000 tokens of deep multi-step reasoning (47-66 seconds). The model correctly identifies the constraint that the car must be at the wash.
With tools (reasoning suppressed)
```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer test" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kbenkhaled/Qwen3.5-27B-NVFP4",
    "messages": [{"role":"user","content":"Die Autowaschanlage ist 2 Minuten weit weg. Soll ich zu Fuß oder mit dem Auto hin?"}],
    "stream": false,
    "max_tokens": 8192,
    "tools": [{"type":"function","function":{"name":"web_search","description":"Search the web","parameters":{"type":"object","properties":{"query":{"type":"string"}}}}}]
  }' | python3 -c "
import sys, json
d = json.load(sys.stdin)
m = d['choices'][0]['message']
# Check BOTH fields
rc = m.get('reasoning_content', '') or ''
r = m.get('reasoning', '') or ''
c = m.get('content', '') or ''
print(f'reasoning_content: {len(rc)} chars')
print(f'reasoning: {len(r)} chars')
print(f'content: {len(c)} chars')
if r:
    print(f'reasoning[:500]: {r[:500]}')
if rc:
    print(f'reasoning_content[:500]: {rc[:500]}')
print(f'Usage: {d.get(\"usage\", {})}')
"
```
```text
=== REASONING FIELD CHECK ===
reasoning_content: 0 chars
reasoning: 884 chars
content: 914 chars
reasoning[:500]: The user is asking whether they should walk or drive to a car wash that is 2 minutes away. This is a decision that depends on various factors like convenience, time, and practical considerations.
Let me think about this logically:
1. Distance: 2 minutes away - this is very close
2. If it's 2 minutes walking distance, that's roughly 150-200 meters
3. If it's 2 minutes driving distance, that's also very close
For such a short distance, walking is usually the better choice because:
- No need to
Usage: {'prompt_tokens': 291, 'total_tokens': 754, 'completion_tokens': 463, 'prompt_tokens_details': None}
```
Result: `reasoning_content` is empty and `reasoning` holds only 884 chars (a few hundred tokens). The model jumps almost directly to generating content with minimal reasoning. Total output is ~460 completion tokens, duration ~3 seconds.
Environment
- Model: `Kbenkhaled/Qwen3.5-27B-NVFP4` (also reproducible with other Qwen3.5 variants)
- Inference: vLLM v0.17.1 (`vllm/vllm-openai:v0.17.1-cu130`)
- GPU: NVIDIA RTX PRO 6000 Blackwell (96 GB)
- vLLM flags: `--reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder`