
# Qwen3.5-27B: Reasoning (<think>) reduced greatly when tools are present in the request #89

@Tyrannius

Description


When tools are included in the API request payload, Qwen3.5-27B drastically reduces its thinking tokens: the <think> block contains only a few sentences. The same prompt without tools produces 45-65 seconds of deep, multi-step reasoning. This is not a gradual degradation but a binary on/off behavior, which makes it impossible to use tool calling and reasoning together.

This severely limits agentic use cases where the model needs to think deeply about a problem before deciding whether to call a tool or answer directly.

Analysis

The chat template's tool instruction block contains:

"If you choose to call a function ONLY reply in the following format with NO suffix"
"You may provide optional reasoning for your function call"

We patched these instructions to encourage reasoning ("Think carefully and reason thoroughly..."), but the behavior did not change. This indicates the suppression is not driven by the chat template instructions; it appears to be a trained behavior embedded in the model weights. When the model sees <tools> in its tokenized input, it switches to a "fast tool-call mode" that bypasses extended reasoning entirely.
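One way to see what the model actually receives is to render the prompt with and without tools and compare. The snippet below uses a simplified stand-in template (NOT Qwen's real Jinja chat template) purely to illustrate where the <tools> block is injected; with `transformers`, the equivalent check is `tokenizer.apply_chat_template(messages, tools=..., tokenize=False)`.

```python
import json

# Simplified stand-in for a Qwen-style chat template -- illustrative only,
# not the real template shipped with the model.
def render_prompt(user_msg, tools=None):
    parts = ["<|im_start|>system\nYou are a helpful assistant."]
    if tools:
        tool_json = "\n".join(json.dumps(t) for t in tools)
        # This injected block is the trigger for the behavior described above.
        parts.append("<tools>\n" + tool_json + "\n</tools>")
    parts.append("<|im_end|>")
    parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

dummy_tool = {"type": "function",
              "function": {"name": "web_search",
                           "description": "Search the web",
                           "parameters": {"type": "object",
                                          "properties": {"query": {"type": "string"}}}}}

without_tools = render_prompt("Should I walk or drive?")
with_tools = render_prompt("Should I walk or drive?", tools=[dummy_tool])

# The rendered prompts differ only in the <tools> block.
assert "<tools>" in with_tools and "<tools>" not in without_tools
```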

Impact

This creates a fundamental conflict for agentic deployments:

  • With tools enabled: The model can call functions but cannot reason deeply about complex questions
  • Without tools enabled: The model reasons brilliantly but cannot call any functions

There is no middle ground. Users must choose between intelligence and capability.

Expected behavior

The model should be able to reason thoroughly even when tools are available, especially when:

  1. The question does not require any tool calls
  2. The question requires deep logical reasoning before deciding whether to use a tool
  3. enable_thinking: true is explicitly set

Suggested improvement

  • Allow configurable reasoning depth independently of tool presence
  • Or: Only suppress extended reasoning after the model has decided to make a tool call, not preemptively when tools are merely available in the schema
  • Or: Respect thinking_budget in chat_template_kwargs even when tools are present
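For the third option, vLLM's OpenAI-compatible server already accepts a chat_template_kwargs object in the request body; a request honoring a reasoning budget might look like the fragment below. Note that thinking_budget here is the knob proposed in this issue, not an existing parameter, and enable_thinking is the kwarg Qwen chat templates already recognize:

```json
{
  "model": "Kbenkhaled/Qwen3.5-27B-NVFP4",
  "messages": [{"role": "user", "content": "..."}],
  "tools": [{"type": "function", "function": {"name": "web_search", "description": "Search the web", "parameters": {"type": "object", "properties": {"query": {"type": "string"}}}}}],
  "chat_template_kwargs": {
    "enable_thinking": true,
    "thinking_budget": 4096
  }
}
```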

Comparison data

| Setup | Reasoning duration | Reasoning tokens | Correct answer |
|---|---|---|---|
| No tools, with system prompt | 47-66 s | ~3000-5000 | ✅ Yes |
| With tools (even a single dummy tool) | 2-4 s | 300-500 | ⚠️ Correct but shallow |
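The two setups above differ in exactly one field of the request body. A minimal sketch for keeping that variable controlled when scripting the comparison (model name and payload fields are the ones from the reproduction below):

```python
# Build identical request payloads that differ only in the "tools" key,
# so the reasoning-length difference can be attributed to tool presence alone.
DUMMY_TOOL = {"type": "function",
              "function": {"name": "web_search",
                           "description": "Search the web",
                           "parameters": {"type": "object",
                                          "properties": {"query": {"type": "string"}}}}}

def build_payload(prompt: str, with_tools: bool) -> dict:
    payload = {
        "model": "Kbenkhaled/Qwen3.5-27B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": 8192,
    }
    if with_tools:
        payload["tools"] = [DUMMY_TOOL]
    return payload

a = build_payload("Should I walk or drive?", with_tools=False)
b = build_payload("Should I walk or drive?", with_tools=True)
assert set(b) - set(a) == {"tools"}  # the only delta between the two runs
```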

Reproduction

Test prompt

Die Autowaschanlage ist 2 Minuten weit weg. Soll ich zu Fuß oder mit dem Auto hin?

(Translation: "The car wash is 2 minutes away. Should I walk or drive?")

This is a logic puzzle — the correct answer requires reasoning that the car must physically be at the car wash to be washed, therefore you must drive.

Without tools (full reasoning works)

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer test" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kbenkhaled/Qwen3.5-27B-NVFP4",
    "messages": [{"role": "user", "content": "Die Autowaschanlage ist 2 Minuten weit weg. Soll ich zu Fuß oder mit dem Auto hin?"}],
    "stream": false,
    "max_tokens": 8192
  }'
```

Result: reasoning_content contains ~2000-4000 tokens of deep multi-step reasoning (47-66 seconds). The model correctly identifies the constraint that the car must be at the wash.

With tools (reasoning suppressed)

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer test" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kbenkhaled/Qwen3.5-27B-NVFP4",
    "messages": [{"role":"user","content":"Die Autowaschanlage ist 2 Minuten weit weg. Soll ich zu Fuß oder mit dem Auto hin?"}],
    "stream": false,
    "max_tokens": 8192,
    "tools": [{"type":"function","function":{"name":"web_search","description":"Search the web","parameters":{"type":"object","properties":{"query":{"type":"string"}}}}}]
  }' | python3 -c "
import sys, json
d = json.load(sys.stdin)
m = d['choices'][0]['message']
print('=== REASONING FIELD CHECK ===')
# Check BOTH fields: the reasoning parser may populate either one
rc = m.get('reasoning_content', '') or ''
r = m.get('reasoning', '') or ''
c = m.get('content', '') or ''
print(f'reasoning_content: {len(rc)} chars')
print(f'reasoning: {len(r)} chars')
print(f'content: {len(c)} chars')
if r:
    print(f'reasoning[:500]: {r[:500]}')
if rc:
    print(f'reasoning_content[:500]: {rc[:500]}')
print(f'Usage: {d.get(\"usage\", {})}')
"
```
```
=== REASONING FIELD CHECK ===
reasoning_content: 0 chars
reasoning: 884 chars
content: 914 chars
reasoning[:500]: The user is asking whether they should walk or drive to a car wash that is 2 minutes away. This is a decision that depends on various factors like convenience, time, and practical considerations.
Let me think about this logically:
1. Distance: 2 minutes away - this is very close
2. If it's 2 minutes walking distance, that's roughly 150-200 meters
3. If it's 2 minutes driving distance, that's also very close
For such a short distance, walking is usually the better choice because:
- No need to
Usage: {'prompt_tokens': 291, 'total_tokens': 754, 'completion_tokens': 463, 'prompt_tokens_details': None}
```

Result: the reasoning output collapses to ~884 characters (and lands in the reasoning field; reasoning_content is empty). The model jumps almost directly to generating content with minimal reasoning. Total completion is ~460 tokens, duration ~3 seconds.
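Because the behavior is binary rather than gradual, suppressed runs are easy to flag automatically by the size of the reasoning fields. A minimal regression-check sketch (the 2000-character threshold is illustrative, derived from the comparison data above):

```python
def reasoning_mode(message: dict, shallow_max_chars: int = 2000) -> str:
    """Classify a chat completion message as 'deep', 'shallow', or 'none'
    by how much text landed in its reasoning fields.

    Depending on the vLLM reasoning parser, the text may appear in either
    'reasoning_content' or 'reasoning', so both fields are checked.
    """
    text = message.get("reasoning_content") or message.get("reasoning") or ""
    if not text:
        return "none"
    return "shallow" if len(text) <= shallow_max_chars else "deep"

# Mock message mimicking the suppressed response shown above
# (real run: 884 chars in 'reasoning', empty 'reasoning_content'):
suppressed = {"reasoning": "The user is asking whether they should walk or drive...",
              "content": "..."}
print(reasoning_mode(suppressed))  # -> shallow
```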

Environment

  • Model: Kbenkhaled/Qwen3.5-27B-NVFP4 (also reproducible with other Qwen3.5 variants)
  • Inference: vLLM v0.17.1 (vllm/vllm-openai:v0.17.1-cu130)
  • GPU: NVIDIA RTX PRO 6000 Blackwell (96GB)
  • vLLM flags: --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

Known Issue

  • The issue has not already been addressed in the documentation, existing issues, or discussions.

Metadata

Labels: enhancement (New feature or request)