🥰 Feature Description
This request builds on the potential model-alias bug (Issue #156). The 'thinking' or 'reasoning' features now offered by providers such as OpenAI and Anthropic deliver significant performance boosts, but they make direct comparisons between thinking-enabled and standard runs difficult.
🧐 Proposed Solution
Therefore, I propose creating two distinct tracks on the leaderboard:
- Base Capability Track (Non-Thinking):
  - All models run in their standard API mode, without any reasoning enhancements.
  - Focuses on fundamental performance, measuring latency and throughput metrics such as TTFT (Time to First Token) and TPS (Tokens per Second); a rough measurement sketch follows the list below.
- Peak Performance Track (Thinking-Enabled):
  - Explicitly enables reasoning features for all supported models (e.g., gpt-5-thinking, Claude's extended reasoning).
  - Showcases the upper limit of each model's capability in solving complex, multi-step problems.
  - Could report metrics like average reasoning tokens used.
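For concreteness, here is a minimal sketch of how a per-run harness could capture TTFT, TPS, and reasoning-token counts in one pass. It assumes the OpenAI Python SDK with streaming chat completions; `measure_stream` is a hypothetical helper name, TPS is computed as completion tokens over total wall time, and the `completion_tokens_details.reasoning_tokens` field is only populated by reasoning-capable models, so other providers would need their own adapters:

```python
import time

from openai import OpenAI  # assumes the OpenAI Python SDK; other providers need adapters

client = OpenAI()

def measure_stream(model: str, prompt: str) -> dict:
    """Hypothetical harness helper: run one streaming completion and time it."""
    start = time.perf_counter()
    first_token_at = None
    usage = None

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
    )
    for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage  # has completion_tokens; reasoning models also
                                 # report completion_tokens_details.reasoning_tokens
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # TTFT: first visible content

    elapsed = time.perf_counter() - start
    completion_tokens = usage.completion_tokens if usage else 0
    details = getattr(usage, "completion_tokens_details", None)
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "tps": completion_tokens / elapsed if elapsed > 0 else 0.0,
        "reasoning_tokens": getattr(details, "reasoning_tokens", None),
    }
```

Averaging `reasoning_tokens` across runs in the thinking-enabled track would give the "average reasoning tokens used" metric directly, while the same TTFT/TPS numbers keep the two tracks comparable.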