🥰 Feature Description
This request builds on the potential model-alias bug (Issue #156). The 'thinking' or 'reasoning' features now offered by providers such as OpenAI and Anthropic deliver significant performance boosts, but they make direct comparisons between thinking-enabled and standard runs difficult.
🧐 Proposed Solution
Therefore, I propose creating two distinct tracks on the leaderboard:
- Base Capability Track (Non-Thinking):
  - All models run in their standard API mode, without any reasoning enhancements.
  - Focuses on fundamental performance, measuring latency and throughput metrics such as TTFT (Time to First Token) and TPS (Tokens per Second); a rough measurement sketch follows the list below.
- Peak Performance Track (Thinking-Enabled):
  - Explicitly enables reasoning features for all supported models (e.g., gpt-5-thinking, Claude's extended reasoning).
  - Showcases the upper limit of each model's capability in solving complex, multi-step problems.
  - Could report metrics like average reasoning tokens used.
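For concreteness, here is a minimal sketch of how a per-run harness could capture TTFT, TPS, and reasoning-token counts in one pass. It assumes the OpenAI Python SDK with streaming chat completions; `measure_stream` is a hypothetical helper name, TPS is computed as completion tokens over total wall time, and the `completion_tokens_details.reasoning_tokens` field is only populated by reasoning-capable models, so other providers would need their own adapters:

```python
import time

from openai import OpenAI  # assumes the OpenAI Python SDK; other providers need adapters

client = OpenAI()

def measure_stream(model: str, prompt: str) -> dict:
    """Hypothetical harness helper: run one streaming completion and time it."""
    start = time.perf_counter()
    first_token_at = None
    usage = None

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
    )
    for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage  # has completion_tokens; reasoning models also
                                 # report completion_tokens_details.reasoning_tokens
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # TTFT: first visible content

    elapsed = time.perf_counter() - start
    completion_tokens = usage.completion_tokens if usage else 0
    details = getattr(usage, "completion_tokens_details", None)
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "tps": completion_tokens / elapsed if elapsed > 0 else 0.0,
        "reasoning_tokens": getattr(details, "reasoning_tokens", None),
    }
```

Averaging `reasoning_tokens` across runs in the thinking-enabled track would give the "average reasoning tokens used" metric directly, while the same TTFT/TPS numbers keep the two tracks comparable.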