alex-treBENCH!

Jeopardy Language Model Benchmarking System.

🎮 Benchmark LLMs with Jeopardy! questions. Tournament-style testing for large language models. What is... your model's true performance?

Python 3.8+ · License: MIT · Tests

A comprehensive benchmarking application that evaluates language models using Jeopardy questions from Kaggle, providing statistically significant and repeatable performance analysis through OpenRouter's API.

🎯 Project Overview

This system is designed to:

  • ✅ Test multiple language models simultaneously using authentic Jeopardy questions
  • ✅ Provide statistically significant benchmarking with proper sampling methodologies
  • ✅ Measure key performance metrics: accuracy, response speed, cost efficiency, and consistency
  • ✅ Generate comprehensive reports with category and difficulty-level analysis
  • ✅ Support both a CLI interface and future expansion to a web interface

📋 Key Features

Core Capabilities

  • Statistical Sampling: Scientifically valid question selection ensuring 95% confidence level
  • Fuzzy Answer Matching: Intelligent answer evaluation handling variations and formats
  • Multi-Model Support: Concurrent testing of 5-10 language models via OpenRouter API
  • Comprehensive Metrics: Accuracy, latency, tokens/second, cost analysis, and consistency tracking
  • Category Analysis: Performance breakdown by Jeopardy categories and difficulty levels
  • Reproducible Results: Deterministic benchmarking with configurable parameters

Performance Metrics Tracked

  • Response accuracy (correct/incorrect with confidence scoring)
  • Response speed (latency and tokens per second)
  • Cost per query and cost-effectiveness ratios
  • Model consistency across similar question types
  • Category-specific performance analysis
  • Difficulty-level performance based on Jeopardy dollar values

๐Ÿ—๏ธ Architecture Overview

System Components

graph TB
    A[Data Ingestion Layer] --> B[Question Selection Engine]
    B --> C[Model Testing Engine]
    C --> D[Answer Evaluation Engine]
    D --> E[Metrics Calculation Engine]
    E --> F[Results Storage Layer]
    F --> G[Reporting & Analytics]

    H[OpenRouter API] --> C
    I[Kaggle Dataset] --> A
    J[SQLite Database] --> F

Technology Stack

  • Backend: Python 3.8+ with async/await support
  • Database: SQLite with SQLAlchemy ORM
  • API Integration: OpenRouter via aiohttp (see the sketch after this list)
  • Data Processing: Pandas, NumPy for statistical analysis
  • Text Matching: FuzzyWuzzy with Levenshtein distance
  • CLI Interface: Click with Rich for enhanced output
  • Testing: Pytest with async support
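
The OpenRouter integration amounts to an authenticated POST against the chat-completions endpoint. Below is a minimal sketch of such a call, assuming the public https://openrouter.ai/api/v1/chat/completions endpoint and an OPENROUTER_API_KEY environment variable; it is an illustration only, not the client that ships in src/models/.

# Minimal async OpenRouter call (illustration only, not the project's client).
# Assumes OPENROUTER_API_KEY is set and the standard chat-completions endpoint.
import asyncio
import os

import aiohttp

async def ask_model(model: str, question: str) -> str:
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    payload = {"model": model, "messages": [{"role": "user", "content": question}]}
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.post(
            "https://openrouter.ai/api/v1/chat/completions", json=payload
        ) as resp:
            resp.raise_for_status()
            data = await resp.json()
            return data["choices"][0]["message"]["content"]

# asyncio.run(ask_model("openai/gpt-3.5-turbo", "This metal's symbol is Au."))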

๐Ÿ“ Project Structure

alex-trebench/
├── config/                    # Configuration files (YAML)
│   ├── default.yaml           # Main configuration
│   └── models/                # Model-specific settings
├── src/
│   ├── main.py                # CLI entry point (alex command)
│   ├── core/                  # Foundation components
│   ├── data/                  # Data ingestion and preprocessing
│   ├── models/                # LLM API clients and adapters
│   ├── evaluation/            # Answer matching and grading
│   ├── benchmark/             # Execution engine and reporting
│   ├── storage/               # Database models and repositories
│   ├── cli/                   # Command-line interface
│   ├── commands/              # Command implementations
│   └── utils/                 # Shared utilities
├── tests/                     # Comprehensive test suite
│   ├── unit/                  # Unit tests
│   ├── integration/           # Integration tests
│   └── e2e/                   # End-to-end tests
├── docs/                      # Documentation
│   ├── USER_GUIDE.md          # Complete user guide
│   └── API_REFERENCE.md       # API documentation
├── scripts/                   # Utility scripts
├── examples/                  # Usage examples
└── data/                      # Local data storage and cache

📖 Documentation

User Documentation

  • User Guide: Complete user guide with installation, configuration, and usage examples
  • API Reference: Comprehensive API documentation with code examples

Technical Documentation

Key Specifications

Statistical Sampling

  • Sample Size: 1000 questions for statistical significance (95% confidence, 5% margin of error)
  • Stratified Sampling: Proportional representation across categories and difficulty levels (see the sketch below)
  • Reproducibility: Configurable random seed for consistent benchmark runs
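
A rough illustration of the stratified, seeded sampling described above, using pandas (the category column name and the seed of 42 are assumptions for the example, not the project's actual schema or defaults):

# Proportional (stratified) sampling sketch: each Jeopardy category contributes
# questions in proportion to its share of the dataset, with a fixed seed so the
# same run can be reproduced. The "category" column name is illustrative.
import pandas as pd

def stratified_sample(questions: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    frac = n / len(questions)
    return questions.groupby("category", group_keys=False).apply(
        lambda grp: grp.sample(frac=frac, random_state=seed)
    )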

Answer Evaluation Methodology

  • Multi-level Matching: Exact match, normalized comparison, and semantic similarity (a condensed sketch follows this list)
  • Fuzzy Scoring: Weighted combination of similarity metrics with confidence thresholds
  • Format Flexibility: Handles Jeopardy answer format variations and common response patterns
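
A condensed sketch of this cascade, using FuzzyWuzzy's token_sort_ratio as the fuzzy scorer (the normalization rules and the 85 threshold are illustrative assumptions, not the project's exact settings):

# Illustrative answer-matching cascade: exact match, then normalized comparison,
# then a fuzzy-similarity fallback. Threshold and normalization are assumptions.
import re

from fuzzywuzzy import fuzz

def normalize(text: str) -> str:
    # Lowercase, drop a leading "What is / Who are ..." prefix, strip punctuation.
    text = text.lower().strip()
    text = re.sub(r"^(what|who|where|when)\s+(is|are|was|were)\s+", "", text)
    return re.sub(r"[^a-z0-9 ]", "", text).strip()

def is_correct(response: str, answer: str, threshold: int = 85) -> bool:
    if response == answer:                        # exact match
        return True
    if normalize(response) == normalize(answer):  # normalized comparison
        return True
    return fuzz.token_sort_ratio(normalize(response), normalize(answer)) >= threshold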

Performance Metrics

# Core metrics calculated (per-question series held as NumPy arrays)
import numpy as np

accuracy_rate = correct_answers / total_questions
avg_response_time = np.mean(response_times_ms)
tokens_per_second = np.mean(tokens_generated / response_times_s)
cost_per_correct = total_cost / correct_answers
consistency_score = 1 - np.std(response_times_ms) / np.mean(response_times_ms)
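
For example, a model whose responses average 1,000 ms with a standard deviation of 110 ms would score 1 - 110/1,000 = 0.89 on consistency.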

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • uv (recommended) or pip for package management
  • OpenRouter API key (get one at openrouter.ai)
  • Internet connection for API access

Installation

# Clone the repository
git clone <repository-url>
cd alex-trebench

# Install using uv (recommended)
uv pip install -e .

# Or using pip
pip install -e .

Configuration

# Set up environment variables
export OPENROUTER_API_KEY="your_api_key_here"

# Or create .env file
echo "OPENROUTER_API_KEY=your_api_key_here" > .env

# Initialize the database
alex init

Basic Usage

# Run a quick benchmark (50 questions)
alex benchmark run --model openai/gpt-3.5-turbo --size quick

# Run a standard benchmark (200 questions)
alex benchmark run --model openai/gpt-4 --size standard

# Compare multiple models
alex benchmark compare --models "openai/gpt-3.5-turbo,openai/gpt-4" --size quick

# View benchmark history
alex benchmark history --model openai/gpt-4

# Generate a report
alex benchmark report --run-id 1 --format markdown

📊 Sample Output

Benchmark Results Summary

┌─────────────────┬──────────┬─────────────┬──────────────┬─────────────┐
│ Model           │ Accuracy │ Avg Time    │ Cost/Query   │ Consistency │
├─────────────────┼──────────┼─────────────┼──────────────┼─────────────┤
│ gpt-4-turbo     │ 73.2%    │ 1,240ms     │ $0.003       │ 0.89        │
│ claude-3-sonnet │ 71.8%    │ 980ms       │ $0.002       │ 0.92        │
│ gpt-3.5-turbo   │ 64.5%    │ 650ms       │ $0.001       │ 0.85        │
└─────────────────┴──────────┴─────────────┴──────────────┴─────────────┘

Category Performance:
• Science & Technology: GPT-4 (78%) > Claude-3 (75%) > GPT-3.5 (68%)
• History: Claude-3 (74%) > GPT-4 (72%) > GPT-3.5 (63%)
• Literature: GPT-4 (69%) > Claude-3 (67%) > GPT-3.5 (59%)

📈 Implementation Status

✅ Completed Features

  • Core Infrastructure: Complete project setup with modular architecture
  • Data Pipeline: Kaggle integration, preprocessing, and statistical sampling
  • Model Integration: OpenRouter API client with support for 20+ models
  • Benchmark Engine: Complete benchmarking workflow with async processing
  • Evaluation System: Fuzzy answer matching, grading, and metrics calculation
  • Database Layer: SQLite with SQLAlchemy ORM and migration support
  • CLI Interface: Comprehensive command-line interface with Rich formatting
  • Reporting System: Multiple output formats (terminal, markdown, JSON)
  • Testing Suite: Unit, integration, and end-to-end tests with 80%+ coverage
  • Documentation: Complete user guide and API reference

🚧 Current Development

  • Performance Optimization: Memory usage and concurrent processing improvements
  • Web Interface: Optional FastAPI-based REST API (future enhancement)
  • Advanced Analytics: Trend analysis and model comparison tools

📋 Usage Examples

Single Model Benchmark

# Quick test with GPT-3.5-turbo
alex benchmark run --model openai/gpt-3.5-turbo --size quick

# Comprehensive evaluation with GPT-4
alex benchmark run \
  --model openai/gpt-4 \
  --size comprehensive \
  --name "GPT-4 Comprehensive Test" \
  --report-format markdown \
  --output gpt4_report.md

Model Comparison

# Compare popular models
alex benchmark compare \
  --models "openai/gpt-3.5-turbo,openai/gpt-4,anthropic/claude-3-haiku" \
  --size standard \
  --concurrent-limit 3

# Generate comparison report
alex benchmark compare \
  --models "openai/gpt-4,anthropic/claude-3-sonnet" \
  --size quick \
  --report-format json \
  --output model_comparison.json
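
The --concurrent-limit flag caps how many requests are in flight at once. Conceptually this corresponds to wrapping each per-question API call in an asyncio semaphore, roughly as sketched below (ask_model is the illustrative helper from the Technology Stack section, not the engine's real code):

# Sketch of bounding concurrency with asyncio.Semaphore; ask_model() is the
# illustrative OpenRouter helper shown earlier, not the benchmark engine itself.
import asyncio
from typing import List

async def run_questions(model: str, questions: List[str], concurrent_limit: int = 3) -> List[str]:
    semaphore = asyncio.Semaphore(concurrent_limit)

    async def ask_limited(question: str) -> str:
        async with semaphore:
            return await ask_model(model, question)

    return await asyncio.gather(*(ask_limited(q) for q in questions))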

Advanced Configuration

# Custom benchmark with specific settings
alex benchmark run \
  --model openai/gpt-4 \
  --size custom \
  --sample-size 500 \
  --timeout 120 \
  --grading-mode lenient \
  --name "Custom Benchmark" \
  --description "Testing with custom parameters"

Data Management

# Initialize dataset
alex data init

# Sample questions by category
alex data sample \
  --category "SCIENCE" \
  --size 100 \
  --output science_questions.json

# View dataset statistics
alex data stats

Model Management

# List all available models
alex models list

# Test model connectivity
alex models test --model openai/gpt-3.5-turbo

# Estimate costs
alex models costs --model openai/gpt-4 --questions 1000
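
Cost estimates of this kind are simple arithmetic over per-token prices. A hypothetical example (the token counts and prices below are placeholders, not real OpenRouter rates):

# Hypothetical cost estimate: questions * (prompt tokens * input price
# + completion tokens * output price). Every number here is a placeholder.
questions = 1000
avg_prompt_tokens, avg_completion_tokens = 120, 40
price_in_per_1k, price_out_per_1k = 0.01, 0.03    # USD per 1K tokens (placeholder)

estimated_cost = questions * (
    avg_prompt_tokens / 1000 * price_in_per_1k
    + avg_completion_tokens / 1000 * price_out_per_1k
)
print(f"Estimated cost: ${estimated_cost:.2f}")   # -> Estimated cost: $2.40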

🧪 Testing & Verification

Quick System Verification

Verify that your alex-treBENCH installation is working correctly:

# Quick verification script
./scripts/quick_test.sh

# Or run the smoke test directly
python scripts/smoke_test.py

# Using Make
make smoke-test

Smoke Test

The smoke test provides complete end-to-end verification of the alex-treBENCH system:

  • ✅ Database initialization - Creates and verifies database schema
  • ✅ Sample data loading - Loads test questions into database
  • ✅ API connectivity - Tests OpenRouter integration (real or simulated)
  • ✅ Benchmark execution - Runs minimal benchmark with 3 questions
  • ✅ Report generation - Creates and validates performance reports
  • ✅ System health - Verifies all critical components

Cost: ~$0.001-0.005 per run with API key, $0.00 in simulation mode

Test Categories

# Comprehensive test suite
make test              # All tests
make test-coverage     # With coverage report
make test-unit         # Unit tests only
make test-integration  # Integration tests
make test-e2e          # End-to-end tests

# Component-specific testing
make test-agents       # Individual component tests
python scripts/test_agents.py

Expected Output (Smoke Test Success)

🔥 alex-treBENCH Smoke Test
Running complete end-to-end system verification

✅ Setting up test environment...
✅ Initializing database...
✅ Loading sample data...
✅ Running minimal benchmark...
✅ Generating report...
✅ Verifying system health...

🎉 Smoke Test PASSED
alex-treBENCH system is working correctly!

Continuous Integration

Tests run automatically on:

  • Pull requests to main/develop branches
  • Pushes to main/develop branches
  • Manual workflow triggers with optional real API testing

See .github/workflows/smoke-test.yml for CI configuration.

Full Testing Documentation

For comprehensive testing information, troubleshooting, and advanced test scenarios:

📖 Complete Testing Guide

Covers:

  • Detailed test agent documentation
  • Troubleshooting common issues
  • Cost management strategies
  • Performance testing
  • Writing new tests
  • CI/CD integration

๐Ÿ™ Acknowledgments

  • Kaggle: For providing the Jeopardy dataset (aravindram11/jeopardy-dataset-updated)
  • OpenRouter: For unified language model API access
  • Jeopardy!: For creating the foundational question format that makes this benchmarking meaningful

📞 Support

For questions, issues, or contributions:

  • 📖 Read the User Guide for detailed usage instructions
  • 🔧 Check the API Reference for technical details
  • 🐛 Create an issue in the GitHub repository
  • 💬 Review the technical documentation in TECHNICAL_SPEC.md

🎉 Implementation Complete: This system is now fully implemented and production-ready. All core features are functional with comprehensive testing and documentation.
