Jeopardy Language Model Benchmarking System.
Benchmark LLMs with Jeopardy! questions. Tournament-style testing for large language models. What is... your model's true performance?
A comprehensive benchmarking application that evaluates language models using Jeopardy questions from Kaggle, providing statistically significant and repeatable performance analysis through OpenRouter's API.
This system is designed to:
- Test multiple language models simultaneously using authentic Jeopardy questions
- Provide statistically significant benchmarking with proper sampling methodologies
- Measure key performance metrics: accuracy, response speed, cost efficiency, and consistency
- Generate comprehensive reports with category and difficulty-level analysis
- Support both a CLI interface and future web interface expansion
- Statistical Sampling: Scientifically valid question selection ensuring 95% confidence level
- Fuzzy Answer Matching: Intelligent answer evaluation handling variations and formats
- Multi-Model Support: Concurrent testing of 5-10 language models via OpenRouter API
- Comprehensive Metrics: Accuracy, latency, tokens/second, cost analysis, and consistency tracking
- Category Analysis: Performance breakdown by Jeopardy categories and difficulty levels
- Reproducible Results: Deterministic benchmarking with configurable parameters
- Response accuracy (correct/incorrect with confidence scoring)
- Response speed (latency and tokens per second)
- Cost per query and cost-effectiveness ratios
- Model consistency across similar question types
- Category-specific performance analysis
- Difficulty-level performance based on Jeopardy dollar values
graph TB
A[Data Ingestion Layer] --> B[Question Selection Engine]
B --> C[Model Testing Engine]
C --> D[Answer Evaluation Engine]
D --> E[Metrics Calculation Engine]
E --> F[Results Storage Layer]
F --> G[Reporting & Analytics]
H[OpenRouter API] --> C
I[Kaggle Dataset] --> A
J[SQLite Database] --> F
- Backend: Python 3.8+ with async/await support
- Database: SQLite with SQLAlchemy ORM
- API Integration: OpenRouter via aiohttp (example call after this list)
- Data Processing: Pandas, NumPy for statistical analysis
- Text Matching: FuzzyWuzzy with Levenshtein distance
- CLI Interface: Click with Rich for enhanced output
- Testing: Pytest with async support
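OpenRouter exposes an OpenAI-compatible chat completions endpoint, which is what the aiohttp integration above talks to. The snippet below is a minimal sketch of what such a call can look like; the `ask_model` helper, the system prompt, and the payload shape are illustrative assumptions, not the project's actual adapter code.

```python
import os
import aiohttp

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

async def ask_model(session: aiohttp.ClientSession, model: str, clue: str) -> str:
    """Send a single Jeopardy clue to a model via OpenRouter (illustrative sketch)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer the Jeopardy clue in the form of a question."},
            {"role": "user", "content": clue},
        ],
    }
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    async with session.post(OPENROUTER_URL, json=payload, headers=headers) as resp:
        resp.raise_for_status()
        data = await resp.json()
        # OpenAI-compatible response shape: first choice's message content
        return data["choices"][0]["message"]["content"]
```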
alex-trebench/
├── config/                 # Configuration files (YAML)
│   ├── default.yaml        # Main configuration (illustrative example below)
│   └── models/             # Model-specific settings
├── src/
│   ├── main.py             # CLI entry point (alex command)
│   ├── core/               # Foundation components
│   ├── data/               # Data ingestion and preprocessing
│   ├── models/             # LLM API clients and adapters
│   ├── evaluation/         # Answer matching and grading
│   ├── benchmark/          # Execution engine and reporting
│   ├── storage/            # Database models and repositories
│   ├── cli/                # Command-line interface
│   │   └── commands/       # Command implementations
│   └── utils/              # Shared utilities
├── tests/                  # Comprehensive test suite
│   ├── unit/               # Unit tests
│   ├── integration/        # Integration tests
│   └── e2e/                # End-to-end tests
├── docs/                   # Documentation
│   ├── USER_GUIDE.md       # Complete user guide
│   └── API_REFERENCE.md    # API documentation
├── scripts/                # Utility scripts
├── examples/               # Usage examples
└── data/                   # Local data storage and cache
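For orientation, the main configuration file (config/default.yaml) could look roughly like the sketch below. The keys are illustrative guesses assembled from options mentioned elsewhere in this README (sample size, timeout, grading mode, concurrency, random seed), not the project's actual schema; consult the file in the repository for the real settings.

```yaml
# Illustrative sketch only -- see config/default.yaml in the repository for the real schema
database:
  url: sqlite:///data/benchmarks.db   # SQLite, per the technology stack
benchmark:
  sample_size: 1000                   # questions per run (95% confidence, 5% margin of error)
  random_seed: 42                     # fixed seed for reproducible runs
  concurrent_limit: 3                 # simultaneous model requests
  timeout_seconds: 120
evaluation:
  grading_mode: lenient               # or strict
  fuzzy_threshold: 85
openrouter:
  api_key_env: OPENROUTER_API_KEY
```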
- User Guide: Complete user guide with installation, configuration, and usage examples
- API Reference: Comprehensive API documentation with code examples
- Technical Specification: Complete system architecture, database schema, algorithms, and API integration patterns
- Project Structure: Detailed directory organization, module responsibilities, and technology stack
- Implementation Roadmap: Development phases, priorities, and delivery timeline
- Sample Size: 1000 questions for statistical significance (95% confidence, 5% margin of error; see the sketch after this list)
- Stratified Sampling: Proportional representation across categories and difficulty levels
- Reproducibility: Configurable random seed for consistent benchmark runs
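As a sanity check on the sample size above, the standard (Cochran) formula for estimating a proportion at a given confidence level and margin of error can be computed directly. This is a minimal illustrative sketch, not code from the project.

```python
import math

def required_sample_size(z: float = 1.96, margin_of_error: float = 0.05, p: float = 0.5) -> int:
    """Cochran's formula: minimum sample size for a proportion estimate.

    z = 1.96 corresponds to 95% confidence; p = 0.5 is the most conservative assumption.
    """
    return math.ceil((z ** 2) * p * (1 - p) / margin_of_error ** 2)

print(required_sample_size())  # 385 -- the default of 1,000 questions comfortably exceeds this minimum
```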
- Multi-level Matching: Exact match, normalized comparison, and semantic similarity (illustrated below)
- Fuzzy Scoring: Weighted combination of similarity metrics with confidence thresholds
- Format Flexibility: Handles Jeopardy answer format variations and common response patterns
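The sketch below illustrates the general idea: normalize Jeopardy-style responses first, then fall back to fuzzy similarity. It assumes the FuzzyWuzzy package listed in the technology stack; the helper names and the 85-point threshold are hypothetical, not the project's actual grading code.

```python
import re
from fuzzywuzzy import fuzz  # listed in the technology stack

def normalize(answer: str) -> str:
    """Lower-case, strip 'What/Who is...' prefixes, leading articles, and punctuation."""
    text = answer.lower().strip()
    text = re.sub(r"^(what|who|where|when)\s+(is|are|was|were)\s+", "", text)
    text = re.sub(r"^(the|a|an)\s+", "", text)
    return re.sub(r"[^\w\s]", "", text).strip()

def is_correct(model_answer: str, expected: str, threshold: int = 85) -> bool:
    """Exact match on normalized text first, then fuzzy token-sort similarity."""
    a, b = normalize(model_answer), normalize(expected)
    if a == b:
        return True
    return fuzz.token_sort_ratio(a, b) >= threshold
```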
# Core metrics calculated for each benchmark run
from statistics import mean, stdev

accuracy_rate = correct_answers / total_questions
avg_response_time = mean(response_times_ms)
tokens_per_second = mean(t / s for t, s in zip(tokens_generated, response_times_s))
cost_per_correct = total_cost / correct_answers
consistency_score = 1 - stdev(response_times_ms) / mean(response_times_ms)
- Python 3.8 or higher
- uv (recommended) or pip for package management
- OpenRouter API key (get one at openrouter.ai)
- Internet connection for API access
# Clone the repository
git clone <repository-url>
cd alex-trebench
# Install using uv (recommended)
uv pip install -e .
# Or using pip
pip install -e .
# Set up environment variables
export OPENROUTER_API_KEY="your_api_key_here"
# Or create .env file
echo "OPENROUTER_API_KEY=your_api_key_here" > .env
# Initialize the database
alex init
# Run a quick benchmark (50 questions)
alex benchmark run --model openai/gpt-3.5-turbo --size quick
# Run a standard benchmark (200 questions)
alex benchmark run --model openai/gpt-4 --size standard
# Compare multiple models
alex benchmark compare --models "openai/gpt-3.5-turbo,openai/gpt-4" --size quick
# View benchmark history
alex benchmark history --model openai/gpt-4
# Generate a report
alex benchmark report --run-id 1 --format markdown
┌─────────────────┬──────────┬──────────┬────────────┬─────────────┐
│ Model           │ Accuracy │ Avg Time │ Cost/Query │ Consistency │
├─────────────────┼──────────┼──────────┼────────────┼─────────────┤
│ gpt-4-turbo     │ 73.2%    │ 1,240ms  │ $0.003     │ 0.89        │
│ claude-3-sonnet │ 71.8%    │ 980ms    │ $0.002     │ 0.92        │
│ gpt-3.5-turbo   │ 64.5%    │ 650ms    │ $0.001     │ 0.85        │
└─────────────────┴──────────┴──────────┴────────────┴─────────────┘
Category Performance:
• Science & Technology: GPT-4 (78%) > Claude-3 (75%) > GPT-3.5 (68%)
• History: Claude-3 (74%) > GPT-4 (72%) > GPT-3.5 (63%)
• Literature: GPT-4 (69%) > Claude-3 (67%) > GPT-3.5 (59%)
- Core Infrastructure: Complete project setup with modular architecture
- Data Pipeline: Kaggle integration, preprocessing, and statistical sampling
- Model Integration: OpenRouter API client with support for 20+ models
- Benchmark Engine: Complete benchmarking workflow with async processing
- Evaluation System: Fuzzy answer matching, grading, and metrics calculation
- Database Layer: SQLite with SQLAlchemy ORM and migration support
- CLI Interface: Comprehensive command-line interface with Rich formatting
- Reporting System: Multiple output formats (terminal, markdown, JSON)
- Testing Suite: Unit, integration, and end-to-end tests with 80%+ coverage
- Documentation: Complete user guide and API reference
- Performance Optimization: Memory usage and concurrent processing improvements
- Web Interface: Optional FastAPI-based REST API (future enhancement)
- Advanced Analytics: Trend analysis and model comparison tools
# Quick test with GPT-3.5-turbo
alex benchmark run --model openai/gpt-3.5-turbo --size quick
# Comprehensive evaluation with GPT-4
alex benchmark run \
--model openai/gpt-4 \
--size comprehensive \
--name "GPT-4 Comprehensive Test" \
--report-format markdown \
--output gpt4_report.md
# Compare popular models
alex benchmark compare \
--models "openai/gpt-3.5-turbo,openai/gpt-4,anthropic/claude-3-haiku" \
--size standard \
--concurrent-limit 3
# Generate comparison report
alex benchmark compare \
--models "openai/gpt-4,anthropic/claude-3-sonnet" \
--size quick \
--report-format json \
--output model_comparison.json
# Custom benchmark with specific settings
alex benchmark run \
--model openai/gpt-4 \
--size custom \
--sample-size 500 \
--timeout 120 \
--grading-mode lenient \
--name "Custom Benchmark" \
--description "Testing with custom parameters"
# Initialize dataset
alex data init
# Sample questions by category
alex data sample \
--category "SCIENCE" \
--size 100 \
--output science_questions.json
# View dataset statistics
alex data stats
# List all available models
alex models list
# Test model connectivity
alex models test --model openai/gpt-3.5-turbo
# Estimate costs
alex models costs --model openai/gpt-4 --questions 1000
Verify your alex-treBENCH installation is working correctly:
# Quick verification script
./scripts/quick_test.sh
# Or run the smoke test directly
python scripts/smoke_test.py
# Using Make
make smoke-test
The smoke test provides complete end-to-end verification of the alex-treBENCH system:
- Database initialization: Creates and verifies the database schema
- Sample data loading: Loads test questions into the database
- API connectivity: Tests OpenRouter integration (real or simulated)
- Benchmark execution: Runs a minimal benchmark with 3 questions
- Report generation: Creates and validates performance reports
- System health: Verifies all critical components
Cost: ~$0.001-0.005 per run with API key, $0.00 in simulation mode
# Comprehensive test suite
make test # All tests
make test-coverage # With coverage report
make test-unit # Unit tests only
make test-integration # Integration tests
make test-e2e # End-to-end tests
# Component-specific testing
make test-agents # Individual component tests
python scripts/test_agents.py
alex-treBENCH Smoke Test
Running complete end-to-end system verification

✓ Setting up test environment...
✓ Initializing database...
✓ Loading sample data...
✓ Running minimal benchmark...
✓ Generating report...
✓ Verifying system health...

Smoke Test PASSED
alex-treBENCH system is working correctly!
Tests automatically run on:
- Pull requests to main/develop branches
- Pushes to main/develop branches
- Manual workflow triggers with optional real API testing
See .github/workflows/smoke-test.yml for CI configuration.
For comprehensive testing information, troubleshooting, and advanced test scenarios, see the project's testing documentation, which covers:
- Detailed test agent documentation
- Troubleshooting common issues
- Cost management strategies
- Performance testing
- Writing new tests
- CI/CD integration
- Kaggle: For providing the Jeopardy dataset (aravindram11/jeopardy-dataset-updated)
- OpenRouter: For unified language model API access
- Jeopardy!: For creating the foundational question format that makes this benchmarking meaningful
For questions, issues, or contributions:
- Read the User Guide for detailed usage instructions
- Check the API Reference for technical details
- Create an issue in the GitHub repository
- Review the technical documentation in TECHNICAL_SPEC.md
Implementation Complete: This system is now fully implemented and production-ready. All core features are functional with comprehensive testing and documentation.