Jeopardy Language Model Benchmarking System.
Benchmark LLMs with Jeopardy! questions. Tournament-style testing for large language models. What is... your model's true performance?
A comprehensive benchmarking application that evaluates language models using Jeopardy questions from Kaggle, providing statistically significant and repeatable performance analysis through OpenRouter's API.
This system is designed to:
- Test multiple language models simultaneously using authentic Jeopardy questions
- Provide statistically significant benchmarking with proper sampling methodologies
- Measure key performance metrics: accuracy, response speed, cost efficiency, and consistency
- Generate comprehensive reports with category and difficulty-level analysis
- Support both a CLI interface and future web interface expansion
- Statistical Sampling: Scientifically valid question selection ensuring 95% confidence level
- Fuzzy Answer Matching: Intelligent answer evaluation handling variations and formats
- Multi-Model Support: Concurrent testing of 5-10 language models via OpenRouter API
- Comprehensive Metrics: Accuracy, latency, tokens/second, cost analysis, and consistency tracking
- Category Analysis: Performance breakdown by Jeopardy categories and difficulty levels
- Reproducible Results: Deterministic benchmarking with configurable parameters
- Response accuracy (correct/incorrect with confidence scoring)
- Response speed (latency and tokens per second)
- Cost per query and cost-effectiveness ratios
- Model consistency across similar question types
- Category-specific performance analysis
- Difficulty-level performance based on Jeopardy dollar values
graph TB
A[Data Ingestion Layer] --> B[Question Selection Engine]
B --> C[Model Testing Engine]
C --> D[Answer Evaluation Engine]
D --> E[Metrics Calculation Engine]
E --> F[Results Storage Layer]
F --> G[Reporting & Analytics]
H[OpenRouter API] --> C
I[Kaggle Dataset] --> A
J[SQLite Database] --> F
- Backend: Python 3.8+ with async/await support
- Database: SQLite with SQLAlchemy ORM
- API Integration: OpenRouter via aiohttp (example call after this list)
- Data Processing: Pandas, NumPy for statistical analysis
- Text Matching: FuzzyWuzzy with Levenshtein distance
- CLI Interface: Click with Rich for enhanced output
- Testing: Pytest with async support
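OpenRouter exposes an OpenAI-compatible chat completions endpoint, which is what the aiohttp integration above talks to. The snippet below is a minimal sketch of what such a call can look like; the `ask_model` helper, the system prompt, and the payload shape are illustrative assumptions, not the project's actual adapter code.

```python
import os
import aiohttp

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

async def ask_model(session: aiohttp.ClientSession, model: str, clue: str) -> str:
    """Send a single Jeopardy clue to a model via OpenRouter (illustrative sketch)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer the Jeopardy clue in the form of a question."},
            {"role": "user", "content": clue},
        ],
    }
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    async with session.post(OPENROUTER_URL, json=payload, headers=headers) as resp:
        resp.raise_for_status()
        data = await resp.json()
        # OpenAI-compatible response shape: first choice's message content
        return data["choices"][0]["message"]["content"]
```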
alex-trebench/
├── config/                 # Configuration files (YAML)
│   ├── default.yaml        # Main configuration (illustrative example below)
│   └── models/             # Model-specific settings
├── src/
│   ├── main.py             # CLI entry point (alex command)
│   ├── core/               # Foundation components
│   ├── data/               # Data ingestion and preprocessing
│   ├── models/             # LLM API clients and adapters
│   ├── evaluation/         # Answer matching and grading
│   ├── benchmark/          # Execution engine and reporting
│   ├── storage/            # Database models and repositories
│   ├── cli/                # Command-line interface
│   │   └── commands/       # Command implementations
│   └── utils/              # Shared utilities
├── tests/                  # Comprehensive test suite
│   ├── unit/               # Unit tests
│   ├── integration/        # Integration tests
│   └── e2e/                # End-to-end tests
├── docs/                   # Documentation
│   ├── USER_GUIDE.md       # Complete user guide
│   └── API_REFERENCE.md    # API documentation
├── scripts/                # Utility scripts
├── examples/               # Usage examples
└── data/                   # Local data storage and cache
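For orientation, the main configuration file (config/default.yaml) could look roughly like the sketch below. The keys are illustrative guesses assembled from options mentioned elsewhere in this README (sample size, timeout, grading mode, concurrency, random seed), not the project's actual schema; consult the file in the repository for the real settings.

```yaml
# Illustrative sketch only -- see config/default.yaml in the repository for the real schema
database:
  url: sqlite:///data/benchmarks.db   # SQLite, per the technology stack
benchmark:
  sample_size: 1000                   # questions per run (95% confidence, 5% margin of error)
  random_seed: 42                     # fixed seed for reproducible runs
  concurrent_limit: 3                 # simultaneous model requests
  timeout_seconds: 120
evaluation:
  grading_mode: lenient               # or strict
  fuzzy_threshold: 85
openrouter:
  api_key_env: OPENROUTER_API_KEY
```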
- User Guide: Complete user guide with installation, configuration, and usage examples
- API Reference: Comprehensive API documentation with code examples
- Technical Specification: Complete system architecture, database schema, algorithms, and API integration patterns
- Project Structure: Detailed directory organization, module responsibilities, and technology stack
- Implementation Roadmap: Development phases, priorities, and delivery timeline
- Sample Size: 1000 questions for statistical significance (95% confidence, 5% margin of error; see the sketch after this list)
- Stratified Sampling: Proportional representation across categories and difficulty levels
- Reproducibility: Configurable random seed for consistent benchmark runs
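As a sanity check on the sample size above, the standard (Cochran) formula for estimating a proportion at a given confidence level and margin of error can be computed directly. This is a minimal illustrative sketch, not code from the project.

```python
import math

def required_sample_size(z: float = 1.96, margin_of_error: float = 0.05, p: float = 0.5) -> int:
    """Cochran's formula: minimum sample size for a proportion estimate.

    z = 1.96 corresponds to 95% confidence; p = 0.5 is the most conservative assumption.
    """
    return math.ceil((z ** 2) * p * (1 - p) / margin_of_error ** 2)

print(required_sample_size())  # 385 -- the default of 1,000 questions comfortably exceeds this minimum
```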
- Multi-level Matching: Exact match, normalized comparison, and semantic similarity (illustrated below)
- Fuzzy Scoring: Weighted combination of similarity metrics with confidence thresholds
- Format Flexibility: Handles Jeopardy answer format variations and common response patterns
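The sketch below illustrates the general idea: normalize Jeopardy-style responses first, then fall back to fuzzy similarity. It assumes the FuzzyWuzzy package listed in the technology stack; the helper names and the 85-point threshold are hypothetical, not the project's actual grading code.

```python
import re
from fuzzywuzzy import fuzz  # listed in the technology stack

def normalize(answer: str) -> str:
    """Lower-case, strip 'What/Who is...' prefixes, leading articles, and punctuation."""
    text = answer.lower().strip()
    text = re.sub(r"^(what|who|where|when)\s+(is|are|was|were)\s+", "", text)
    text = re.sub(r"^(the|a|an)\s+", "", text)
    return re.sub(r"[^\w\s]", "", text).strip()

def is_correct(model_answer: str, expected: str, threshold: int = 85) -> bool:
    """Exact match on normalized text first, then fuzzy token-sort similarity."""
    a, b = normalize(model_answer), normalize(expected)
    if a == b:
        return True
    return fuzz.token_sort_ratio(a, b) >= threshold
```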
# Core metrics calculated for each benchmark run
from statistics import mean, stdev

accuracy_rate = correct_answers / total_questions
avg_response_time = mean(response_times_ms)
tokens_per_second = mean(t / s for t, s in zip(tokens_generated, response_times_s))
cost_per_correct = total_cost / correct_answers
consistency_score = 1 - stdev(response_times_ms) / mean(response_times_ms)
- Python 3.8 or higher
- uv (recommended) or pip for package management
- OpenRouter API key (get one at openrouter.ai)
- Internet connection for API access
# Clone the repository
git clone <repository-url>
cd alex-trebench
# Install using uv (recommended)
uv pip install -e .
# Or using pip
pip install -e .
# Set up environment variables
export OPENROUTER_API_KEY="your_api_key_here"
# Or create .env file
echo "OPENROUTER_API_KEY=your_api_key_here" > .env
# Initialize the database
alex init
# Run a quick benchmark (50 questions)
alex benchmark run --model openai/gpt-3.5-turbo --size quick
# Run a standard benchmark (200 questions)
alex benchmark run --model openai/gpt-4 --size standard
# Compare multiple models
alex benchmark compare --models "openai/gpt-3.5-turbo,openai/gpt-4" --size quick
# View benchmark history
alex benchmark history --model openai/gpt-4
# Generate a report
alex benchmark report --run-id 1 --format markdown
┌─────────────────┬──────────┬──────────┬────────────┬─────────────┐
│ Model           │ Accuracy │ Avg Time │ Cost/Query │ Consistency │
├─────────────────┼──────────┼──────────┼────────────┼─────────────┤
│ gpt-4-turbo     │ 73.2%    │ 1,240ms  │ $0.003     │ 0.89        │
│ claude-3-sonnet │ 71.8%    │ 980ms    │ $0.002     │ 0.92        │
│ gpt-3.5-turbo   │ 64.5%    │ 650ms    │ $0.001     │ 0.85        │
└─────────────────┴──────────┴──────────┴────────────┴─────────────┘
Category Performance:
• Science & Technology: GPT-4 (78%) > Claude-3 (75%) > GPT-3.5 (68%)
• History: Claude-3 (74%) > GPT-4 (72%) > GPT-3.5 (63%)
• Literature: GPT-4 (69%) > Claude-3 (67%) > GPT-3.5 (59%)
- Core Infrastructure: Complete project setup with modular architecture
- Data Pipeline: Kaggle integration, preprocessing, and statistical sampling
- Model Integration: OpenRouter API client with support for 20+ models
- Benchmark Engine: Complete benchmarking workflow with async processing
- Evaluation System: Fuzzy answer matching, grading, and metrics calculation
- Database Layer: SQLite with SQLAlchemy ORM and migration support
- CLI Interface: Comprehensive command-line interface with Rich formatting
- Reporting System: Multiple output formats (terminal, markdown, JSON)
- Testing Suite: Unit, integration, and end-to-end tests with 80%+ coverage
- Documentation: Complete user guide and API reference
- Performance Optimization: Memory usage and concurrent processing improvements
- Web Interface: Optional FastAPI-based REST API (future enhancement)
- Advanced Analytics: Trend analysis and model comparison tools
# Quick test with GPT-3.5-turbo
alex benchmark run --model openai/gpt-3.5-turbo --size quick
# Comprehensive evaluation with GPT-4
alex benchmark run \
--model openai/gpt-4 \
--size comprehensive \
--name "GPT-4 Comprehensive Test" \
--report-format markdown \
--output gpt4_report.md
# Compare popular models
alex benchmark compare \
--models "openai/gpt-3.5-turbo,openai/gpt-4,anthropic/claude-3-haiku" \
--size standard \
--concurrent-limit 3
# Generate comparison report
alex benchmark compare \
--models "openai/gpt-4,anthropic/claude-3-sonnet" \
--size quick \
--report-format json \
--output model_comparison.json
# Custom benchmark with specific settings
alex benchmark run \
--model openai/gpt-4 \
--size custom \
--sample-size 500 \
--timeout 120 \
--grading-mode lenient \
--name "Custom Benchmark" \
--description "Testing with custom parameters"
# Initialize dataset
alex data init
# Sample questions by category
alex data sample \
--category "SCIENCE" \
--size 100 \
--output science_questions.json
# View dataset statistics
alex data stats
# List all available models
alex models list
# Test model connectivity
alex models test --model openai/gpt-3.5-turbo
# Estimate costs
alex models costs --model openai/gpt-4 --questions 1000
Verify your alex-treBENCH installation is working correctly:
# Quick verification script
./scripts/quick_test.sh
# Or run the smoke test directly
python scripts/smoke_test.py
# Using Make
make smoke-test
The smoke test provides complete end-to-end verification of the alex-treBENCH system:
- Database initialization: Creates and verifies the database schema
- Sample data loading: Loads test questions into the database
- API connectivity: Tests OpenRouter integration (real or simulated)
- Benchmark execution: Runs a minimal benchmark with 3 questions
- Report generation: Creates and validates performance reports
- System health: Verifies all critical components
Cost: ~$0.001-0.005 per run with API key, $0.00 in simulation mode
# Comprehensive test suite
make test # All tests
make test-coverage # With coverage report
make test-unit # Unit tests only
make test-integration # Integration tests
make test-e2e # End-to-end tests
# Component-specific testing
make test-agents # Individual component tests
python scripts/test_agents.py
alex-treBENCH Smoke Test
Running complete end-to-end system verification

✓ Setting up test environment...
✓ Initializing database...
✓ Loading sample data...
✓ Running minimal benchmark...
✓ Generating report...
✓ Verifying system health...

Smoke Test PASSED
alex-treBENCH system is working correctly!
Tests automatically run on:
- Pull requests to main/develop branches
- Pushes to main/develop branches
- Manual workflow triggers with optional real API testing
See .github/workflows/smoke-test.yml for CI configuration.
For comprehensive testing information, troubleshooting, and advanced test scenarios, see the project's testing documentation, which covers:
- Detailed test agent documentation
- Troubleshooting common issues
- Cost management strategies
- Performance testing
- Writing new tests
- CI/CD integration
- Kaggle: For providing the Jeopardy dataset (aravindram11/jeopardy-dataset-updated)
- OpenRouter: For unified language model API access
- Jeopardy!: For creating the foundational question format that makes this benchmarking meaningful
For questions, issues, or contributions:
- Read the User Guide for detailed usage instructions
- Check the API Reference for technical details
- Create an issue in the GitHub repository
- Review the technical documentation in TECHNICAL_SPEC.md
Implementation Complete: This system is now fully implemented and production-ready. All core features are functional with comprehensive testing and documentation.