Measure how well a raw LLM plays in simple strategic settings before any post-training.
This project implements a benchmark for evaluating LLM performance in zero-sum matrix games. The benchmark computes the "Nash gap" — the difference between what the LLM achieves and the best response payoff against a Nash equilibrium opponent.
- Matrix Game: A two-player zero-sum game with finite action spaces represented as a matrix
- Nash Equilibrium: A pair of mixed strategies where neither player can improve by unilateral deviation
- Nash Gap: The difference between best response value and LLM's value against Nash opponent
- Gap = BR_value - LLM_value ≥ 0
- Gap = 0 means LLM plays optimally
- Larger gaps indicate worse performance
```
.
├── src/
│   ├── __init__.py
│   ├── game_generator.py   # Generate random matrix games
│   ├── nash_solver.py      # Compute Nash equilibrium using LP
│   ├── llm_interface.py    # LLM query interface and response parsing
│   └── benchmark.py        # Benchmark runner and metrics
├── tests/                  # Unit tests
├── notebooks/              # Jupyter notebooks for analysis
├── main.py                 # Main benchmark runner
├── requirements.txt        # Python dependencies
└── README.md               # This file
```
- Create a Python environment (Python 3.8+)
- Install dependencies:

```
pip install -r requirements.txt
```

Optional analysis dependencies (for analyze_performance.py):

```
pip install pandas matplotlib
```

Set your API key and run a combined benchmark (pure + mixed):
```
export TOGETHER_API_KEY="your_key_here"
python main.py --num-games 10 --num-trials 10 --combined --seed 42
```

By default, this writes results to a timestamped folder:

```
results/pure_and_mixed_YYYYMMDD_HHMMSS/
```
To overwrite a fixed folder each run:
```
python main.py --num-games 10 --num-trials 10 --combined --overwrite --seed 42
```

Overwritten results go to:

```
results/pure_and_mixed_latest/
```
For quick testing without API calls:
```
python main.py --num-games 10 --num-trials 10 --llm-type dummy --seed 42
```

Command-line options:

- `--num-games`: Number of games to generate (default: 100)
- `--num-trials`: Number of trials per game (default: 100)
- `--num-rows`: Number of row player actions (default: 3)
- `--num-cols`: Number of column player actions (default: 3)
- `--seed`: Random seed for game generation
- `--llm-seed`: Random seed for the LLM (DummyLLM only)
- `--llm-type`: LLM backend type: `dummy`, `together`, or `openai` (default: `together`)
- `--llm-model`: Specific model name (default: Together Llama 3.1 70B Turbo)
- `--output-dir`: Directory to save results (default: "results")
- `--parallel`: Enable parallel workers for faster execution (especially useful for network-based LLMs)
- `--num-workers`: Number of parallel workers (default: 4; max recommended: 8-16)
- `--combined`: Run both pure action and mixed strategy benchmarks on the same games (uses Together AI)
- `--overwrite`: Overwrite combined results in `results/pure_and_mixed_latest/` (default is a timestamped folder)
Run 100 games × 100 trials with Together AI:
```
export TOGETHER_API_KEY="your_key_here"
python main.py --num-games 100 --num-trials 100 --seed 42
```

Run a smaller experiment with the dummy LLM (for testing):

```
python main.py --num-games 10 --num-trials 10 --llm-type dummy --seed 42
```

Custom game size (2x2 games) with Together AI:

```
export TOGETHER_API_KEY="your_key_here"
python main.py --num-games 50 --num-trials 50 --num-rows 2 --num-cols 2 --seed 123
```

Parallel execution (4-10x faster for network-based LLMs):

```
# Uses 4 parallel workers by default (adjust with --num-workers)
python main.py --num-games 100 --num-trials 100 --llm-type together --parallel --num-workers 8 --seed 42
```

Faster execution tips:

- Reduce trials: `--num-trials 20` (5x faster)
- Use parallelization: `--parallel --num-workers 8` (best for remote LLMs)
- Together AI: `--llm-type together --parallel` (good balance of speed & accuracy)
For each game G with payoff matrix U:
- Generate Game: Create random payoff matrix
- Query LLM: Prompt LLM with game and ask for action (pure or mixed)
- Compute Nash: Calculate Nash equilibrium (π₁*, π₂*)
- Fix Opponent: Opponent plays Nash strategy π₂*
- Measure Gap: Compute gap(G) = V_BR(G) - V_LLM(G)
For each game, the benchmark computes:
- LLM Value: V_LLM(G) = E_{a ~ π_LLM, b ~ π₂*}[U(a, b)]
- Best Response Value: V_BR(G) = max_a E_{b ~ π₂*}[U(a, b)]
- Nash Gap: gap(G) = V_BR(G) - V_LLM(G) ≥ 0
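These quantities can be sketched directly with NumPy. The helper below is a minimal illustration, not the repo's benchmark.py; a pure action is represented as a one-hot probability vector:

```python
import numpy as np

# Illustrative sketch: Nash gap for a row-player decision against a fixed
# Nash column strategy pi_2*. Not the project's actual benchmark code.
def nash_gap(payoff_matrix, llm_strategy, nash_col_strategy):
    """payoff_matrix: rows = row-player actions, columns = opponent actions.
    llm_strategy: probability vector over rows (one-hot for a pure action).
    nash_col_strategy: the Nash mixed strategy pi_2* of the column player."""
    U = np.asarray(payoff_matrix, dtype=float)
    pi2 = np.asarray(nash_col_strategy, dtype=float)
    expected_row_values = U @ pi2                  # E_{b~pi_2*}[U(a, b)] per row a
    v_llm = float(np.asarray(llm_strategy, dtype=float) @ expected_row_values)
    v_br = float(expected_row_values.max())        # best response value
    return v_br - v_llm                            # gap >= 0

# Example: 2x2 game, LLM plays action 0 against Nash column strategy [0.6, 0.4]
U = [[1.5, -2.3], [-0.8, 3.1]]
gap = nash_gap(U, [1.0, 0.0], [0.6, 0.4])  # gap ≈ 0.78
```

Playing the best response (here action 1) gives a gap of 0, matching the "Gap = 0 means the LLM plays optimally" interpretation above.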
Summary statistics across all games:
- Mean, median, std, min, max Nash gap
- Mean LLM and BR values
- Gap ratio (normalized by BR value)
The benchmark writes to the results/ directory.
When running with --combined --overwrite, results are overwritten each run in:
results/pure_and_mixed_latest/
Files:
- games.json - Shared game mapping data (one entry per game)
- Maps game_id to payoff matrix and Nash equilibria
- trials_pure_actions.json - Pure action trials
- Contains: game_id, trial_id, llm_decision, llm_value, best_response_value, nash_gap
- summary_pure_actions.json - Pure action summary statistics
- trials_mixed_strategy.json - Mixed strategy trials
- Contains: game_id, trial_id, llm_decision (probabilities), llm_value, nash_gap
- summary_mixed_strategy.json - Mixed strategy summary statistics
When running with --combined without --overwrite, a timestamped folder is created:
results/pure_and_mixed_YYYYMMDD_HHMMSS/
When running without --combined, a run-specific folder is created:
results/run_YYYYMMDD_HHMMSS/
Files:
- games.json
- trials.json
- summary.json
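As an illustration, the per-trial records can be post-processed outside the benchmark. The helper below is hypothetical (not part of the repo) and assumes only the documented trials.json fields:

```python
import json
import statistics

# Hypothetical helper: aggregate Nash gaps from per-trial records, as stored
# in trials.json or trials_pure_actions.json.
def summarize_gaps(trials):
    gaps = [t["nash_gap"] for t in trials]
    return {
        "mean_nash_gap": statistics.mean(gaps),
        "median_nash_gap": statistics.median(gaps),
        "max_nash_gap": max(gaps),
    }

# Usage with a real results file (path is illustrative):
# with open("results/run_YYYYMMDD_HHMMSS/trials.json") as f:
#     print(summarize_gaps(json.load(f)))

sample = [
    {"game_id": 0, "trial_id": 0, "nash_gap": 0.57},
    {"game_id": 0, "trial_id": 1, "nash_gap": 0.0},
]
stats = summarize_gaps(sample)  # mean 0.285, median 0.285, max 0.57
```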
Game entry:
```
{
  "game_id": 0,
  "payoff_matrix": [[1.5, -2.3], [-0.8, 3.1]],
  "nash_equilibrium_row": [0.45, 0.55],
  "nash_equilibrium_col": [0.60, 0.40]
}
```

Trial entry:
```
{
  "game_id": 0,
  "trial_id": 0,
  "llm_decision": 1,
  "llm_value": -0.42,
  "best_response_value": 0.15,
  "nash_gap": 0.57
}
```

Summary entry:
```
{
  "num_games": 100,
  "num_trials_per_game": 100,
  "total_trials": 10000,
  "mean_nash_gap": 23.45,
  "median_nash_gap": 20.12,
  "std_nash_gap": 15.67,
  "min_nash_gap": 0.05,
  "max_nash_gap": 67.89,
  "mean_llm_value": -5.34,
  "mean_br_value": 18.11
}
```

The benchmark supports multiple LLM backends:
```
export TOGETHER_API_KEY="your_key_here"
python main.py --num-games 100 --num-trials 100 --seed 42
```

- Pros: Hosted inference, no local GPU needed
- Cons: API calls cost money (~$0.001 per request)
- Setup: Get API key from https://www.together.ai
```
python main.py --llm-type dummy --num-games 100 --num-trials 100 --seed 42
```

- Pros: No dependencies, fast testing
- Cons: Random decisions, not a real LLM
```
export OPENAI_API_KEY="your_key_here"
python main.py --llm-type openai --llm-model gpt-3.5-turbo --seed 42
```

- Pros: State-of-the-art models
- Cons: Costs per request (~$0.001-0.01 per game)
- Setup: Get API key from https://platform.openai.com
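Whatever the backend, the raw completion text must be turned into an action index or a probability vector. Below is a minimal parsing sketch; the repo's actual logic lives in src/llm_interface.py and likely differs (the note under Implementation Notes warns that parsing is basic):

```python
import re

# Illustrative parsers, not the repo's implementation.
def parse_pure_action(text, num_actions):
    """Return the first integer in [0, num_actions) found in the reply, else None."""
    for m in re.finditer(r"-?\d+", text):
        a = int(m.group())
        if 0 <= a < num_actions:
            return a
    return None  # unparseable reply

def parse_mixed_strategy(text, num_actions):
    """Extract num_actions numbers and renormalize them into probabilities."""
    nums = [float(m.group()) for m in re.finditer(r"-?\d+(?:\.\d+)?", text)]
    if len(nums) < num_actions:
        return None
    probs = nums[:num_actions]
    total = sum(probs)
    if total <= 0 or any(p < 0 for p in probs):
        return None
    return [p / total for p in probs]

action = parse_pure_action("I choose action 1 because ...", num_actions=3)  # -> 1
probs = parse_mixed_strategy("Play [0.7, 0.3]", num_actions=2)
```

Returning None on unparseable replies lets the caller decide whether to retry the query or record the trial as a failure.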
To run a controlled benchmark matrix over game families and model tiers (A/B/C), use:
```
export TOGETHER_API_KEY="your_key_here"
python run_tiered_research_suite.py --num-games-per-bucket 30 --num-trials 20 --seeds 42
```

You can also control providers and the game-family subset explicitly:

```
export TOGETHER_API_KEY="your_key_here"
export OPENAI_API_KEY="your_key_here"
python run_tiered_research_suite.py --providers together openai --bucket-ids 2x2_lowVar_pure 3x3_midVar_mixed 4x4_highVar_mixed --num-games-per-bucket 8 --num-trials 4 --seeds 42 --openai-model-tiers-json '{"A":["gpt-4o"],"B":["gpt-4o-mini"],"C":[]}'
```

Outputs are written to:

```
results/tiered_suite_YYYYMMDD_HHMMSS/
```
Key artifacts:
- big_table_all_runs.csv: one row per seed × provider × bucket × model × mode
- big_table_aggregated.csv: means/std across seeds (main report table)
- big_table_aggregated.md: markdown version of the aggregated table
- suite_metadata.json: controlled variables (providers, seeds, temperature, buckets, tier definitions)
For a local dry-run without API calls:
```
python run_tiered_research_suite.py --use-dummy --num-games-per-bucket 5 --num-trials 3 --seeds 42
```

Run unit tests:

```
PYTHONPATH=. python tests/test_core.py
```

- Nash equilibrium computation uses linear programming (scipy.optimize.linprog)
- Supports both pure action and mixed strategy LLM outputs
- Results are deterministic when seeds are fixed
- Response parsing is basic; you may need to customize for specific LLMs
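The LP formulation behind the solver can be sketched as follows. This is an illustrative maximin LP in the spirit of src/nash_solver.py, not its actual code: the row player maximizes a value v subject to earning at least v against every opponent column.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative sketch: solve a zero-sum game by linear programming.
def solve_zero_sum_row(U):
    """Return the row player's maximin mixed strategy and the game value.
    The column player's Nash strategy can be obtained from the transposed game."""
    U = np.asarray(U, dtype=float)
    m, n = U.shape
    # Variables z = [x_1..x_m, v]; maximize v  <=>  minimize -v
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column b: v - sum_a U[a, b] * x_a <= 0
    A_ub = np.hstack([-U.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to 1 (v is unconstrained)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    x = res.x[:m]
    return x / x.sum(), float(res.x[-1])

# Matching pennies: uniform strategies, game value 0
strategy, value = solve_zero_sum_row([[1, -1], [-1, 1]])  # strategy ≈ [0.5, 0.5], value ≈ 0
```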
- More sophisticated response parsing
- Analysis and visualization notebooks
- Support for non-zero-sum games
- Batch game generation with specific properties
- Sensitivity analysis
- Zero-sum games and Nash equilibrium: https://en.wikipedia.org/wiki/Zero-sum_game
- Mixed strategy Nash equilibrium: https://en.wikipedia.org/wiki/Nash_equilibrium
TODO: Add license information