Skip to content

zehao-zhao/reasoning_in_game

Repository files navigation

LLM Game Theory Benchmark

Measure how well a raw LLM plays in simple strategic settings before any post-training.

Project Overview

This project implements a benchmark for evaluating LLM performance in zero-sum matrix games. The benchmark computes the "Nash gap" — the difference between what the LLM achieves and the best response payoff against a Nash equilibrium opponent.

Key Concepts

  • Matrix Game: A two-player zero-sum game with finite action spaces represented as a matrix
  • Nash Equilibrium: A pair of mixed strategies where neither player can improve by unilateral deviation
  • Nash Gap: The difference between best response value and LLM's value against Nash opponent
    • Gap = BR_value - LLM_value ≥ 0
    • Gap = 0 means LLM plays optimally
    • Larger gaps indicate worse performance

Project Structure

.
├── src/
│   ├── __init__.py
│   ├── game_generator.py      # Generate random matrix games
│   ├── nash_solver.py         # Compute Nash equilibrium using LP
│   ├── llm_interface.py       # LLM query interface and response parsing
│   └── benchmark.py           # Benchmark runner and metrics
├── tests/                      # Unit tests
├── notebooks/                  # Jupyter notebooks for analysis
├── main.py                     # Main benchmark runner
├── requirements.txt            # Python dependencies
└── README.md                   # This file

Installation

  1. Create a Python environment (Python 3.8+)
  2. Install dependencies:
pip install -r requirements.txt

Optional analysis dependencies (for analyze_performance.py):

pip install pandas matplotlib

Quick Start

Using Together AI (Hosted Llama 3)

Set your API key and run a combined benchmark (pure + mixed):

export TOGETHER_API_KEY="your_key_here"
python main.py --num-games 10 --num-trials 10 --combined --seed 42

By default, this writes results to a timestamped folder:

results/pure_and_mixed_YYYYMMDD_HHMMSS/

To overwrite a fixed folder each run:

python main.py --num-games 10 --num-trials 10 --combined --overwrite --seed 42

Overwritten results go to:

results/pure_and_mixed_latest/

Using Dummy LLM (Random baseline)

For quick testing without API calls:

python main.py --num-games 10 --num-trials 10 --llm-type dummy --seed 42

Command Line Options

  • --num-games: Number of games to generate (default: 100)
  • --num-trials: Number of trials per game (default: 100)
  • --num-rows: Number of row player actions (default: 3)
  • --num-cols: Number of column player actions (default: 3)
  • --seed: Random seed for game generation
  • --llm-seed: Random seed for LLM (DummyLLM only)
  • --llm-type: LLM backend type: dummy, together, openai (default: together)
  • --llm-model: Specific model name (default: Together Llama 3.1 70B Turbo)
  • --output-dir: Directory to save results (default: "results")
  • --parallel: Enable parallel workers for faster execution (especially useful for network-based LLMs)
  • --num-workers: Number of parallel workers (default: 4, max recommended: 8-16)
  • --combined: Run both pure action and mixed strategy benchmarks on the same games (uses Together AI)
  • --overwrite: Overwrite combined results in results/pure_and_mixed_latest/ (default is a timestamped folder)

Examples

Run 100 games × 100 trials with Together AI:

export TOGETHER_API_KEY="your_key_here"
python main.py --num-games 100 --num-trials 100 --seed 42

Run smaller experiment with dummy LLM (for testing):

python main.py --num-games 10 --num-trials 10 --llm-type dummy --seed 42

Custom game size (2x2 games) with Together AI:

export TOGETHER_API_KEY="your_key_here"
python main.py --num-games 50 --num-trials 50 --num-rows 2 --num-cols 2 --seed 123

Parallel execution (4-10x faster for network-based LLMs):

# Uses 4 parallel workers (adjust with --num-workers)
python main.py --num-games 100 --num-trials 100 --llm-type together --parallel --num-workers 8 --seed 42

Faster execution tips:

  • Reduce trials: --num-trials 20 (5x faster)
  • Use parallelization: --parallel --num-workers 8 (best for remote LLMs)
  • Together AI: --llm-type together --parallel (good balance of speed & accuracy)

Benchmark Protocol

For each game G with payoff matrix U:

  1. Generate Game: Create random payoff matrix
  2. Query LLM: Prompt LLM with game and ask for action (pure or mixed)
  3. Compute Nash: Calculate Nash equilibrium (π₁*, π₂*)
  4. Fix Opponent: Opponent plays Nash strategy π₂*
  5. Measure Gap: Compute gap(G) = V_BR(G) - V_LLM(G)

Metrics

For each game, the benchmark computes:

  • LLM Value: V_LLM(G) = E[a ~ π_LLM, b ~ π₂*][U(a,b)]
  • Best Response Value: V_BR(G) = max_a E[b ~ π₂*][U(a,b)]
  • Nash Gap: gap(G) = V_BR(G) - V_LLM(G) ≥ 0

Summary statistics across all games:

  • Mean, median, std, min, max Nash gap
  • Mean LLM and BR values
  • Gap ratio (normalized by BR value)

Output

The benchmark writes to the results/ directory.

Combined Benchmark Output (Recommended)

When running with --combined --overwrite, results are overwritten each run in:

results/pure_and_mixed_latest/

Files:

  1. games.json - Shared game mapping data (100 entries)
  • Maps game_id to payoff matrix and Nash equilibria
  1. trials_pure_actions.json - Pure action trials
  • Contains: game_id, trial_id, llm_decision, llm_value, best_response_value, nash_gap
  1. summary_pure_actions.json - Pure action summary statistics

  2. trials_mixed_strategy.json - Mixed strategy trials

  • Contains: game_id, trial_id, llm_decision (probabilities), llm_value, nash_gap
  1. summary_mixed_strategy.json - Mixed strategy summary statistics

When running with --combined without --overwrite, a timestamped folder is created:

results/pure_and_mixed_YYYYMMDD_HHMMSS/

Single-Mode Output

When running without --combined, a run-specific folder is created:

results/run_YYYYMMDD_HHMMSS/

Files:

  1. games.json
  2. trials.json
  3. summary.json

Example Output Format

Game entry:

{
  "game_id": 0,
  "payoff_matrix": [[1.5, -2.3], [-0.8, 3.1]],
  "nash_equilibrium_row": [0.45, 0.55],
  "nash_equilibrium_col": [0.60, 0.40]
}

Trial entry:

{
  "game_id": 0,
  "trial_id": 0,
  "llm_decision": 1,
  "llm_value": -0.42,
  "best_response_value": 0.15,
  "nash_gap": 0.57
}

Summary entry:

{
  "num_games": 100,
  "num_trials_per_game": 100,
  "total_trials": 10000,
  "mean_nash_gap": 23.45,
  "median_nash_gap": 20.12,
  "std_nash_gap": 15.67,
  "min_nash_gap": 0.05,
  "max_nash_gap": 67.89,
  "mean_llm_value": -5.34,
  "mean_br_value": 18.11
}

Using Different LLM Backends

The benchmark supports multiple LLM backends:

1. Together AI (Hosted Llama 3) - Recommended

export TOGETHER_API_KEY="your_key_here"
python main.py --num-games 100 --num-trials 100 --seed 42
  • Pros: Hosted inference, no local GPU needed
  • Cons: API calls cost money (~$0.001 per request)
  • Setup: Get API key from https://www.together.ai

2. DummyLLM (Random Baseline)

python main.py --llm-type dummy --num-games 100 --num-trials 100 --seed 42
  • Pros: No dependencies, fast testing
  • Cons: Random decisions, not a real LLM

3. OpenAI (GPT-4 / GPT-3.5)

export OPENAI_API_KEY="your_key_here"
python main.py --llm-type openai --llm-model gpt-3.5-turbo --seed 42

Tiered Research Suite (Model Tiers × Game Buckets)

To run a controlled benchmark matrix over game families and model tiers (A/B/C), use:

export TOGETHER_API_KEY="your_key_here"
python run_tiered_research_suite.py --num-games-per-bucket 30 --num-trials 20 --seeds 42

You can now control providers and game-family subset explicitly:

export TOGETHER_API_KEY="your_key_here"
export OPENAI_API_KEY="your_key_here"
python run_tiered_research_suite.py   --providers together openai   --bucket-ids 2x2_lowVar_pure 3x3_midVar_mixed 4x4_highVar_mixed   --num-games-per-bucket 8   --num-trials 4   --seeds 42   --openai-model-tiers-json '{"A":["gpt-4o"],"B":["gpt-4o-mini"],"C":[]}'

Outputs are written to:

results/tiered_suite_YYYYMMDD_HHMMSS/

Key artifacts:

  • big_table_all_runs.csv: one row per seed × provider × bucket × model × mode
  • big_table_aggregated.csv: means/std across seeds (main report table)
  • big_table_aggregated.md: markdown version of the aggregated table
  • suite_metadata.json: controlled variables (providers, seeds, temperature, buckets, tier definitions)

For a local dry-run without API calls:

python run_tiered_research_suite.py --use-dummy --num-games-per-bucket 5 --num-trials 3 --seeds 42

Testing

Run unit tests:

PYTHONPATH=. python tests/test_core.py

Notes

  • Nash equilibrium computation uses linear programming (scipy.optimize.linprog)
  • Supports both pure action and mixed strategy LLM outputs
  • Results are deterministic when seeds are fixed
  • Response parsing is basic; you may need to customize for specific LLMs

Future Work

  • More sophisticated response parsing
  • Analysis and visualization notebooks
  • Support for non-zero-sum games
  • Batch game generation with specific properties
  • Sensitivity analysis

References

License

TODO: Add license information

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors