# BoxingGym

Kanishk Gandhi*, Michael Y. Li*, Lyle Goodyear, Agam Bhatia, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman

> "To understand a system, you must perturb it."
>
> – George Box
BoxingGym is a benchmarking framework designed to evaluate the capabilities of language-based agents in experimental design and model discovery. The framework consists of several simulated environments where agents can perform experiments, propose models, and refine them based on collected data.
- Key Features
- Installation
- Available Environments
- Environment Implementation Example
- Agent Evaluation Process
- Metrics and Evaluation
- Running Experiments
- Configuration System
- Analysis Tools
- Contributing
## Key Features

- 10+ Diverse Environments: From physical systems (Lotka-Volterra predator-prey) to psychological models (temporal discounting)
- Multiple Goal Types: Direct prediction, parameter estimation, and system identification
- Box's Loop Integration: Automated statistical model building using PyMC
- EIG-based Evaluation: Quantitative assessment of experimental design quality
- Flexible Agent Interface: Support for any LLM-based agent
- Comprehensive Metrics: Accuracy, MSE, EIG regret, and more
## Installation

```bash
git clone https://github.com/kanishkg/boxing-gym.git
cd boxing-gym
pip install -e .
```

Requirements:
- Python >= 3.11
- PyMC for probabilistic modeling
- OpenAI/Anthropic API keys for LLM agents
## Available Environments

BoxingGym includes the following environments, each with multiple goal configurations:
| Environment | Description | Input Space | Output Space |
|---|---|---|---|
| Hyperbolic Temporal Discount | Models human decision-making between immediate and delayed rewards | (immediate_reward, delayed_reward, delay_days) | Binary choice (0/1) |
| Location Finding | Signal source localization in n-dimensional space | n-dimensional coordinates | Signal intensity |
| Death Process | Disease spread modeling in a population | Time | Number of infected |
| IRT (Item Response Theory) | Student-question performance modeling | (student_id, question_id) | Correctness (0/1) |
| Survival Analysis | Breast cancer patient survival prediction | (metastasized, time_since_surgery) | Survival status (0/1) |
| Dugongs | Sea cow growth modeling | Age | Length |
| Peregrines | Falcon population dynamics | Time | Population count |
| Lotka-Volterra | Predator-prey population dynamics | Time | (prey_count, predator_count) |
| Moral Machines | Ethical decision-making in autonomous vehicles | (group1, group2, intervention) | Choice (1/2) |
| Emotion | Emotion prediction from gambling outcomes | (prizes, probabilities, outcome) | Emotion ratings |
## Environment Implementation Example

Let's examine the Hyperbolic Temporal Discount environment in detail to understand how environments work:
```python
import numpy as np
from scipy.stats import halfnorm

class TemporalDiscount:
    def __init__(self, epsilon=0.01, k_mean=-4.25, k_std=0.5, alpha_scale=2):
        # Parameters define the prior distributions
        self.epsilon = epsilon          # Noise (lapse) parameter
        self.k_mean = k_mean            # Mean of log-normal discount factor
        self.k_std = k_std              # Std of log-normal discount factor
        self.alpha_scale = alpha_scale  # Scale for decision noise
        self.reset()

    def reset(self):
        # Sample true parameters from the prior
        log_k = np.random.normal(self.k_mean, self.k_std)
        k = np.exp(log_k)                             # Discount factor
        alpha = halfnorm.rvs(scale=self.alpha_scale)  # Decision noise
        self.truth = (k, alpha)
        self.observed_data = []
```

The environment models how people choose between immediate and delayed rewards:
```python
    def step(self, iR, dR, Days):
        k, alpha = self.truth
        # Calculate subjective values
        V0 = iR                   # Value of immediate reward
        V1 = dR / (1 + k * Days)  # Discounted value of delayed reward
        # Probabilistic choice using a probit model
        z = (V1 - V0) / alpha
        probability = self.epsilon + (1 - 2 * self.epsilon) * norm.cdf(z)
        choice = np.random.binomial(n=1, p=probability)
        return choice  # 1 = choose delayed, 0 = choose immediate
```

Agents interact through natural language with optional prior knowledge:
```python
    def generate_system_message(self, include_prior=True, goal=None):
        if include_prior:
            # Provide context about human decision-making
            message = """A person has to choose between a delayed reward dR dollars
            in x days and an immediate reward iR dollars today.
            Your goal is to predict their choices.
            Make observations by specifying [iR, dR, D]."""
        else:
            # Abstract version without context
            message = """You are observing a binary response for a tuple of three
            positive integer values. Make observations by specifying [int1, int2, int3]."""
        return message
```

Each environment supports multiple goals:
- DirectGoal: Predict outcomes for new inputs
- DiscountGoal: Estimate the discount factor k
- DirectGoalNaive: Explain findings to another agent
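To make the choice model above concrete, here is a small self-contained sketch that mirrors `step` using only the standard library (the parameter values below are illustrative, not defaults from the benchmark):

```python
import math
import random

def choice_probability(iR, dR, days, k, alpha, epsilon=0.01):
    """P(choose delayed) under the hyperbolic probit model."""
    V0 = iR                   # immediate reward is taken at face value
    V1 = dR / (1 + k * days)  # delayed reward is hyperbolically discounted
    z = (V1 - V0) / alpha
    # Standard normal CDF via erf, with lapse rate epsilon
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return epsilon + (1 - 2 * epsilon) * cdf

# With k = 0.01, $100 in 7 days is worth 100 / 1.07, about 93.5 "today-dollars",
# so a simulated person will usually prefer waiting over taking $50 now.
p = choice_probability(iR=50, dR=100, days=7, k=0.01, alpha=2)
choice = int(random.random() < p)  # 1 = wait, 0 = take the $50 now
```

Note how small `k` values make the agent patient: the discount divisor `1 + k * days` barely shrinks the delayed reward, so the probit argument is large and positive.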
## Agent Evaluation Process

The evaluation process follows these steps:
```python
# Create environment and goal
env = TemporalDiscount()
goal = DirectGoal(env)

# Initialize agent with system message
agent = LMExperimenter(model_name="gpt-4o")
system_message = goal.get_system_message(include_prior=True)
agent.set_system_message(system_message)

# Experimentation loop
for i in range(num_experiments):
    # Agent designs experiment
    observation_request = agent.generate_actions(previous_result)
    # Example: "[50, 100, 7]" (choose between $50 now or $100 in 7 days)

    # Environment provides result
    result, success = env.run_experiment(observation_request)
    # Result: 1 (person chose to wait)

    # Store for analysis
    observations.append((observation_request, result))

# Generate evaluation questions
for _ in range(num_evals):
    question, ground_truth = goal.get_goal_eval_question(include_prior)
    # Example: "What choice for iR=75, dR=100, D=5?"
    prediction = agent.generate_predictions(question)
    predictions.append(prediction)
    ground_truths.append(ground_truth)

# Calculate metrics
accuracy, std = goal.evaluate_predictions(predictions, ground_truths)
```

## Metrics and Evaluation

BoxingGym uses several metrics to evaluate agent performance:
### Predictive Error

Predictive error captures how well an agent can forecast the outcomes of new, unseen trials after it has finished experimenting. For every evaluation question we compare the agent's prediction $\hat{y}$ with the environment-generated ground truth $y$.

- We report mean squared error (MSE) together with its standard error for all environments, whether binary, count, or continuous. Lower is better.
A concise reference implementation looks like this:
```python
import numpy as np

def predictive_error(predictions, ground_truths):
    preds = np.asarray(predictions, dtype=float)
    gts = np.asarray(ground_truths, dtype=float)
    sq_errors = (preds - gts) ** 2
    mse = np.mean(sq_errors)
    sem = np.std(sq_errors) / np.sqrt(len(gts))  # standard error of the MSE estimate
    return mse, sem
```

This metric provides a direct measure of how well the agent has learned the underlying input-output relationship, independent of the optimality of its experimental designs.
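For instance, a pure-Python equivalent of `predictive_error` applied to a toy set of binary predictions (the values are illustrative):

```python
import math

def predictive_error(predictions, ground_truths):
    # Squared error per evaluation question
    sq_errors = [(p - g) ** 2 for p, g in zip(predictions, ground_truths)]
    n = len(sq_errors)
    mse = sum(sq_errors) / n
    # Standard error of the mean squared error
    sem = math.sqrt(sum((e - mse) ** 2 for e in sq_errors) / n) / math.sqrt(n)
    return mse, sem

# Four binary predictions, one of which misses the ground truth
mse, sem = predictive_error([1, 0, 1, 1], [1, 0, 0, 1])
print(mse)  # 0.25 (one squared error of 1 across four trials)
```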
### Expected Information Gain (EIG)

For OED tasks, we calculate the expected information gain of each experiment:
```python
def expected_information_gain(self, query_point, num_outer_samples=1000):
    # EIG = E_y[ KL( p(theta | y, x) || p(theta) ) ]
    # Estimated using nested Monte Carlo

    # 1. Sample parameters from the posterior given existing data
    posterior_samples = self.get_posterior_samples(existing_data)

    # 2. For each parameter sample, compare conditional and marginal likelihoods
    eig_samples = []
    for theta in posterior_samples:
        # Likelihood of observing y given theta and the query x
        y_sample = self.simulate_outcome(query_point, theta)
        log_likelihood = self.log_prob(y_sample, theta, query_point)
        # Marginal likelihood p(y | x)
        log_marginal = self.estimate_marginal_likelihood(y_sample, query_point)
        eig_samples.append(log_likelihood - log_marginal)

    return np.mean(eig_samples)
```

### EIG Regret

EIG regret compares the agent's experimental choices to optimal designs:
```python
# For each agent query
agent_eig = goal.expected_information_gain(agent_query)

# Find the best query among random candidates
optimal_query = max(random_queries, key=lambda q: goal.expected_information_gain(q))
optimal_eig = goal.expected_information_gain(optimal_query)

# Regret = missed information
regret = optimal_eig - agent_eig
```

### Communication Score

This metric is computed only when `experiment_type="discovery"`, i.e. the direct discovery goal variants.
It measures how effectively a scientist agent can convey its understanding of the environment to a naive agent that is not allowed to run any experiments:

- After finishing its experiments, the scientist receives a prompt from `Goal.get_comm_prompt` (respecting the `com_limit` word budget) and produces a natural-language explanation.
- The naive agent is initialized with a system prompt containing `Goal.get_naive_system_message` followed only by the scientist's explanation; it does not see the raw data or experiment log.
- The naive agent is then evaluated on the standard prediction questions using `evaluate(...)`. Its accuracy (and standard deviation) constitute the communication score.
```python
# 1. Scientist writes explanation
prompt = goal.get_comm_prompt(com_limit=200, include_prior=True)
explanation = scientist.prompt_llm(prompt)

# 2. Build naive agent
system_msg = goal.get_naive_system_message(include_prior=True) + explanation
naive_agent.set_system_message(system_msg)

# 3. Evaluate naive agent
(comm_acc, comm_std), _, _, _ = evaluate(
    final_results, goal, naive_agent, num_evals, include_prior=True
)
```

Higher accuracy indicates clearer, more informative explanations produced by the scientist agent.
## Running Experiments

```bash
python run_experiment.py \
    seed=1 \
    llms=openai \
    include_prior=true \
    exp=oed \
    envs=hyperbolic_direct
```

Use the provided shell scripts:
```bash
# Run all hyperbolic discount experiments
bash scripts/hyperbolic.sh
```

```bash
# Run with Box's Loop for model building
python run_experiment.py \
    seed=1 \
    llms=openai \
    include_prior=true \
    exp=oed \
    envs=hyperbolic_direct \
    use_ppl=true
```

```bash
# First run experiments
python run_experiment.py ...

# Then calculate EIG regret (num_random = number of samples for the MC estimate)
python run_eig_regret.py \
    seed=1 \
    num_random=100 \
    box=false
```

## Configuration System

BoxingGym uses Hydra for configuration management:
```
conf/
├── config.yaml    # Main configuration
├── llms/          # LLM configurations
│   └── openai.yaml
├── exp/           # Experiment types
│   ├── oed.yaml
│   └── discovery.yaml
└── envs/          # Environment configs
    ├── hyperbolic_direct.yaml
    └── ...
```
`conf/envs/hyperbolic_direct.yaml`:

```yaml
num_evals: 10        # Number of evaluation questions
env_name: "hyperbolic_temporal_discount"
goal_name: "direct"  # Goal type
com_limit: 200       # Word limit for explanations
env_params:          # Environment-specific parameters
  epsilon: 0.01
  k_mean: -4.25
  k_std: 0.5
  alpha_scale: 2
```

```yaml
# conf/exp/my_experiment.yaml
num_experiments: [0, 5, 10, 20]  # Experiment budgets to test
experiment_type: "oed"
```

Results are saved as JSON files:
```
results/
└── hyperbolic_temporal_discount/
    ├── direct_gpt-4o_oed_true_1.json         # Main results
    └── regret_direct_gpt-4o_oed_true_1.json  # EIG analysis
```
```json
{
  "config": {...},
  "data": {
    "results": [[accuracy, std], ...],
    "queries": ["[10, 20, 5]", ...],
    "observations": [1, 0, ...],
    "successes": [true, true, ...],
    "eigs": [0.23, 0.18, ...],
    "programs": ["PyMC model code..."]  // If using Box's Loop
  },
  "scientist_messages": [...],
  "naive_messages": [...]  // For discovery tasks
}
```

## Analysis Tools

The `analysis/` directory contains Jupyter notebooks for:
- Plotting learning curves across number of experiments
- Comparing agent performance
- Visualizing EIG Regret
- Statistical significance testing
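As a starting point for such analyses, the sketch below loads a results payload following the JSON schema shown above; the dictionary is inlined here (with made-up numbers) rather than read from `results/`:

```python
import json

# A made-up results payload following the documented schema
raw = json.dumps({
    "config": {"env_name": "hyperbolic_temporal_discount"},
    "data": {
        "results": [[0.55, 0.05], [0.70, 0.04], [0.80, 0.03]],
        "queries": ["[10, 20, 5]", "[50, 100, 7]", "[5, 200, 30]"],
        "eigs": [0.23, 0.18, 0.31],
    },
})

results = json.loads(raw)
# One (accuracy, std) pair per evaluation checkpoint: the learning curve
accuracies = [acc for acc, _std in results["data"]["results"]]
mean_eig = sum(results["data"]["eigs"]) / len(results["data"]["eigs"])
print(accuracies)  # [0.55, 0.7, 0.8]
```

In a real notebook, `raw` would instead come from `open(path).read()` for each file under `results/`.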
When use_ppl=true, the system uses Box's Loop for automated model building:
- Data Collection: Agent experiments generate dataset
- Model Proposal: LLM proposes PyMC statistical models
- Model Evaluation: Models scored using LOO-CV
- Model Criticism: LLM analyzes discrepancies
- Iteration: Process repeats with refinements
This provides agents with explicit probabilistic models to guide experimentation.
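The model-scoring step can be illustrated without PyMC. The sketch below fits nothing; it simply compares two candidate discounting curves (hyperbolic vs. exponential, both hypothetical stand-ins for LLM-proposed PyMC programs) by their log likelihood on held-out choices, a simplified stand-in for LOO-CV:

```python
import math

def choice_loglik(data, discount, k, alpha=2.0, eps=0.01):
    """Log likelihood of observed choices under a given discount curve."""
    total = 0.0
    for iR, dR, days, choice in data:
        v1 = discount(dR, days, k)
        # Probit choice probability with lapse rate eps, as in the environment
        p = eps + (1 - 2 * eps) * 0.5 * (1 + math.erf((v1 - iR) / alpha / math.sqrt(2)))
        total += math.log(p if choice == 1 else 1 - p)
    return total

hyperbolic = lambda dR, days, k: dR / (1 + k * days)
exponential = lambda dR, days, k: dR * math.exp(-k * days)

# Held-out choices (illustrative): (iR, dR, days, observed_choice)
held_out = [(50, 100, 7, 1), (80, 100, 30, 0), (20, 100, 90, 1)]

scores = {
    "hyperbolic": choice_loglik(held_out, hyperbolic, k=0.01),
    "exponential": choice_loglik(held_out, exponential, k=0.01),
}
best = max(scores, key=scores.get)  # the candidate the loop would refine next
```

In the actual pipeline, this comparison is done with LOO-CV over posterior predictive densities rather than a single point estimate of `k`.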
## Contributing

We welcome contributions! Areas of interest:
- New Environments: Implement novel experimental domains
- Goal Types: Design new evaluation objectives
- Agent Strategies: Develop improved experimental design algorithms
- Analysis Tools: Create visualization and statistical tools
Please follow the environment and goal interfaces described above, and include comprehensive tests and documentation.
If you want to contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature-branch`).
3. Make your changes.
4. Commit your changes (`git commit -am 'Add new feature'`).
5. Push to the branch (`git push origin feature-branch`).
6. Create a new Pull Request.

For major changes, please open an issue first to discuss what you would like to change.