Skip to content

kanishkg/boxing-gym

Repository files navigation

🥊 BoxingGym 🥊

Kanishk Gandhi*, Michael Y. Li*, Lyle Goodyear, Agam Bhatia,
Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman

Stanford University

Version License arXiv

“To understand a system, you must perturb it.”
– George Box

BoxingGym is a benchmarking framework designed to evaluate the capabilities of language-based agents in experimental design and model discovery. The framework consists of several simulated environments where agents can perform experiments, propose models, and refine them based on collected data.

📚 Table of Contents

Key Features

  • 10+ Diverse Environments: From physical systems (Lotka-Volterra predator-prey) to psychological models (temporal discounting)
  • Multiple Goal Types: Direct prediction, parameter estimation, and system identification
  • Box's Loop Integration: Automated statistical model building using PyMC
  • EIG-based Evaluation: Quantitative assessment of experimental design quality
  • Flexible Agent Interface: Support for any LLM-based agent
  • Comprehensive Metrics: Accuracy, MSE, EIG regret, and more

Installation

git clone https://github.com/kanishkg/boxing-gym.git
cd boxing-gym
pip install -e .

Requirements:

  • Python >= 3.11
  • PyMC for probabilistic modeling
  • OpenAI/Anthropic API keys for LLM agents

Available Environments

BoxingGym includes the following environments, each with multiple goal configurations:

Environment Description Input Space Output Space
Hyperbolic Temporal Discount Models human decision-making between immediate and delayed rewards (immediate_reward, delayed_reward, delay_days) Binary choice (0/1)
Location Finding Signal source localization in n-dimensional space n-dimensional coordinates Signal intensity
Death Process Disease spread modeling in a population Time Number of infected
IRT (Item Response Theory) Student-question performance modeling (student_id, question_id) Correctness (0/1)
Survival Analysis Breast cancer patient survival prediction (metastasized, time_since_surgery) Survival status (0/1)
Dugongs Sea cow growth modeling Age Length
Peregrines Falcon population dynamics Time Population count
Lotka-Volterra Predator-prey population dynamics Time (prey_count, predator_count)
Moral Machines Ethical decision-making in autonomous vehicles (group1, group2, intervention) Choice (1/2)
Emotion Emotion prediction from gambling outcomes (prizes, probabilities, outcome) Emotion ratings

Environment Implementation Example

Let's examine the Hyperbolic Temporal Discount environment in detail to understand how environments work:

1. Environment Structure

class TemporalDiscount:
    def __init__(self, epsilon=0.01, k_mean=-4.25, k_std=0.5, alpha_scale=2):
        # Parameters define the prior distributions
        self.epsilon = epsilon  # Noise parameter
        self.k_mean = k_mean    # Mean of log-normal discount factor
        self.k_std = k_std      # Std of log-normal discount factor
        self.alpha_scale = alpha_scale  # Scale for decision noise
        self.reset()
        
    def reset(self):
        # Sample true parameters from prior
        log_k = np.random.normal(self.k_mean, self.k_std)
        k = np.exp(log_k)  # Discount factor
        alpha = halfnorm.rvs(scale=self.alpha_scale)  # Decision noise
        self.truth = (k, alpha)
        self.observed_data = []

2. System Dynamics

The environment models how people choose between immediate and delayed rewards:

def step(self, iR, dR, Days):
    k, alpha = self.truth
    
    # Calculate subjective values
    V0 = iR  # Value of immediate reward
    V1 = dR / (1 + k * Days)  # Discounted value of delayed reward
    
    # Probabilistic choice using probit model
    z = (V1 - V0) / alpha
    probability = self.epsilon + (1 - 2 * self.epsilon) * norm.cdf(z)
    choice = np.random.binomial(n=1, p=probability)
    return choice  # 1 = choose delayed, 0 = choose immediate

3. Agent Interaction

Agents interact through natural language with optional prior knowledge:

def generate_system_message(self, include_prior=True, goal=None):
    if include_prior:
        # Provide context about human decision-making
        message = """A person has to choose between a delayed reward dR dollars 
        in x days and an immediate reward iR dollars today.
        Your goal is to predict their choices.
        Make observations by specifying [iR, dR, D]."""
    else:
        # Abstract version without context
        message = """You are observing a binary response for a tuple of three 
        positive integer values. Make observations by specifying [int1, int2, int3]."""
    return message

4. Goal Types

Each environment supports multiple goals:

  • DirectGoal: Predict outcomes for new inputs
  • DiscountGoal: Estimate the discount factor k
  • DirectGoalNaive: Explain findings to another agent

Agent Evaluation Process

The evaluation process follows these steps:

1. Initialization

# Create environment and goal
env = TemporalDiscount()
goal = DirectGoal(env)

# Initialize agent with system message
agent = LMExperimenter(model_name="gpt-4o")
system_message = goal.get_system_message(include_prior=True)
agent.set_system_message(system_message)

2. Experimentation Loop

for i in range(num_experiments):
    # Agent designs experiment
    observation_request = agent.generate_actions(previous_result)
    # Example: "[50, 100, 7]" (choose between $50 now or $100 in 7 days)
    
    # Environment provides result
    result, success = env.run_experiment(observation_request)
    # Result: 1 (person chose to wait)
    
    # Store for analysis
    observations.append((observation_request, result))

3. Evaluation Phase

# Generate evaluation questions
for _ in range(num_evals):
    question, ground_truth = goal.get_goal_eval_question(include_prior)
    # Example: "What choice for iR=75, dR=100, D=5?"
    
    prediction = agent.generate_predictions(question)
    predictions.append(prediction)
    ground_truths.append(ground_truth)

# Calculate metrics
accuracy, std = goal.evaluate_predictions(predictions, ground_truths)

Metrics and Evaluation

BoxingGym uses several metrics to evaluate agent performance:

1. Predictive Error

Predictive error captures how well an agent can forecast the outcome of new, unseen trials after it has finished experimenting. For every evaluation question we compare the agent’s prediction (\hat{y}) with the environment-generated ground-truth (y).

• We report mean-squared-error (MSE) together with its standard deviation for all environments—binary, count, or continuous. Lower is better.

A concise reference implementation looks like this:

import numpy as np

def predictive_error(predictions, ground_truths):
    preds = np.asarray(predictions, dtype=float)
    gts = np.asarray(ground_truths, dtype=float)

    mse = np.mean((preds - gts) ** 2)
    std = np.std((preds - gts) ** 2) / np.sqrt(len(gts))
    return mse, std

This metric provides a direct measure of how well the agent has learned the underlying input-output relationship, independent of the optimality of its experimental designs.

2. Expected Information Gain (EIG)

For OED tasks, we calculate the information gain of each experiment:

def expected_information_gain(self, query_point, num_outer_samples=1000):
    # EIG = E_y[KL(p(\theta|y,x) || p([theta]))]
    # Estimated using nested Monte Carlo
    
    # 1. Sample parameters from posterior given existing data
    posterior_samples = self.get_posterior_samples(existing_data)
    
    # 2. For each parameter sample, calculate:
    eig_samples = []
    for theta in posterior_samples:
        # Likelihood of observing y given theta and query x
        y_sample = self.simulate_outcome(query_point, theta)
        log_likelihood = self.log_prob(y_sample | theta, query_point)
        
        # Marginal likelihood p(y|x)
        log_marginal = self.estimate_marginal_likelihood(y_sample, query_point)
        
        eig_samples.append(log_likelihood - log_marginal)
    
    return np.mean(eig_samples)

3. EIG Regret

Compares agent's experimental choices to optimal designs:

# For each agent query
agent_eig = goal.expected_information_gain(agent_query)

# Find optimal query
optimal_query = max(random_queries, key=lambda q: goal.expected_information_gain(q))
optimal_eig = goal.expected_information_gain(optimal_query)

# Regret = missed information
regret = optimal_eig - agent_eig

4. Communication Evaluation (Discovery Setting)

This metric is computed only when experiment_type="discovery" – i.e. the direct_discovery goal variants.
It measures how effectively a scientist agent can convey its understanding of the environment to a naive agent that is not allowed to run any experiments:

  1. After finishing its experiments, the scientist receives a prompt from Goal.get_comm_prompt (respecting the com_limit word budget) and produces a natural-language explanation.
  2. The naive agent is initialized with a system prompt containing Goal.get_naive_system_message followed only by the scientist's explanation – it does not see the raw data or experiment log.
  3. The naive agent is then evaluated on the standard prediction questions using evaluate(...). Its accuracy (and standard deviation) constitute the communication score.
# 1. Scientist writes explanation
prompt = goal.get_comm_prompt(com_limit=200, include_prior=True)
explanation = scientist.prompt_llm(prompt)

# 2. Build naive agent
system_msg = goal.get_naive_system_message(include_prior=True) + explanation
naive_agent.set_system_message(system_msg)

# 3. Evaluate naive agent
(comm_acc, comm_std), _, _, _ = evaluate(
    final_results, goal, naive_agent, num_evals, include_prior=True
)

Higher accuracy indicates clearer, more informative explanations produced by the scientist agent.

Running Experiments

Basic Experiment

python run_experiment.py \
    seed=1 \
    llms=openai \
    include_prior=true \
    exp=oed \
    envs=hyperbolic_direct

Batch Experiments

Use the provided shell scripts:

# Run all hyperbolic discount experiments
bash scripts/hyperbolic.sh

# Run with Box's Loop for model building
python run_experiment.py \
    seed=1 \
    llms=openai \
    include_prior=true \
    exp=oed \
    envs=hyperbolic_direct \
    use_ppl=true

EIG Regret Analysis

# First run experiments
python run_experiment.py ...

# Then calculate EIG regret
python run_eig_regret.py \
    seed=1 \
    num_random=100 \ # number of samples for the MC estimate
    box=false

Configuration System

BoxingGym uses Hydra for configuration management:

Configuration Structure

conf/
├── config.yaml          # Main configuration
├── llms/               # LLM configurations
│   └── openai.yaml
├── exp/                # Experiment types
│   ├── oed.yaml
│   └── discovery.yaml
└── envs/               # Environment configs
    ├── hyperbolic_direct.yaml
    └── ...

Example Configuration

conf/envs/hyperbolic_direct.yaml:

num_evals: 10                    # Number of evaluation questions
env_name: "hyperbolic_temporal_discount"
goal_name: "direct"             # Goal type
com_limit: 200                  # Word limit for explanations
env_params:                     # Environment-specific parameters
  epsilon: 0.01
  k_mean: -4.25
  k_std: 0.5
  alpha_scale: 2

Creating Custom Configurations

# conf/exp/my_experiment.yaml
num_experiments: [0, 5, 10, 20]  # Experiment budgets to test
experiment_type: "oed"

Analysis Tools

Results Structure

Results are saved as JSON files:

results/
└── hyperbolic_temporal_discount/
    ├── direct_gpt-4o_oed_true_1.json      # Main results
    └── regret_direct_gpt-4o_oed_true_1.json  # EIG analysis

Result File Contents

{
  "config": {...},
  "data": {
    "results": [[accuracy, std], ...],
    "queries": ["[10, 20, 5]", ...],
    "observations": [1, 0, ...],
    "successes": [true, true, ...],
    "eigs": [0.23, 0.18, ...],
    "programs": ["PyMC model code..."]  // If using Box's Loop
  },
  "scientist_messages": [...],
  "naive_messages": [...]  // For discovery tasks
}

Analysis Scripts

The analysis/ directory contains Jupyter notebooks for:

  • Plotting learning curves across number of experiments
  • Comparing agent performance
  • Visualizing EIG Regret
  • Statistical significance testing

Box's Loop Integration

When use_ppl=true, the system uses Box's Loop for automated model building:

  1. Data Collection: Agent experiments generate dataset
  2. Model Proposal: LLM proposes PyMC statistical models
  3. Model Evaluation: Models scored using LOO-CV
  4. Model Criticism: LLM analyzes discrepancies
  5. Iteration: Process repeats with refinements

This provides agents with explicit probabilistic models to guide experimentation.

Contributing

We welcome contributions! Areas of interest:

  • New Environments: Implement novel experimental domains
  • Goal Types: Design new evaluation objectives
  • Agent Strategies: Develop improved experimental design algorithms
  • Analysis Tools: Create visualization and statistical tools

Please follow the environment and goal interfaces described above, and include comprehensive tests and documentation.

If you want to contribute, please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes.
  4. Commit your changes (git commit -am 'Add new feature').
  5. Push to the branch (git push origin feature-branch).
  6. Create a new Pull Request.
  7. For major changes, please open an issue first to discuss what you would like to change.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors