# BoxingGym

Kanishk Gandhi*, Michael Y. Li*, Lyle Goodyear, Agam Bhatia, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman

> "To understand a system, you must perturb it."
>
> – George Box
BoxingGym is a benchmarking framework designed to evaluate the capabilities of language-based agents in experimental design and model discovery. The framework consists of several simulated environments where agents can perform experiments, propose models, and refine them based on collected data.
- Key Features
- Installation
- Available Environments
- Environment Implementation Example
- Agent Evaluation Process
- Metrics and Evaluation
- Running Experiments
- Configuration System
- Analysis Tools
- Contributing
## Key Features

- 10+ Diverse Environments: From physical systems (Lotka-Volterra predator-prey) to psychological models (temporal discounting)
- Multiple Goal Types: Direct prediction, parameter estimation, and system identification
- Box's Loop Integration: Automated statistical model building using PyMC
- EIG-based Evaluation: Quantitative assessment of experimental design quality
- Flexible Agent Interface: Support for any LLM-based agent
- Comprehensive Metrics: Accuracy, MSE, EIG regret, and more
## Installation

```bash
git clone https://github.com/kanishkg/boxing-gym.git
cd boxing-gym
pip install -e .
```

Requirements:
- Python >= 3.11
- PyMC for probabilistic modeling
- OpenAI/Anthropic API keys for LLM agents
## Available Environments

BoxingGym includes the following environments, each with multiple goal configurations:
| Environment | Description | Input Space | Output Space |
|---|---|---|---|
| Hyperbolic Temporal Discount | Models human decision-making between immediate and delayed rewards | (immediate_reward, delayed_reward, delay_days) | Binary choice (0/1) |
| Location Finding | Signal source localization in n-dimensional space | n-dimensional coordinates | Signal intensity |
| Death Process | Disease spread modeling in a population | Time | Number of infected |
| IRT (Item Response Theory) | Student-question performance modeling | (student_id, question_id) | Correctness (0/1) |
| Survival Analysis | Breast cancer patient survival prediction | (metastasized, time_since_surgery) | Survival status (0/1) |
| Dugongs | Sea cow growth modeling | Age | Length |
| Peregrines | Falcon population dynamics | Time | Population count |
| Lotka-Volterra | Predator-prey population dynamics | Time | (prey_count, predator_count) |
| Moral Machines | Ethical decision-making in autonomous vehicles | (group1, group2, intervention) | Choice (1/2) |
| Emotion | Emotion prediction from gambling outcomes | (prizes, probabilities, outcome) | Emotion ratings |
## Environment Implementation Example

Let's examine the Hyperbolic Temporal Discount environment in detail to understand how environments work:
```python
import numpy as np
from scipy.stats import halfnorm

class TemporalDiscount:
    def __init__(self, epsilon=0.01, k_mean=-4.25, k_std=0.5, alpha_scale=2):
        # Parameters define the prior distributions
        self.epsilon = epsilon          # Noise (lapse) parameter
        self.k_mean = k_mean            # Mean of log-normal discount factor
        self.k_std = k_std              # Std of log-normal discount factor
        self.alpha_scale = alpha_scale  # Scale for decision noise
        self.reset()

    def reset(self):
        # Sample true parameters from the prior
        log_k = np.random.normal(self.k_mean, self.k_std)
        k = np.exp(log_k)                             # Discount factor
        alpha = halfnorm.rvs(scale=self.alpha_scale)  # Decision noise
        self.truth = (k, alpha)
        self.observed_data = []
```

The environment models how people choose between immediate and delayed rewards:
```python
    def step(self, iR, dR, Days):
        k, alpha = self.truth
        # Calculate subjective values
        V0 = iR                   # Value of immediate reward
        V1 = dR / (1 + k * Days)  # Discounted value of delayed reward
        # Probabilistic choice using a probit model
        z = (V1 - V0) / alpha
        probability = self.epsilon + (1 - 2 * self.epsilon) * norm.cdf(z)
        choice = np.random.binomial(n=1, p=probability)
        return choice  # 1 = choose delayed, 0 = choose immediate
```

Agents interact through natural language with optional prior knowledge:
```python
    def generate_system_message(self, include_prior=True, goal=None):
        if include_prior:
            # Provide context about human decision-making
            message = """A person has to choose between a delayed reward dR dollars
            in x days and an immediate reward iR dollars today.
            Your goal is to predict their choices.
            Make observations by specifying [iR, dR, D]."""
        else:
            # Abstract version without context
            message = """You are observing a binary response for a tuple of three
            positive integer values. Make observations by specifying [int1, int2, int3]."""
        return message
```

Each environment supports multiple goals:
- DirectGoal: Predict outcomes for new inputs
- DiscountGoal: Estimate the discount factor k
- DirectGoalNaive: Explain findings to another agent
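To make the choice model above concrete, here is a small self-contained sketch that mirrors `step` using only the standard library (the parameter values below are illustrative, not defaults from the benchmark):

```python
import math
import random

def choice_probability(iR, dR, days, k, alpha, epsilon=0.01):
    """P(choose delayed) under the hyperbolic probit model."""
    V0 = iR                   # immediate reward is taken at face value
    V1 = dR / (1 + k * days)  # delayed reward is hyperbolically discounted
    z = (V1 - V0) / alpha
    # Standard normal CDF via erf, with lapse rate epsilon
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return epsilon + (1 - 2 * epsilon) * cdf

# With k = 0.01, $100 in 7 days is worth 100 / 1.07, about 93.5 "today-dollars",
# so a simulated person will usually prefer waiting over taking $50 now.
p = choice_probability(iR=50, dR=100, days=7, k=0.01, alpha=2)
choice = int(random.random() < p)  # 1 = wait, 0 = take the $50 now
```

Note how small `k` values make the agent patient: the discount divisor `1 + k * days` barely shrinks the delayed reward, so the probit argument is large and positive.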
## Agent Evaluation Process

The evaluation process follows these steps:
```python
# Create environment and goal
env = TemporalDiscount()
goal = DirectGoal(env)

# Initialize agent with system message
agent = LMExperimenter(model_name="gpt-4o")
system_message = goal.get_system_message(include_prior=True)
agent.set_system_message(system_message)

# Experimentation loop
for i in range(num_experiments):
    # Agent designs experiment
    observation_request = agent.generate_actions(previous_result)
    # Example: "[50, 100, 7]" (choose between $50 now or $100 in 7 days)

    # Environment provides result
    result, success = env.run_experiment(observation_request)
    # Result: 1 (person chose to wait)

    # Store for analysis
    observations.append((observation_request, result))

# Generate evaluation questions
for _ in range(num_evals):
    question, ground_truth = goal.get_goal_eval_question(include_prior)
    # Example: "What choice for iR=75, dR=100, D=5?"
    prediction = agent.generate_predictions(question)
    predictions.append(prediction)
    ground_truths.append(ground_truth)

# Calculate metrics
accuracy, std = goal.evaluate_predictions(predictions, ground_truths)
```

## Metrics and Evaluation

BoxingGym uses several metrics to evaluate agent performance:
### Predictive Error

Predictive error captures how well an agent can forecast the outcomes of new, unseen trials after it has finished experimenting. For every evaluation question we compare the agent's prediction $\hat{y}$ with the environment-generated ground truth $y$.

- We report mean squared error (MSE) together with its standard error for all environments, whether binary, count, or continuous. Lower is better.
A concise reference implementation looks like this:
```python
import numpy as np

def predictive_error(predictions, ground_truths):
    preds = np.asarray(predictions, dtype=float)
    gts = np.asarray(ground_truths, dtype=float)
    sq_errors = (preds - gts) ** 2
    mse = np.mean(sq_errors)
    sem = np.std(sq_errors) / np.sqrt(len(gts))  # standard error of the MSE estimate
    return mse, sem
```

This metric provides a direct measure of how well the agent has learned the underlying input-output relationship, independent of the optimality of its experimental designs.
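For instance, a pure-Python equivalent of `predictive_error` applied to a toy set of binary predictions (the values are illustrative):

```python
import math

def predictive_error(predictions, ground_truths):
    # Squared error per evaluation question
    sq_errors = [(p - g) ** 2 for p, g in zip(predictions, ground_truths)]
    n = len(sq_errors)
    mse = sum(sq_errors) / n
    # Standard error of the mean squared error
    sem = math.sqrt(sum((e - mse) ** 2 for e in sq_errors) / n) / math.sqrt(n)
    return mse, sem

# Four binary predictions, one of which misses the ground truth
mse, sem = predictive_error([1, 0, 1, 1], [1, 0, 0, 1])
print(mse)  # 0.25 (one squared error of 1 across four trials)
```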
### Expected Information Gain (EIG)

For OED tasks, we calculate the expected information gain of each experiment:
```python
def expected_information_gain(self, query_point, num_outer_samples=1000):
    # EIG = E_y[ KL( p(theta | y, x) || p(theta) ) ]
    # Estimated using nested Monte Carlo

    # 1. Sample parameters from the posterior given existing data
    posterior_samples = self.get_posterior_samples(existing_data)

    # 2. For each parameter sample, compare conditional and marginal likelihoods
    eig_samples = []
    for theta in posterior_samples:
        # Likelihood of observing y given theta and the query x
        y_sample = self.simulate_outcome(query_point, theta)
        log_likelihood = self.log_prob(y_sample, theta, query_point)
        # Marginal likelihood p(y | x)
        log_marginal = self.estimate_marginal_likelihood(y_sample, query_point)
        eig_samples.append(log_likelihood - log_marginal)

    return np.mean(eig_samples)
```

### EIG Regret

EIG regret compares the agent's experimental choices to optimal designs:
```python
# For each agent query
agent_eig = goal.expected_information_gain(agent_query)

# Find the best query among random candidates
optimal_query = max(random_queries, key=lambda q: goal.expected_information_gain(q))
optimal_eig = goal.expected_information_gain(optimal_query)

# Regret = missed information
regret = optimal_eig - agent_eig
```

### Communication Score

This metric is computed only when `experiment_type="discovery"`, i.e. the direct discovery goal variants.
It measures how effectively a scientist agent can convey its understanding of the environment to a naive agent that is not allowed to run any experiments:

- After finishing its experiments, the scientist receives a prompt from `Goal.get_comm_prompt` (respecting the `com_limit` word budget) and produces a natural-language explanation.
- The naive agent is initialized with a system prompt containing `Goal.get_naive_system_message` followed only by the scientist's explanation; it does not see the raw data or experiment log.
- The naive agent is then evaluated on the standard prediction questions using `evaluate(...)`. Its accuracy (and standard deviation) constitute the communication score.
```python
# 1. Scientist writes explanation
prompt = goal.get_comm_prompt(com_limit=200, include_prior=True)
explanation = scientist.prompt_llm(prompt)

# 2. Build naive agent
system_msg = goal.get_naive_system_message(include_prior=True) + explanation
naive_agent.set_system_message(system_msg)

# 3. Evaluate naive agent
(comm_acc, comm_std), _, _, _ = evaluate(
    final_results, goal, naive_agent, num_evals, include_prior=True
)
```

Higher accuracy indicates clearer, more informative explanations produced by the scientist agent.
## Running Experiments

```bash
python run_experiment.py \
    seed=1 \
    llms=openai \
    include_prior=true \
    exp=oed \
    envs=hyperbolic_direct
```

Use the provided shell scripts:
```bash
# Run all hyperbolic discount experiments
bash scripts/hyperbolic.sh
```

```bash
# Run with Box's Loop for model building
python run_experiment.py \
    seed=1 \
    llms=openai \
    include_prior=true \
    exp=oed \
    envs=hyperbolic_direct \
    use_ppl=true
```

```bash
# First run experiments
python run_experiment.py ...

# Then calculate EIG regret (num_random = number of samples for the MC estimate)
python run_eig_regret.py \
    seed=1 \
    num_random=100 \
    box=false
```

## Configuration System

BoxingGym uses Hydra for configuration management:
```
conf/
├── config.yaml    # Main configuration
├── llms/          # LLM configurations
│   └── openai.yaml
├── exp/           # Experiment types
│   ├── oed.yaml
│   └── discovery.yaml
└── envs/          # Environment configs
    ├── hyperbolic_direct.yaml
    └── ...
```
`conf/envs/hyperbolic_direct.yaml`:

```yaml
num_evals: 10        # Number of evaluation questions
env_name: "hyperbolic_temporal_discount"
goal_name: "direct"  # Goal type
com_limit: 200       # Word limit for explanations
env_params:          # Environment-specific parameters
  epsilon: 0.01
  k_mean: -4.25
  k_std: 0.5
  alpha_scale: 2
```

```yaml
# conf/exp/my_experiment.yaml
num_experiments: [0, 5, 10, 20]  # Experiment budgets to test
experiment_type: "oed"
```

Results are saved as JSON files:
```
results/
└── hyperbolic_temporal_discount/
    ├── direct_gpt-4o_oed_true_1.json         # Main results
    └── regret_direct_gpt-4o_oed_true_1.json  # EIG analysis
```
```json
{
  "config": {...},
  "data": {
    "results": [[accuracy, std], ...],
    "queries": ["[10, 20, 5]", ...],
    "observations": [1, 0, ...],
    "successes": [true, true, ...],
    "eigs": [0.23, 0.18, ...],
    "programs": ["PyMC model code..."]  // If using Box's Loop
  },
  "scientist_messages": [...],
  "naive_messages": [...]  // For discovery tasks
}
```

## Analysis Tools

The `analysis/` directory contains Jupyter notebooks for:
- Plotting learning curves across number of experiments
- Comparing agent performance
- Visualizing EIG Regret
- Statistical significance testing
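As a starting point for such analyses, the sketch below loads a results payload following the JSON schema shown above; the dictionary is inlined here (with made-up numbers) rather than read from `results/`:

```python
import json

# A made-up results payload following the documented schema
raw = json.dumps({
    "config": {"env_name": "hyperbolic_temporal_discount"},
    "data": {
        "results": [[0.55, 0.05], [0.70, 0.04], [0.80, 0.03]],
        "queries": ["[10, 20, 5]", "[50, 100, 7]", "[5, 200, 30]"],
        "eigs": [0.23, 0.18, 0.31],
    },
})

results = json.loads(raw)
# One (accuracy, std) pair per evaluation checkpoint: the learning curve
accuracies = [acc for acc, _std in results["data"]["results"]]
mean_eig = sum(results["data"]["eigs"]) / len(results["data"]["eigs"])
print(accuracies)  # [0.55, 0.7, 0.8]
```

In a real notebook, `raw` would instead come from `open(path).read()` for each file under `results/`.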
When use_ppl=true, the system uses Box's Loop for automated model building:
- Data Collection: Agent experiments generate dataset
- Model Proposal: LLM proposes PyMC statistical models
- Model Evaluation: Models scored using LOO-CV
- Model Criticism: LLM analyzes discrepancies
- Iteration: Process repeats with refinements
This provides agents with explicit probabilistic models to guide experimentation.
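The model-scoring step can be illustrated without PyMC. The sketch below fits nothing; it simply compares two candidate discounting curves (hyperbolic vs. exponential, both hypothetical stand-ins for LLM-proposed PyMC programs) by their log likelihood on held-out choices, a simplified stand-in for LOO-CV:

```python
import math

def choice_loglik(data, discount, k, alpha=2.0, eps=0.01):
    """Log likelihood of observed choices under a given discount curve."""
    total = 0.0
    for iR, dR, days, choice in data:
        v1 = discount(dR, days, k)
        # Probit choice probability with lapse rate eps, as in the environment
        p = eps + (1 - 2 * eps) * 0.5 * (1 + math.erf((v1 - iR) / alpha / math.sqrt(2)))
        total += math.log(p if choice == 1 else 1 - p)
    return total

hyperbolic = lambda dR, days, k: dR / (1 + k * days)
exponential = lambda dR, days, k: dR * math.exp(-k * days)

# Held-out choices (illustrative): (iR, dR, days, observed_choice)
held_out = [(50, 100, 7, 1), (80, 100, 30, 0), (20, 100, 90, 1)]

scores = {
    "hyperbolic": choice_loglik(held_out, hyperbolic, k=0.01),
    "exponential": choice_loglik(held_out, exponential, k=0.01),
}
best = max(scores, key=scores.get)  # the candidate the loop would refine next
```

In the actual pipeline, this comparison is done with LOO-CV over posterior predictive densities rather than a single point estimate of `k`.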
## Contributing

We welcome contributions! Areas of interest:
- New Environments: Implement novel experimental domains
- Goal Types: Design new evaluation objectives
- Agent Strategies: Develop improved experimental design algorithms
- Analysis Tools: Create visualization and statistical tools
Please follow the environment and goal interfaces described above, and include comprehensive tests and documentation.
If you want to contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature-branch`).
3. Make your changes.
4. Commit your changes (`git commit -am 'Add new feature'`).
5. Push to the branch (`git push origin feature-branch`).
6. Create a new Pull Request.

For major changes, please open an issue first to discuss what you would like to change.