rl-toybox is a compact reinforcement-learning playground built around short arcade-style games. Each game is small enough to inspect end to end, while shared code handles configuration, training, evaluation, rendering, logging, and model artifacts.

- `core/value_discrete/` contains the shared value-based stack used by `snake` and `bang`.
- `core/actor_critic/` contains the shared PPO/SAC stack plus centralized-critic support used by `jump`, `vroom`, and `kick`.
- `core/search_play/` contains the compact MCTS, policy/value, and self-play stack used by `flip`.
- `core/algorithms/` contains the shared algorithm interface and cross-family helpers.
- `core/envs/` contains the base environment contract and the Arcade runtime mixin.
- `core/shared_config.py` contains the shared runtime/window defaults used across the active games.
- `core/game.py` owns the active game registry, compatibility checks, config composition, and shared run preparation.
- `games/<name>/` contains each game's environment, configuration, and game-specific README.

- Repo guide: docs/repo-guide.md
- RL and environment design guide: docs/rl-design-guide.md

With package install:

    pip install -e .
    rl-toybox-train --game bang --mode team_arena
    rl-toybox-play-ai --game bang --mode team_arena --model best --render
    rl-toybox-play-user --game bang --mode team_arena

Without installation, from the repo root:

    python -m scripts.train --game bang --mode team_arena
    python -m scripts.play_ai --game bang --mode team_arena --model best --render
    python -m scripts.play_user --game bang --mode team_arena

`play_ai` loads `best` by default, so `--model best` is shown only to make the artifact choice explicit. Curriculum-based games use a shared L1 to L5 ladder, with training defaulting to L1 and play/eval/capture defaulting to L5. `flip` resolves to a fixed L1 for training, play, evaluation, and capture because its board is not staged.
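
The default-level rule above can be sketched as a small lookup. This is only an illustration: the function name `resolve_default_level`, the purpose strings, and the `UNSTAGED_GAMES` set are hypothetical and are not taken from the rl-toybox codebase.

```python
# Hypothetical sketch of the default-level rule described above.
# None of these names come from the rl-toybox codebase.
UNSTAGED_GAMES = {"flip"}  # games whose environment is not staged by level
PURPOSES = {"train", "play", "eval", "capture"}


def resolve_default_level(game: str, purpose: str) -> int:
    """Return the default curriculum level (1 to 5) for a game and run purpose."""
    if purpose not in PURPOSES:
        raise ValueError(f"unknown purpose: {purpose!r}")
    if game in UNSTAGED_GAMES:
        return 1  # flip stays at L1 for every purpose
    # Staged games train from the bottom of the ladder and are shown at the top.
    return 1 if purpose == "train" else 5


if __name__ == "__main__":
    for game, purpose in [("bang", "train"), ("bang", "play"), ("flip", "eval")]:
        print(game, purpose, f"L{resolve_default_level(game, purpose)}")
```
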

Bang has one game id and selectable combat modes:

    rl-toybox-train --game bang --mode duel
    rl-toybox-train --game bang --mode arena
    rl-toybox-train --game bang --mode team_arena

For Bang, `mode` defines the maximum format, and the curriculum activates more enemies as levels rise. `team_arena` trains two friendly RL agents with one shared DQN policy and is the recommended max-complexity training mode for a general Bang policy; `duel` and `arena` are simpler subset cases of the same 36-input / 8-action net.

Kick has one game id and selectable team-size modes:

    rl-toybox-train --game kick --team-size 3
    rl-toybox-train --game kick --team-size 5
    rl-toybox-train --game kick --team-size 7

Training prints compact single-line progress records. `Ep:` lines show environment performance: episode length, reward, rolling reward, best reward for the level, success, average success, and optional reward components. PPO / coach-critic runs also print `Up:` optimizer-health lines. PPO-style updates use `Pi` for policy loss, `V` for value loss, `EV` for critic explained variance, `Ent` for entropy, and `KL` for approximate KL. SAC update lines are opt-in and are quiet for Vroom by default.
`EV` is `1 - Var(returns - values) / Var(returns)`: a value near 1.0 indicates a strong critic fit, a value around 0.0 means the critic adds little over a constant baseline, and a negative value means it does worse than predicting the mean return.
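
A minimal sketch of the explained-variance formula above, in plain Python. The function name `explained_variance`, the use of population variance, and the NaN result for constant returns are assumptions made for illustration; the repo's logging code may differ on each point.

```python
from statistics import pvariance


def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns), using population variance."""
    if len(returns) != len(values):
        raise ValueError("returns and values must have the same length")
    var_returns = pvariance(returns)
    if var_returns == 0.0:
        return float("nan")  # assumption: EV is undefined for constant returns
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


if __name__ == "__main__":
    returns = [1.0, 2.0, 3.0, 4.0]
    print(explained_variance(returns, returns))               # perfect critic
    print(explained_variance(returns, [2.5] * 4))             # predicts the mean
    print(explained_variance(returns, [4.0, 3.0, 2.0, 1.0]))  # anti-correlated
```

The three calls cover the regimes described above: a perfect critic, a critic that only predicts the mean return, and a critic whose predictions are worse than the mean.
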

| Game ID | Role | Family | Summary | Docs |
|---|---|---|---|---|
| `snake` | Intro grid-control game | value-based | Classic Snake with obstacle curriculum, compact egocentric observations, and lightweight shaping rewards | games/snake/README.md |
| `bang` | Flagship discrete-control arena game | value-based | Top-down arena shooter with Duel, Arena, and shared-policy Team Arena modes under one DQN IO shape; curriculum ramps active enemies | games/bang/README.md |
| `jump` | Traversal platformer | actor-critic | Compact side-view micro-platformer built around short procedural runs, timing windows, and simple left/right/jump control | games/jump/README.md |
| `vroom` | Continuous-control racing game | actor-critic | One-lap top-down racer with procedural tracks, compact vector observations, and SAC-oriented defaults | games/vroom/README.md |
| `flip` | Planning + self-play capstone | search + self-play | Fixed 6x6 disc-flipping game using MCTS, self-play, legal placement masking, and a small policy/value network | games/flip/README.md |
| `kick` | Scalable multi-agent football | actor-critic / CTDE | Shared-policy football environment for 3v3, 5v5, and 7v7 modes with one semantic kick action and a 128-input coach critic | games/kick/README.md |


| Step | Game | Focus | What to Look For |
|---|---|---|---|
| 1 | `snake` | Q-learning / value methods | Smallest environment, discrete actions, reward shaping |
| 2 | `bang` | DQN-style value control | Replay, richer observations, arcade combat dynamics |
| 3 | `jump` | PPO / on-policy actor-critic | Policy gradients, advantage estimation, traversal/platforming |
| 4 | `vroom` | SAC / continuous control | Continuous actions, entropy, smooth control |
| 5 | `flip` | MCTS + self-play | Planning, legal actions, policy/value search |
| 6 | `kick` | Multi-agent CTDE | Shared policy, centralized critic, cooperative agents |

- Arcade / egocentric control: `SELF -> SENS -> TGT/LAND/OPP -> HAZ -> FLAG`
- Team / CTDE control: `SELF -> TGT -> LAND -> ALLY -> OPP`, with optional `MAP` or `FLAG` blocks when a game needs them
- Board self-play / search: `BOARD` only; legal moves stay outside the observation via action masking
- Blocks can be omitted when they do not apply. Compact canonical prefixes are `self_`, `sens_`, `tgt_`, `land_`, `ally_`, `opp_`, `map_`, `haz_`, `flag_`, and `board_`.

Active examples:

- `snake`: `self_*`, `sens_*`, `tgt_*`
- `bang`: `self_*`, `sens_*`, `ally_*`, `opp*_*`, `haz_*`
- `jump`: `self_*`, `sens_*`, `land_*`, `opp*_*`, `haz_*`, `flag_*`
- `vroom`: `self_*`, `sens_*`, `flag_*`
- `kick`: `self_*`, `tgt_*`, `land_*`, `ally*_*`, `opp*_*`
- `flip`: `board_r*_c*`

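
As a rough illustration of the prefix convention above, the sketch below maps observation names to their block and recovers the block order of a layout. The helper names and the sample observation names (`self_speed`, `opp0_dx`, and so on) are invented for this example; it also assumes that an indexed prefix such as `opp*_` means the prefix followed by an integer index.

```python
import re

# Canonical prefixes from the list above, optionally followed by an index
# (for example opp0_ or ally1_). The index interpretation is an assumption.
_BLOCK_RE = re.compile(r"^(self|sens|tgt|land|ally|opp|map|haz|flag|board)\d*_")


def block_of(name: str) -> str:
    """Return the upper-case block name (e.g. 'OPP') for an observation name."""
    match = _BLOCK_RE.match(name)
    if match is None:
        raise ValueError(f"no canonical prefix in {name!r}")
    return match.group(1).upper()


def block_sequence(names):
    """Collapse consecutive names that share a block into one block entry."""
    sequence = []
    for name in names:
        block = block_of(name)
        if not sequence or sequence[-1] != block:
            sequence.append(block)
    return sequence


if __name__ == "__main__":
    # Hypothetical arcade-style layout; real names live in each game's config.py.
    layout = ["self_speed", "sens_front", "opp0_dx", "opp1_dx", "haz_near", "flag_done"]
    print(" -> ".join(block_sequence(layout)))
```

A check like this could be used to confirm that a game's observation order follows the intended block sequence before training starts.
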
Per-game `config.py` files own the exact observation/action names, order, dimensions, model defaults, algorithm overrides, and training budgets. The standard active-game template is `DEFAULT_ALGO`, `DEFAULT_MODEL_CONFIG`, `ALGO_CONFIG_OVERRIDES`, and `DEFAULT_TRAIN_CONFIG`. `DEFAULT_MODEL_CONFIG["hidden_sizes"]` sets the game-wide network size across supported models, and `DEFAULT_MODEL_CONFIG["critic_hidden_sizes"]` sets the separate critic shape when a game uses one. `ALGO_CONFIG_OVERRIDES[algo_id]` is for true algorithm-specific values such as PPO entropy, DQN replay settings, or search-play simulations. `DEFAULT_TRAIN_CONFIG["budget"]` controls when a game's training run stops. The budget is counted in total environment steps for the value-based and actor-critic families, and in self-play games for `search_play`.
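
A hedged sketch of what a `config.py` following this template might look like. Only the four top-level names and the `hidden_sizes` / `budget` keys come from the text above; the `qlearn` choice and the `[32]` hidden size mirror the `snake` defaults, while the override key `gamma`, the budget value, the list type for `hidden_sizes`, the family ids, and the `budget_unit` helper are hypothetical.

```python
# Hypothetical games/<name>/config.py fragment; values are illustrative only.
DEFAULT_ALGO = "qlearn"

DEFAULT_MODEL_CONFIG = {
    "hidden_sizes": [32],  # game-wide network size (list type is an assumption)
    # "critic_hidden_sizes": [...]  # only for games with a separate critic
}

ALGO_CONFIG_OVERRIDES = {
    "qlearn": {"gamma": 0.99},  # hypothetical algorithm-specific override
}

DEFAULT_TRAIN_CONFIG = {
    "budget": 100_000,  # hypothetical value; unit depends on the family
}


def budget_unit(family: str) -> str:
    """Describe the unit of DEFAULT_TRAIN_CONFIG['budget'] for a family id.

    The family ids reuse the core/ directory names and are an assumption.
    """
    if family == "search_play":
        return "self-play games"
    if family in {"value_discrete", "actor_critic"}:
        return "environment steps"
    raise ValueError(f"unknown family: {family!r}")


if __name__ == "__main__":
    overrides = ALGO_CONFIG_OVERRIDES.get(DEFAULT_ALGO, {})
    print(DEFAULT_ALGO, DEFAULT_MODEL_CONFIG["hidden_sizes"], overrides)
    print(DEFAULT_TRAIN_CONFIG["budget"], budget_unit("value_discrete"))
```
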

- `snake` -> `qlearn`, `obs=12`, `act=3`, Q-network `12 -> 32 -> 3`
- `bang` -> `dqn`, default `team_arena`, selectable `duel` / `arena` / `team_arena`, `obs=36`, `act=8`, Q-network `36 -> 64 -> 64 -> 8` with double-Q, a dueling head, and prioritized replay
- `jump` -> `ppo`, `obs=36`, `act=4`, actor `36 -> 32 -> 32 -> 4`, critic `36 -> 32 -> 32 -> 1`
- `vroom` -> `sac`, `obs=32`, `act=3`, actor `32 -> 64 -> 64 -> 3`, twin critics `(32 + 3) -> 64 -> 64 -> 1`
- `flip` -> `search_play`, fixed `6x6`, `obs=36`, `act=36`, policy/value net `36 -> 48 -> 48 -> (36 + 1)`
- `kick` -> `ppo`, run tag `a64_64_c128_128`, `obs=36/player`, `act=10`, shared actor `36 -> 64 -> 64 -> 10`, coach critic `128 -> 128 -> 128 -> 1`; scalable `3v3` / `5v5` / `7v7` football with one semantic `kick` action
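
To give a feel for how small these networks are, the sketch below counts trainable parameters for a layer spec written in the `a -> b -> c` notation used above. It assumes plain fully connected layers with biases and no extra heads, so it does not apply as-is to the `bang` dueling head or the combined `flip` policy/value output; the helper names are hypothetical.

```python
def parse_layer_spec(spec: str) -> list[int]:
    """Turn a spec such as '12 -> 32 -> 3' into [12, 32, 3]."""
    return [int(part) for part in spec.split("->")]


def mlp_param_count(layer_sizes: list[int]) -> int:
    """Weights plus biases for a plain fully connected stack (an assumption)."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


if __name__ == "__main__":
    print(mlp_param_count(parse_layer_spec("12 -> 32 -> 3")))        # 515
    print(mlp_param_count(parse_layer_spec("36 -> 32 -> 32 -> 4")))  # 2372
```

Under these assumptions the `snake` Q-network has 515 parameters and the `jump` actor has 2372.
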





