Robot Action Representation

Comparing diffusion-based, transformer-based, and autoregressive policy architectures for robotic manipulation on the Push-T task using the LeRobot framework. Read the blog post for details.

Policies

Policy	Type	Description
VQ-BeT	Transformer	Built-in LeRobot VQ-BeT implementation
DiTFlow	Diffusion	Pi0-inspired flow-based diffusion transformer with velocity ODE formulation and adaptive layer normalization
ARBeT	Autoregressive	Autoregressive behaviour transformer using ScribeTokens — encodes (x,y) actions as BPE-compressed Freeman chain codes and predicts them with cross-entropy loss
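
As a rough illustration of the "velocity ODE formulation" in the DiTFlow row, the sketch below integrates dx/dt = v(x, t) with fixed-step Euler. It is not the repository's implementation: `euler_sample` and `make_toy_velocity` are hypothetical names, the toy velocity field stands in for the learned transformer, and running from noise at t=0 to actions at t=1 is an assumed convention.

```python
import numpy as np


def euler_sample(velocity_fn, x0, num_steps=100):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


def make_toy_velocity(target):
    # Hypothetical stand-in for the learned network: on a straight-line path
    # towards a fixed target, the velocity at time t is (target - x) / (1 - t).
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity


rng = np.random.default_rng(0)
target = np.tile([0.5, -0.25], (16, 1))   # a chunk of 16 two-dimensional actions
noise = rng.standard_normal(target.shape)  # start from Gaussian noise
actions = euler_sample(make_toy_velocity(target), noise, num_steps=100)
print(np.allclose(actions, target))  # True
```

In a real policy the velocity function would also take the visual and state conditioning as input; the toy field ignores that so the example stays self-contained.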

Setup

Requires Python 3.12+ and uv.

uv sync

Usage

Training

make train-vqbet     # Train VQ-BeT
make train-ditflow   # Train DiTFlow
make train-arbet     # Train ARBeT

Resume from a checkpoint:

make resume POLICY=<policy> CONFIG_PATH=outputs/train/<date>/<time>_<policy>/checkpoints/<step>/pretrained_model/train_config.json

Testing & Linting

make test     # Run pytest
make lint     # Lint with ruff (auto-fix)
make format   # Format with ruff

Project Structure

├── src/
│   ├── adaln_transformer.py        # Shared AdaLN transformer blocks (DiT/Pi0-style modulation)
│   ├── scribe_tokenizer.py         # (x,y) trajectory ↔ BPE chain-code tokens (wraps tokink)
│   └── lerobot-policy-arbet/       # ARBeT policy package (editable install)
├── vendor/
│   └── lerobot-policy-ditflow/     # DiTFlow policy package (editable install)
├── scripts/
│   ├── train_arbet.py              # ARBeT training entry point (wraps lerobot_train with ScribeTokenDataset)
│   ├── make_gif.py                 # Generate GIFs from rollout videos
│   └── visualize_rollouts.py       # Visualize policy rollouts
├── tests/                          # Tests for shared modules and ARBeT
├── Makefile                        # Training, eval, and dev commands
└── pyproject.toml

Architecture

All policies share the same data pipeline:

Input: Push-T dataset — 96x96 RGB images + 2D end-effector state
Vision encoder: Crop to 84x84, encode with DiffusionRgbEncoder (ResNet18 + spatial softmax)
Action prediction: Predict a chunk of actions and execute a subset per step
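
The "predict a chunk, execute a subset" loop can be sketched as follows. This is a minimal illustration under assumptions, not LeRobot's actual code: `run_receding_horizon`, `predict_chunk`, and `toy_predict_chunk` are hypothetical names, and the toy predictor returns labelled placeholders rather than real actions.

```python
from collections import deque


def run_receding_horizon(predict_chunk, observations, n_action_steps):
    """Execute the first n_action_steps of each predicted chunk, then replan."""
    queue = deque()
    executed = []
    for obs in observations:
        if not queue:
            # Queue is empty: predict a fresh chunk and keep only its head.
            chunk = predict_chunk(obs)
            queue.extend(chunk[:n_action_steps])
        executed.append(queue.popleft())
    return executed


def toy_predict_chunk(obs, horizon=16):
    # Hypothetical stand-in for a policy: tags each action with the
    # observation it was planned from and its index inside the chunk.
    return [(obs, k) for k in range(horizon)]


executed = run_receding_horizon(toy_predict_chunk, range(20), n_action_steps=8)
replans = sorted({obs for obs, _ in executed})
print(replans)  # [0, 8, 16]
```

With a 16-step chunk and 8 executed steps, the policy is queried once every 8 environment steps and the second half of each chunk is discarded.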

The policies differ in how they represent and predict actions:

DiTFlow denoises continuous action vectors through 100 Euler ODE steps, conditioned on visual features via AdaLN modulation (horizon=16, executes 8)
ARBeT discretizes (x,y) trajectories into directional tokens (Bresenham decomposition → Freeman chain codes → BPE compression) and predicts them autoregressively (horizon=32, executes 2 action steps / 8 token steps)
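
The first two stages of the ARBeT encoding (Bresenham decomposition, then Freeman chain codes) can be sketched as below. This is an assumption-laden illustration, not the tokink or ScribeTokens API: all function names are hypothetical, the input points are assumed to be already quantized to an integer grid, the direction numbering (0 = +x, counterclockwise, y up) is one common convention that may differ from the library's, and the BPE compression stage is omitted.

```python
# Freeman 8-direction codes, counterclockwise from +x (y pointing up).
FREEMAN = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}
STEPS = {code: step for step, code in FREEMAN.items()}


def bresenham_steps(x0, y0, x1, y1):
    """Yield the unit (dx, dy) steps of the Bresenham line between two grid points."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 > x0 else -1
    sy = 1 if y1 > y0 else -1
    err = dx - dy
    x, y = x0, y0
    while (x, y) != (x1, y1):
        e2 = 2 * err
        step_x = step_y = 0
        if e2 > -dy:
            err -= dy
            step_x = sx
        if e2 < dx:
            err += dx
            step_y = sy
        x, y = x + step_x, y + step_y
        yield step_x, step_y


def encode(points):
    """Turn a polyline of integer (x, y) points into a list of Freeman codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        codes.extend(FREEMAN[step] for step in bresenham_steps(x0, y0, x1, y1))
    return codes


def decode(start, codes):
    """Replay Freeman codes from a start point to recover the grid path."""
    x, y = start
    path = [(x, y)]
    for code in codes:
        dx, dy = STEPS[code]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


codes = encode([(0, 0), (3, 1), (3, -1)])
print(codes)                      # [0, 1, 0, 6, 6]
print(decode((0, 0), codes)[-1])  # (3, -1)
```

Each waypoint pair expands into several unit steps, which is why a few action steps can correspond to a larger number of token steps before BPE merges frequent code sequences.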

Key Dependencies

LeRobot — training framework, dataset loading, environment wrappers
gym-pusht — Push-T simulation
tokink — BPE-compressed chain-code tokenization for digital ink
rerun-sdk — visualization and debugging
