Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers


🚀 Project Website: https://ian-chuang.github.io/gaze-av-aloha/

This repository contains the official code for our paper: "Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers"

We propose a human-inspired foveated vision framework for robot learning that combines human gaze, foveated ViTs, and robotic control to enable policies that are both efficient and robust. Our approach reduces ViT computation by 94%, accelerating training by 7× and inference by 3×.
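To make the idea concrete, here is a conceptual sketch of foveated tokenization (illustrative only; the repo's actual scheme may differ): a full-resolution crop around the gaze point supplies "fovea" tokens, while the whole image at low resolution supplies a few "periphery" tokens, so the ViT processes far fewer tokens than a full-resolution patch grid.

# Conceptual sketch of foveated tokenization (illustrative; not the repo's exact scheme).
import torch
import torch.nn.functional as F

def foveated_tokens(img, gaze_xy, fovea=64, coarse=32, patch=16):
    """img: (3, H, W) tensor; gaze_xy: (x, y) pixel coordinates of the gaze point."""
    _, H, W = img.shape
    x, y = gaze_xy
    # Fovea: full-resolution crop centered on the gaze point (clamped to image bounds).
    top = max(0, min(H - fovea, y - fovea // 2))
    left = max(0, min(W - fovea, x - fovea // 2))
    fovea_crop = img[:, top:top + fovea, left:left + fovea]
    # Periphery: the entire image, heavily downsampled.
    periphery = F.interpolate(img[None], size=(coarse, coarse),
                              mode="bilinear", align_corners=False)[0]

    def patchify(t):  # (3, S, S) -> (num_patches, 3 * patch * patch)
        t = t.unfold(1, patch, patch).unfold(2, patch, patch)
        return t.reshape(3, -1, patch, patch).permute(1, 0, 2, 3).flatten(1)

    return torch.cat([patchify(fovea_crop), patchify(periphery)], dim=0)

img = torch.rand(3, 224, 224)
print(foveated_tokens(img, (120, 100)).shape)  # (20, 768): 16 fovea + 4 periphery tokens

For comparison, a full-resolution 224×224 grid with 16×16 patches yields 196 tokens; the sketch above yields 20. Reductions of this kind are what drive the training and inference speedups.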

We collect bimanual robot demonstrations with synchronized human eye-tracking using the AV-ALOHA simulation platform. This repository provides code and instructions for installation, dataset preparation, model training, policy evaluation, and data collection. If you find this work useful, please cite:

@misc{chuang2025lookfocusactefficient,
      title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers}, 
      author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
      year={2025},
      eprint={2507.15833},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.15833}, 
}

AV-ALOHA Simulation Datasets

The table below lists all available AV-ALOHA simulation datasets with human eye-tracking annotations. Each dataset includes at least 100 episodes and a link for interactive visualization.

| Dataset | Eye Data | Episodes | Visualization |
|---|---|---|---|
| AV ALOHA Sim Peg Insertion | ✓ | 100 | View |
| AV ALOHA Sim Cube Transfer | ✓ | 200 | View |
| AV ALOHA Sim Thread Needle | ✓ | 200 | View |
| AV ALOHA Sim Pour Test Tube | ✓ | 100 | View |
| AV ALOHA Sim Hook Package | ✓ | 100 | View |
| AV ALOHA Sim Slot Insertion | ✓ | 100 | View |

Installation

Follow the steps below to set up the environment and install all necessary dependencies.

# Clone the repository and initialize submodules
git clone https://github.com/ian-chuang/gaze-av-aloha.git
cd gaze-av-aloha
git submodule init
git submodule update

# Create and activate a new Conda environment
conda create -n gaze python=3.10
conda activate gaze

# Install LeRobot
pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46

# Install FFmpeg for video logging
conda install ffmpeg=7.1.1 -c conda-forge

# Install AV-ALOHA packages
pip install -e ./gym_av_aloha
pip install -e ./gaze_av_aloha
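
To verify the installation, run a quick import check (assuming the Python import names match the package directory names):

# Sanity check: both packages should import without errors
python -c "import gym_av_aloha, gaze_av_aloha; print('ok')"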

Authentication

Make sure you're logged in to both Weights & Biases and Hugging Face:

wandb login
huggingface-cli login

Download and Preprocess Dataset

We use the LeRobot dataset format for ease of sharing and visualization via Hugging Face. However, LeRobot's dataloader can be slow, so we convert each dataset into a custom AVAlohaDataset format based on Zarr for faster access during training.

Available Dataset Repository IDs

  • iantc104/av_aloha_sim_cube_transfer
  • iantc104/av_aloha_sim_peg_insertion
  • iantc104/av_aloha_sim_slot_insertion
  • iantc104/av_aloha_sim_hook_package
  • iantc104/av_aloha_sim_pour_test_tube
  • iantc104/av_aloha_sim_thread_needle

Conversion Instructions

To convert a dataset to Zarr format, run the following command from the project root:

python gym_av_aloha/scripts/convert_lerobot_to_avaloha.py --repo_id <dataset_repo_id>

For example:

python gym_av_aloha/scripts/convert_lerobot_to_avaloha.py --repo_id iantc104/av_aloha_sim_thread_needle

Converted datasets will be saved under:

gym_av_aloha/outputs/
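
Once converted, a dataset can be inspected directly with the zarr library — a minimal sketch, assuming the converter writes a Zarr store at the path below (the exact path and array layout may differ):

# Inspect a converted AVAlohaDataset Zarr store (path is illustrative)
import zarr

root = zarr.open("gym_av_aloha/outputs/av_aloha_sim_thread_needle", mode="r")
print(root.tree())  # prints the group/array hierarchy with shapes and dtypes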

AV-ALOHA Benchmark

Train and evaluate policies using train.py.

Exact commands used in our simulation experiments are provided in experiments.txt. Pretrained weights are available and can be loaded via Hydra configuration.

Pretrained ViT Weights (MAE Pretrained)

Pretrained Gaze Models (Task-Specific)

Available Task Configs

  • av_aloha_sim_cube_transfer
  • av_aloha_sim_peg_insertion
  • av_aloha_sim_slot_insertion
  • av_aloha_sim_hook_package
  • av_aloha_sim_pour_test_tube
  • av_aloha_sim_thread_needle

Train & Evaluate Policies

Fov-Act (end-to-end gaze as action):

python gaze_av_aloha/scripts/train.py \
  policy=foveated_vit_policy \
  task=<task e.g. av_aloha_sim_thread_needle> \
  policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
  policy.optimizer_lr_backbone=1e-5 \
  wandb.enable=true \
  wandb.project=<project name> \
  wandb.entity=<your wandb entity> \
  wandb.job_name=fov-act \
  device=cuda

Fov-UNet (two-stage with pretrained gaze model):

python gaze_av_aloha/scripts/train.py \
  policy=foveated_vit_policy \
  task=<task e.g. av_aloha_sim_thread_needle> \
  policy.use_gaze_as_action=false \
  policy.gaze_model_repo_id=<gaze model e.g. iantc104/gaze_model_av_aloha_sim_thread_needle> \
  policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
  policy.optimizer_lr_backbone=1e-5 \
  wandb.enable=true \
  wandb.project=<project name> \
  wandb.entity=<your wandb entity> \
  wandb.job_name=fov-unet \
  device=cuda

Fine (full-res ViT baseline):

python gaze_av_aloha/scripts/train.py \
  policy=vit_policy \
  task=<task e.g. av_aloha_sim_thread_needle> \
  policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_vit \
  policy.optimizer_lr_backbone=1e-5 \
  wandb.enable=true \
  wandb.project=<project name> \
  wandb.entity=<your wandb entity> \
  wandb.job_name=fine \
  device=cuda

Coarse (low-res ViT baseline):

python gaze_av_aloha/scripts/train.py \
  policy=low_res_vit_policy \
  task=<task e.g. av_aloha_sim_thread_needle> \
  policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_low_res_vit \
  policy.optimizer_lr_backbone=1e-5 \
  wandb.enable=true \
  wandb.project=<project name> \
  wandb.entity=<your wandb entity> \
  wandb.job_name=coarse \
  device=cuda

Additional Resources

AV-ALOHA Simulation Data Collection

To collect simulation data using AV-ALOHA:

MAE Pretraining

We provide MAE-pretrained Vision Transformers used for:

  • Foveated
  • Fine (Full-Res)
  • Coarse (Low-Res)

Details and training scripts are located in: 📄 pretrain/README.md

Gaze Model Training

Train a simple UNet-based model for gaze prediction using the script:

python gaze_av_aloha/scripts/train_gaze_model.py --task <task_name>

Supported task names:

  • thread_needle
  • pour_test_tube
  • hook_package
  • slot_insertion
  • cube_transfer
  • peg_insertion

The resulting models will be pushed to Hugging Face under the appropriate task-specific repo.
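
For reference, here is a minimal sketch of what a UNet-style gaze predictor can look like, assuming a single-channel heatmap head with a soft-argmax readout (the repo's actual architecture and loss may differ):

# Minimal UNet-style gaze predictor (illustrative; not the repo's exact model).
# The network outputs a gaze heatmap; the 2D gaze point is its soft-argmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyGazeUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 16), conv_block(16, 32)
        self.dec1 = conv_block(32 + 16, 16)  # decoder sees upsampled features + skip
        self.head = nn.Conv2d(16, 1, 1)      # one-channel gaze heatmap

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        d1 = self.dec1(torch.cat([F.interpolate(e2, scale_factor=2.0), e1], dim=1))
        logits = self.head(d1)                                  # (B, 1, H, W)
        b, _, h, w = logits.shape
        probs = logits.flatten(1).softmax(dim=1).view(b, h, w)  # spatial distribution
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gaze_y = (probs.sum(dim=2) * ys).sum(dim=1)  # expected y in [-1, 1]
        gaze_x = (probs.sum(dim=1) * xs).sum(dim=1)  # expected x in [-1, 1]
        return torch.stack([gaze_x, gaze_y], dim=1)  # (B, 2)

model = TinyGazeUNet()
print(model(torch.rand(2, 3, 64, 64)).shape)  # torch.Size([2, 2])

A model of this shape can be trained with, for example, an MSE loss between the predicted gaze point and the recorded human gaze in the dataset.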
