# Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers
🚀 Project Website: https://ian-chuang.github.io/gaze-av-aloha/
This repository contains the official code for our paper: "Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers"
We propose a human-inspired foveated vision framework for robot learning that combines human gaze, foveated ViTs, and robotic control to enable policies that are both efficient and robust. Our approach reduces ViT computation by 94%, accelerating training by 7× and inference by 3×.
We collect bimanual robot demonstrations with synchronized human eye-tracking using the AV-ALOHA simulation platform. This repository provides code and instructions for installation, dataset preparation, model training, policy evaluation, and data collection.
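To make the idea of foveation concrete, here is a minimal sketch (not the paper's actual tokenizer; the function name, crop sizes, and sampling scheme are illustrative assumptions): keep a high-resolution crop around the gaze point and a coarse, subsampled view of the full image, which shrinks the total pixel budget the ViT must process.

```python
# Illustrative sketch of foveated image sampling (NOT the paper's exact
# method): a sharp crop at the gaze point plus a low-res periphery.
import numpy as np

def foveate(image: np.ndarray, gaze_xy: tuple[int, int],
            fovea: int = 64, periphery: int = 32) -> dict:
    """Split an HxWxC image into a sharp fovea crop and a coarse periphery."""
    h, w, _ = image.shape
    x, y = gaze_xy
    half = fovea // 2
    # Clamp the crop window so it stays fully inside the image.
    x0 = min(max(x - half, 0), w - fovea)
    y0 = min(max(y - half, 0), h - fovea)
    fovea_crop = image[y0:y0 + fovea, x0:x0 + fovea]
    # Cheap periphery: strided subsampling down to periphery x periphery.
    ys = np.linspace(0, h - 1, periphery).astype(int)
    xs = np.linspace(0, w - 1, periphery).astype(int)
    periphery_view = image[np.ix_(ys, xs)]
    return {"fovea": fovea_crop, "periphery": periphery_view}

img = np.zeros((240, 320, 3), dtype=np.uint8)
out = foveate(img, gaze_xy=(160, 120))
# Pixel budget drops from 240*320 = 76,800 to 64*64 + 32*32 = 5,120.
```

With these example sizes the encoder sees roughly 7% of the original pixels, which is the intuition behind the large ViT compute savings reported above.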
## Citation

If you find our work useful, please cite:

```bibtex
@misc{chuang2025lookfocusactefficient,
      title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
      author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
      year={2025},
      eprint={2507.15833},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.15833},
}
```

## Datasets

The table below lists all available AV-ALOHA simulation datasets with human eye-tracking annotations. Each dataset includes 100–200 episodes and a link for interactive visualization.
| Dataset | Eye Data | Episodes | Visualization |
|---|---|---|---|
| AV ALOHA Sim Peg Insertion | ✅ | 100 | View |
| AV ALOHA Sim Cube Transfer | ✅ | 200 | View |
| AV ALOHA Sim Thread Needle | ✅ | 200 | View |
| AV ALOHA Sim Pour Test Tube | ✅ | 100 | View |
| AV ALOHA Sim Hook Package | ✅ | 100 | View |
| AV ALOHA Sim Slot Insertion | ✅ | 100 | View |
## Installation

Follow the steps below to set up the environment and install all necessary dependencies:
```bash
# Clone the repository and initialize submodules
git clone https://github.com/ian-chuang/gaze-av-aloha.git
cd gaze-av-aloha
git submodule init
git submodule update

# Create and activate a new Conda environment
conda create -n gaze python=3.10
conda activate gaze

# Install LeRobot
pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46

# Install FFmpeg for video logging
conda install ffmpeg=7.1.1 -c conda-forge

# Install AV-ALOHA packages
pip install -e ./gym_av_aloha
pip install -e ./gaze_av_aloha
```

Make sure you're logged in to both Weights & Biases and Hugging Face:
```bash
wandb login
huggingface-cli login
```

## Dataset Preparation

We use the LeRobot dataset format for ease of sharing and visualization via Hugging Face.
However, LeRobot's dataloader can be slow, so we convert each dataset into a custom AVAlohaDataset format based on Zarr for faster access during training.
Available datasets:

- `iantc104/av_aloha_sim_cube_transfer`
- `iantc104/av_aloha_sim_peg_insertion`
- `iantc104/av_aloha_sim_slot_insertion`
- `iantc104/av_aloha_sim_hook_package`
- `iantc104/av_aloha_sim_pour_test_tube`
- `iantc104/av_aloha_sim_thread_needle`
To convert a dataset to Zarr format, run the following command from the project root:
```bash
python gym_av_aloha/scripts/convert_lerobot_to_avaloha.py --repo_id <dataset_repo_id>
```

For example:
```bash
python gym_av_aloha/scripts/convert_lerobot_to_avaloha.py --repo_id iantc104/av_aloha_sim_thread_needle
```

Converted datasets will be saved under `gym_av_aloha/outputs/`.
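The speedup from this conversion comes from a general pattern that Zarr-style storage enables: frames are stored as flat, chunked arrays with an episode index, so any frame is a single array slice instead of a sequential video decode. A minimal sketch of that idea (a hypothetical layout, not the actual AVAlohaDataset internals):

```python
# Sketch of a flat, array-backed dataset layout (hypothetical; see
# gym_av_aloha for the real AVAlohaDataset). Episodes are concatenated
# into one contiguous array plus a cumulative-length index, so random
# access to any (episode, timestep) pair is O(1).
import numpy as np

episodes = [np.random.rand(n, 14).astype(np.float32) for n in (50, 80, 60)]
states = np.concatenate(episodes, axis=0)     # (190, 14) flat store
ends = np.cumsum([len(e) for e in episodes])  # [50, 130, 190]

def get_frame(ep: int, t: int) -> np.ndarray:
    """Random access to frame t of episode ep via a single slice."""
    start = 0 if ep == 0 else ends[ep - 1]
    return states[start + t]

frame = get_frame(1, 3)   # same as episodes[1][3]
```

Zarr adds chunked, compressed on-disk storage on top of this pattern, which is what makes shuffled minibatch sampling during training fast.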
## Training

Train and evaluate policies using `train.py`.
The exact commands used in our simulation experiments are provided in `experiments.txt`.
Pretrained weights are available and can be loaded via Hydra configuration.
Pretrained gaze models:

- `iantc104/gaze_model_av_aloha_sim_cube_transfer`
- `iantc104/gaze_model_av_aloha_sim_peg_insertion`
- `iantc104/gaze_model_av_aloha_sim_slot_insertion`
- `iantc104/gaze_model_av_aloha_sim_hook_package`
- `iantc104/gaze_model_av_aloha_sim_pour_test_tube`
- `iantc104/gaze_model_av_aloha_sim_thread_needle`
Available tasks:

- `av_aloha_sim_cube_transfer`
- `av_aloha_sim_peg_insertion`
- `av_aloha_sim_slot_insertion`
- `av_aloha_sim_hook_package`
- `av_aloha_sim_pour_test_tube`
- `av_aloha_sim_thread_needle`
**Fov-Act** (end-to-end gaze as action):
```bash
python gaze_av_aloha/scripts/train.py \
    policy=foveated_vit_policy \
    task=<task e.g. av_aloha_sim_thread_needle> \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<project name> \
    wandb.entity=<your wandb entity> \
    wandb.job_name=fov-act \
    device=cuda
```

**Fov-UNet** (two-stage with pretrained gaze model):
```bash
python gaze_av_aloha/scripts/train.py \
    policy=foveated_vit_policy \
    task=<task e.g. av_aloha_sim_thread_needle> \
    policy.use_gaze_as_action=false \
    policy.gaze_model_repo_id=<gaze model e.g. iantc104/gaze_model_av_aloha_sim_thread_needle> \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<project name> \
    wandb.entity=<your wandb entity> \
    wandb.job_name=fov-unet \
    device=cuda
```

**Fine** (full-res ViT baseline):
```bash
python gaze_av_aloha/scripts/train.py \
    policy=vit_policy \
    task=<task e.g. av_aloha_sim_thread_needle> \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<project name> \
    wandb.entity=<your wandb entity> \
    wandb.job_name=fine \
    device=cuda
```

**Coarse** (low-res ViT baseline):
```bash
python gaze_av_aloha/scripts/train.py \
    policy=low_res_vit_policy \
    task=<task e.g. av_aloha_sim_thread_needle> \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_low_res_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<project name> \
    wandb.entity=<your wandb entity> \
    wandb.job_name=coarse \
    device=cuda
```

## Data Collection

To collect simulation data using AV-ALOHA:
- Install the AV-ALOHA Unity App on your Meta Quest Pro headset: 👉 AV-ALOHA Unity App
- Follow the instructions in `gym_av_aloha/README.md` for detailed steps on data collection.
## Pretrained Vision Transformers

We provide MAE-pretrained Vision Transformers used for:
- Foveated
- Fine (Full-Res)
- Coarse (Low-Res)
Details and training scripts are located in 📄 `pretrain/README.md`.
## Gaze Model Training

Train a simple UNet-based model for gaze prediction using the script:

```bash
python gaze_av_aloha/scripts/train_gaze_model.py --task <task_name>
```

Supported task names:
- `thread_needle`
- `pour_test_tube`
- `hook_package`
- `slot_insertion`
- `cube_transfer`
- `peg_insertion`
The resulting models will be pushed to Hugging Face under the appropriate task-specific repo.
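A gaze predictor like this typically outputs a spatial heatmap that must be reduced to 2D gaze coordinates. One common way to do that is a differentiable soft-argmax; the sketch below illustrates the idea with NumPy (the actual output head in this repo may differ, and `soft_argmax` is a hypothetical helper, not a function from the codebase):

```python
# Hedged sketch: reducing a predicted gaze heatmap to (x, y) pixel
# coordinates with a soft-argmax (expected location under a softmax).
import numpy as np

def soft_argmax(heatmap: np.ndarray) -> tuple[float, float]:
    """Expected (x, y) location under the softmax of an HxW heatmap."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())  # stabilized softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * xs).sum()), float((p * ys).sum())

hm = np.full((48, 64), -10.0)
hm[20, 30] = 10.0            # sharp peak at column x=30, row y=20
x, y = soft_argmax(hm)       # close to (30.0, 20.0)
```

Unlike a hard argmax, this reduction stays differentiable, which is why it is a popular choice when a gaze point must be trained end-to-end from a heatmap.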
