36 commits
c1cc33d
Add_arms_dataset_training_pipeline
Lau-JW Apr 10, 2026
6855c47
Arms_pipeline_code
Lau-JW Apr 10, 2026
aedaff5
ROCm_requirements_update
Lau-JW Apr 10, 2026
1bc1d22
Split_posttrain_requirements
Lau-JW Apr 10, 2026
432d9c6
Fix_posttrain_install_notes
Lau-JW Apr 10, 2026
f869ffe
Add_arms_latent_extractor
Lau-JW Apr 10, 2026
8b5b201
Fix_extract_script_import_path
Lau-JW Apr 10, 2026
8868c52
Pin_huggingface_hub_lt1
Lau-JW Apr 10, 2026
9ca0f67
Fix_vae_streaming_chunk_padding
Lau-JW Apr 10, 2026
3796117
Fix_first_chunk_temporal_padding
Lau-JW Apr 10, 2026
aba79ed
Use_non_streaming_vae_encode_default
Lau-JW Apr 10, 2026
731d126
Fix_non_streaming_encoder_feat_cache
Lau-JW Apr 10, 2026
5f62d3a
Encode_full_clip_in_one_call
Lau-JW Apr 10, 2026
d004820
Add_temporal_downsample_fallback_for_vae
Lau-JW Apr 10, 2026
5dfab39
Fix_extractor_indentation
Lau-JW Apr 10, 2026
f3e86b4
Robustify_vae_temporal_retry
Lau-JW Apr 10, 2026
e6dc5ca
Try_vae_encode_api_first
Lau-JW Apr 10, 2026
77262fa
Avoid_lerobot_import_for_arms_train
Lau-JW Apr 10, 2026
aefb7a9
Allow_single_process_training_without_dist_env
Lau-JW Apr 10, 2026
770a22a
Disable_wandb_by_default_for_arms
Lau-JW Apr 10, 2026
e6a2a89
Set_arms_pretrained_checkpoint_path
Lau-JW Apr 10, 2026
2f35f24
Auto_create_empty_emb_from_latents
Lau-JW Apr 10, 2026
7f021cd
Add_snr_shift_defaults_for_arms
Lau-JW Apr 10, 2026
1c34b2c
Fix_cpu_latents_and_disable_workers
Lau-JW Apr 10, 2026
94a21b5
Guard_fsdp_gradient_sync_call
Lau-JW Apr 10, 2026
cfb3db1
Adjust_flex_attention_block_mask_to_lengths
Lau-JW Apr 10, 2026
2fb955e
Fix_action_loss_weight_shape_for_arms
Lau-JW Apr 10, 2026
d0cbc68
Enable_wandb_by_default_for_arms
Lau-JW Apr 10, 2026
5d453b6
Docs_sync_ROCm_arms_workflow
Lau-JW Apr 10, 2026
80495eb
Docs_add_issue_fix_changelog
Lau-JW Apr 10, 2026
9a53b9b
Fix_arms_OOM_by_cropping_latent_frames
Lau-JW Apr 10, 2026
6dd535b
Add_train_resume_flag_and_state
Lau-JW Apr 10, 2026
e3edb60
Fix_wandb_loss_charts_keys
Lau-JW Apr 10, 2026
0c98932
Fix_tqdm_step_counter_match_optimizer_step
Lau-JW Apr 11, 2026
7c8e2a9
Register_libero_train_and_doc_LeRobot_flow
Lau-JW Apr 11, 2026
25255ca
Update ROCM_LIBERO_SETUP.md
Lau-JW Apr 20, 2026
12 changes: 12 additions & 0 deletions .gitignore
@@ -0,0 +1,12 @@
prepared_arms/
arms/

# local outputs / artifacts
0_put_both_the_alphabet_soup_and_the_tomato_sauce_in_the_basket/
outputs/

# python cache
__pycache__/
**/__pycache__/
*.pyc

2 changes: 2 additions & 0 deletions Cursor
@@ -0,0 +1,2 @@
Placeholder file to satisfy tooling argument parsing.

164 changes: 164 additions & 0 deletions README.md
@@ -66,6 +66,170 @@ pip install websockets einops diffusers==0.36.0 transformers==4.55.2 accelerate
pip install flash-attn --no-build-isolation
```

---

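## Our changes (arms + ROCm)

This repository builds on upstream by adding training and data-preparation workflows for **AMD ROCm (MI300X)** and for the custom dual-arm, single-camera dataset **`arms/`**. It also completes the on-disk output of the remote LIBERO client (mp4/png/npz/action/joint).

### Quick start (MI300X / ROCm 7.2)

#### 1) Install dependencies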

```bash
python3 -m venv ~/venvs/lingbot-va
source ~/venvs/lingbot-va/bin/activate
python -m pip install -U pip

# For large wheels, disable the pip cache to avoid the "Memoryview is too large" error
PIP_NO_CACHE_DIR=1 pip install -r requirements.txt
```

#### 2) Prepare the arms data

```bash
python scripts/prepare_arms_dataset.py --arms-root ./arms --split train --out ./prepared_arms
```

#### 3) Download the checkpoint (lingbot-va-base)

```bash
pip install -U huggingface_hub
hf download --repo-type model robbyant/lingbot-va-base --local-dir /root/checkpoints/lingbot-va-base
```

#### 4) Extract VAE latents

```bash
python scripts/extract_arms_latents.py \
--dataset-root ./prepared_arms \
--ckpt-dir /root/checkpoints/lingbot-va-base \
--device cuda \
--dtype bfloat16 \
--height 256 --width 256
```

#### 5) Single-GPU fine-tuning (post-training)

```bash
export TORCHDYNAMO_DISABLE=1  # more stable; get a first run working before re-enabling
python -m wan_va.train --config-name arms_train --save-root ./train_out_arms
```

Running under `tmux` is recommended, so that a dropped connection does not interrupt training:

```bash
tmux new -s arms_train
# After launching the training command, press Ctrl+b then d to detach while it keeps running
```

### WandB (optional)

`arms_train` enables wandb by default; if the `WANDB_*` variables are missing, it automatically falls back to disabled.
To enable it, set:

```bash
export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_API_KEY="..."
export WANDB_TEAM_NAME="..."
export WANDB_PROJECT="va_arms"
```

### LIBERO (recording mp4/png/npz)

See `ROCM_LIBERO_SETUP.md` for details (it covers EGL/osmesa and where the client writes its output).

### LIBERO post-training (LeRobot format + latents)

Unlike `arms_train`, **the official pipeline uses a LeRobot latent dataset** (`MultiLatentLeRobotDataset`). The config name is **`libero_train`** (`wan_va/configs/va_libero_train_cfg.py`); cameras and resolution come from `va_libero_cfg`: `128×128`, two cameras `agentview_rgb` + `eye_in_hand_rgb`.

1. **Install** (on ROCm, install `lerobot` with `--no-deps` so that it does not touch torch):

```bash
pip install --no-deps -r requirements_posttrain.txt
pip install scipy wandb
```

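2. **Prepare the data**: the directory must contain a standard LeRobot dataset (a recursive search must find `meta/info.json`), with **`action_config`** segments in `episodes`. Under **each camera key**, provide latents aligned with `episodes`:

`latents/chunk-XXX/<obs_cam_key>/episode_XXXXXX_start_end.pth`

(The fields must match `ArmsLatentDataset` and the existing extraction script: `latent`, `frame_ids`, `text_emb`, `latent_num_frames`, `latent_height`, `latent_width`. This repository currently ships only `scripts/extract_arms_latents.py`. For LIBERO with two cameras at 128 resolution, you need to **run your own aligned extraction** following `va_libero_cfg.obs_cam_keys` and the LeRobot video paths, or find a ready-made LeRobot+latents dataset from upstream or the community.)

3. **`empty_emb.pt`**: place it at your chosen `empty_emb_path` (matching the value in `va_libero_train_cfg`), with the same shape as `text_emb` (you can generate it by applying `zeros_like` to the `text_emb` of any latent file).

4. **Edit the config**: in `va_libero_train_cfg.py`, set `dataset_path`, `empty_emb_path`, and `wan22_pretrained_model_name_or_path` (the same value as for arms works).

5. **Start training**: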

```bash
export TORCHDYNAMO_DISABLE=1
python -m wan_va.train --config-name libero_train --save-root ./train_out_libero
```
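The latent layout from step 2 can be checked before training with a small helper. This is a minimal sketch, not code from the repository: the function names are hypothetical, and it assumes a 3-digit chunk index, a 6-digit episode index, and plain integer `start`/`end` values in the file name.

```python
from pathlib import Path


def latent_path(dataset_root, chunk_index, cam_key, episode_index, start, end):
    """Build the expected latent file path for one episode segment.

    Assumptions (not confirmed by the repository): the chunk index is
    zero-padded to 3 digits, the episode index to 6 digits, and start/end
    are written as plain integers.
    """
    return (
        Path(dataset_root)
        / "latents"
        / f"chunk-{chunk_index:03d}"
        / cam_key
        / f"episode_{episode_index:06d}_{start}_{end}.pth"
    )


def missing_latents(dataset_root, cam_keys, segments):
    """Return the expected latent paths that are not present on disk.

    `segments` is an iterable of (chunk_index, episode_index, start, end).
    """
    missing = []
    for chunk_index, episode_index, start, end in segments:
        for cam_key in cam_keys:
            path = latent_path(
                dataset_root, chunk_index, cam_key, episode_index, start, end
            )
            if not path.is_file():
                missing.append(path)
    return missing
```

Calling `missing_latents(root, ["agentview_rgb", "eye_in_hand_rgb"], segments)` with the segments taken from `action_config` would list every file that still needs to be extracted.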

---

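## Today's issues and fixes (in the order they occurred)

This section records the errors you hit while running the workflow on the server today, what went wrong at each point, and which code we ultimately changed to get it running.

### A. LIBERO / recording and saving to disk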

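- **`ModuleNotFoundError: No module named 'wan_va'` (running the client)**
  - **Trigger**: running the file directly with `python evaluation/libero/client.py`, so Python does not put the repository root on the import path.
  - **Correct usage**: from the repository root, run `python -m evaluation.libero.client ...`, or temporarily set `PYTHONPATH=.`.

- **`pip install libero` reports "inconsistent version"**
  - **Cause**: the PyPI package of the same name has inconsistent metadata (the filename says 0.1.1, the metadata says 0.1.0).
  - **Correct install**: install from the official LIBERO source repository (`pip install -e ~/LIBERO`); see section 3 of `ROCM_LIBERO_SETUP.md`.

- **`AttributeError: 'NoneType' object has no attribute 'eglQueryString'` (robosuite/mujoco)**
  - **Cause**: headless EGL rendering is not set up correctly (missing system EGL/Mesa dependencies or environment variables).
  - **Fix**: install the EGL/Mesa dependencies plus `PyOpenGL-accelerate`, and run `export MUJOCO_GL=egl` and `export PYOPENGL_PLATFORM=egl` before launching.
  - **Fallback**: if EGL really is unavailable, use `--mujoco-gl osmesa` (slower, but it works).

- **"The remote machine cannot write mp4 / ffmpeg backend"**
  - **Fix**: `pip install "imageio[ffmpeg]"`. Keyframe PNGs are also kept, so the mp4 can be assembled locally with `ffmpeg`.

- **"I want to save mp4/png/npz (including action + joint)"**
  - **Code we changed**: `evaluation/libero/client.py`
  - **Added**: keyframe PNGs, a trajectory `.npz` (actions + joint/EEF/gripper + policy_chunks), and an mp4 video (falls back to gif on failure).
  - **Output location**: under the directory given by `--out-dir` (default `outputs/libero/...`); see section 7.3 of `ROCM_LIBERO_SETUP.md`.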
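For the EGL entry in this section, the same variables can be set from Python when a shell `export` is inconvenient. The sketch below is illustrative only: `configure_headless_gl` is a hypothetical helper, not part of `evaluation/libero/client.py`, and it has to run before `mujoco` or `robosuite` is imported so that the rendering backend is chosen from these values.

```python
import os


def configure_headless_gl(backend="egl"):
    """Set the rendering-backend variables for headless MuJoCo/PyOpenGL.

    Hypothetical helper: call it before importing mujoco or robosuite.
    """
    if backend not in ("egl", "osmesa"):
        raise ValueError(f"unsupported backend: {backend!r}")
    os.environ["MUJOCO_GL"] = backend
    os.environ["PYOPENGL_PLATFORM"] = backend
    return backend


# Prefer EGL; switch to "osmesa" (slower, software rendering) if EGL is unavailable.
configure_headless_gl("egl")
```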

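### B. Training on the arms dataset (dual-arm, single-camera)

- **Why extract latents before training?**
  - **Reason**: the training input is not raw RGB but latents encoded by the Wan2.2 VAE (this saves GPU memory, speeds up training, and matches pretraining).
  - **Script**: `scripts/extract_arms_latents.py`

- **`ModuleNotFoundError: No module named 'wan_va'` (latent extraction script)**
  - **Cause**: when the script is run directly, the import path does not include the repo root.
  - **Fix**: add `sys.path.insert(0, repo_root)` in the script (already done in `scripts/extract_arms_latents.py`).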
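The `sys.path` fix can be sketched as follows. The helper name and the `levels_up` parameter are assumptions for illustration; the actual script may compute `repo_root` differently.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file, levels_up=1):
    """Prepend the repository root to sys.path and return it.

    `levels_up` is the number of directories between the script's folder and
    the repo root (1 for a script stored in `<repo>/scripts/`).
    """
    repo_root = Path(script_file).resolve().parents[levels_up]
    root_str = str(repo_root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return repo_root
```

Calling `add_repo_root_to_path(__file__)` at the top of a script in `scripts/`, before `import wan_va`, lets the script be launched directly as `python scripts/<name>.py`.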

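- **VAE temporal shape errors (for example, conv3d kernel > input, or mismatched T)**
  - **Cause**: the Wan VAE is sensitive to temporal alignment; chunk size, stride, and even-length clips all affect it.
  - **Fix strategy (implemented)**: try the non-streaming `vae.encode(x)` first; on failure, retry automatically (`::2` downsampling, dropping or padding one frame, and so on).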
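The retry order can be sketched independently of the VAE. This is not the code in `scripts/extract_arms_latents.py`: the function names are hypothetical, `frames` is treated as a plain list indexed by time, and `encode` stands for any callable that raises when the clip length is unsuitable. For a real tensor, the same slices would be applied along the time axis.

```python
def temporal_variants(frames):
    """Yield (label, clip) candidates in the order they should be tried."""
    yield "full", frames
    yield "every_other_frame", frames[::2]
    if len(frames) > 1:
        yield "drop_last_frame", frames[:-1]
    if len(frames) > 0:
        yield "repeat_last_frame", list(frames) + [frames[-1]]


def encode_with_retry(encode, frames, errors=(RuntimeError, ValueError)):
    """Call `encode` on each variant until one succeeds.

    Returns (label, result) for the first variant that encodes without error,
    and raises RuntimeError listing every failure otherwise.
    """
    failures = []
    for label, clip in temporal_variants(frames):
        try:
            return label, encode(clip)
        except errors as exc:
            failures.append(f"{label}: {exc}")
    raise RuntimeError("all temporal variants failed: " + "; ".join(failures))
```

Returning the label alongside the result makes it possible to log which fallback was used for each clip, which matters because `::2` downsampling changes the effective frame rate of the stored latents.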

- **`ValueError: environment variable MASTER_ADDR expected, but not set` (single-GPU training)**
  - **Cause**: the training entry point initialized distributed mode unconditionally.
  - **Fix**: `wan_va/distributed/util.py`: skip `dist.init_process_group` when `world_size<=1`.
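The guard amounts to checking the world size before touching the process group. The sketch below is a hedged illustration rather than the contents of `wan_va/distributed/util.py`: the function names are hypothetical, and reading the size from the `WORLD_SIZE` environment variable (as set by launchers such as `torchrun`) is an assumption.

```python
import os


def get_world_size(env=None):
    """Read WORLD_SIZE from the environment, defaulting to a single process."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1"))


def maybe_init_distributed(init_fn, env=None):
    """Call `init_fn` only for multi-process launches; return whether it ran.

    `init_fn` is whatever sets up the process group, for example a lambda
    wrapping torch.distributed.init_process_group.
    """
    if get_world_size(env) <= 1:
        return False
    init_fn()
    return True
```

With this shape, a plain `python -m wan_va.train ...` launch never needs `MASTER_ADDR`, while a multi-process launch still initializes the process group as before.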

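- **`KeyError` / startup failure caused by missing wandb environment variables**
  - **Fix**: `wan_va/train.py`: when `WANDB_*` variables are missing, wandb is disabled automatically (even if the config sets it to True).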
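A minimal sketch of that fallback is shown below. It is not the code in `wan_va/train.py`: the function name is hypothetical, and treating all four variables from the WandB section as required is an assumption.

```python
import os

# Assumption: these are the variables listed in the WandB section above.
REQUIRED_WANDB_VARS = (
    "WANDB_BASE_URL",
    "WANDB_API_KEY",
    "WANDB_TEAM_NAME",
    "WANDB_PROJECT",
)


def resolve_wandb_enabled(config_enabled, env=None):
    """Return (enabled, missing_vars) for the current environment.

    wandb stays enabled only if the config asks for it and every required
    variable is set to a non-empty value.
    """
    env = os.environ if env is None else env
    if not config_enabled:
        return False, []
    missing = [name for name in REQUIRED_WANDB_VARS if not env.get(name)]
    return (not missing), missing
```

Returning the list of missing names lets the training script print one clear message instead of failing with a `KeyError` partway through startup.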

- **`FileNotFoundError: ./prepared_arms/empty_emb.pt`**
  - **Fix**: `wan_va/dataset/arms_latent_dataset.py`: automatically infer the `text_emb` shape from an existing latent file and generate `empty_emb.pt`.
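The auto-creation step can be sketched with plain PyTorch calls. This is an illustration under assumptions, not the code in `wan_va/dataset/arms_latent_dataset.py`: `ensure_empty_emb` is a hypothetical name, and the latent file is assumed to be a dict saved with `torch.save` that contains a `text_emb` tensor.

```python
from pathlib import Path

import torch


def ensure_empty_emb(latent_file, empty_emb_path):
    """Create a zero embedding shaped like `text_emb` if none exists yet.

    Loads on CPU so that the tensor never lands on a GPU inside a
    DataLoader worker, then returns the (possibly newly created) embedding.
    """
    empty_emb_path = Path(empty_emb_path)
    if not empty_emb_path.exists():
        record = torch.load(latent_file, map_location="cpu")
        empty = torch.zeros_like(record["text_emb"])
        empty_emb_path.parent.mkdir(parents=True, exist_ok=True)
        torch.save(empty, empty_emb_path)
    return torch.load(empty_emb_path, map_location="cpu")
```

Because the file is only written when it is missing, a hand-prepared `empty_emb.pt` is never overwritten.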

- **`Cannot re-initialize CUDA in forked subprocess` (DataLoader worker)**
  - **Cause**: the latent `.pth` files may contain CUDA tensors, or may be mapped to CUDA at load time, which triggers CUDA re-initialization after the worker forks.
  - **Fix**:
    - `scripts/extract_arms_latents.py`: call `.cpu()` before saving;
    - `wan_va/dataset/arms_latent_dataset.py`: `torch.load(..., map_location="cpu")`;
    - `wan_va/configs/va_arms_train_cfg.py`: default `load_worker=0`.

- **`flex_attention` `block_mask` length mismatch**
  - **Fix**: `wan_va/modules/model.py`: call `_adjust(q_len, kv_len)` on the mask to crop it to the actual lengths.


## ⚠️ Important: `attn_mode` Configuration
