Robbyant · Lau-JW · Apr 10, 2026
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,12 @@
+prepared_arms/
+arms/
+
+# local outputs / artifacts
+0_put_both_the_alphabet_soup_and_the_tomato_sauce_in_the_basket/
+outputs/
+
+# python cache
+__pycache__/
+**/__pycache__/
+*.pyc
+
diff --git a/Cursor b/Cursor
@@ -0,0 +1,2 @@
+Placeholder file to satisfy tooling argument parsing.
+
diff --git a/README.md b/README.md
@@ -66,6 +66,170 @@ pip install websockets einops diffusers==0.36.0 transformers==4.55.2 accelerate
 pip install flash-attn --no-build-isolation
 ```
 
+---
+
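+## Our changes (arms + ROCm)
+
+On top of upstream, this repository adds training and data-preparation workflows for **AMD ROCm (MI300X)** and for **`arms/`**, a custom dual-arm, single-camera dataset. It also completes the remote LIBERO client's ability to write outputs to disk (mp4/png/npz/action/joint).
+
+### Quick start (MI300X / ROCm 7.2)
+
+#### 1) Install dependencies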
+
+```bash
+python3 -m venv ~/venvs/lingbot-va
+source ~/venvs/lingbot-va/bin/activate
+python -m pip install -U pip
+
+# For large wheels, disable the pip cache to avoid the "Memoryview is too large" error
+PIP_NO_CACHE_DIR=1 pip install -r requirements.txt
+```
+
+#### 2) Prepare the arms data
+
+```bash
+python scripts/prepare_arms_dataset.py --arms-root ./arms --split train --out ./prepared_arms
+```
+
+#### 3) Download the checkpoint (lingbot-va-base)
+
+```bash
+pip install -U huggingface_hub
+hf download --repo-type model robbyant/lingbot-va-base --local-dir /root/checkpoints/lingbot-va-base
+```
+
+#### 4) Extract VAE latents
+
+```bash
+python scripts/extract_arms_latents.py \
+  --dataset-root ./prepared_arms \
+  --ckpt-dir /root/checkpoints/lingbot-va-base \
+  --device cuda \
+  --dtype bfloat16 \
+  --height 256 --width 256
+```
+
+#### 5) Single-GPU fine-tuning (post-training)
+
+```bash
+export TORCHDYNAMO_DISABLE=1   # more stable (get a first run working)
+python -m wan_va.train --config-name arms_train --save-root ./train_out_arms
+```
+
+Running inside `tmux` is recommended so that a dropped connection does not kill the job:
+
+```bash
+tmux new -s arms_train
+# After starting the training command, press Ctrl+b then d to detach; the job keeps running
+```
+
+### WandB (optional)
+
+`arms_train` enables wandb by default; if the `WANDB_*` variables are missing, wandb is disabled automatically.
+To enable it, set:
+
+```bash
+export WANDB_BASE_URL="https://api.wandb.ai"
+export WANDB_API_KEY="..."
+export WANDB_TEAM_NAME="..."
+export WANDB_PROJECT="va_arms"
+```
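+
+The automatic fallback described above can be pictured as a small environment check. The sketch below is an illustration only, not the code in `wan_va/train.py`: the helper name `resolve_wandb_enabled` is hypothetical, and it assumes that all four variables listed above are required.
+
+```python
+import os
+
+# Assumption: these are the variables the training entry point needs.
+REQUIRED_WANDB_VARS = (
+    "WANDB_BASE_URL",
+    "WANDB_API_KEY",
+    "WANDB_TEAM_NAME",
+    "WANDB_PROJECT",
+)
+
+
+def resolve_wandb_enabled(requested, env=None):
+    """Return (enabled, missing); wandb stays on only if every variable is set."""
+    env = os.environ if env is None else env
+    missing = [name for name in REQUIRED_WANDB_VARS if not env.get(name)]
+    return (requested and not missing), missing
+
+
+if __name__ == "__main__":
+    enabled, missing = resolve_wandb_enabled(requested=True)
+    if not enabled and missing:
+        print("wandb disabled; missing:", ", ".join(missing))
+```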
+
+### LIBERO (recording mp4/png/npz)
+
+See `ROCM_LIBERO_SETUP.md` for details (it covers EGL/osmesa and where the client writes its outputs).
+
+### LIBERO post-training (LeRobot format + latents)
+
+Unlike `arms_train`, **the official pipeline uses a LeRobot latent dataset** (`MultiLatentLeRobotDataset`) with the config name **`libero_train`** (`wan_va/configs/va_libero_train_cfg.py`; cameras and resolution are defined in `va_libero_cfg`: `128×128`, two cameras, `agentview_rgb` + `eye_in_hand_rgb`).
+
+1. **Install** (on ROCm, install `lerobot` with `--no-deps` so that it does not touch torch):
+
+```bash
+pip install --no-deps -r requirements_posttrain.txt
+pip install scipy wandb
+```
+
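+2. **Prepare the data**: the directory must contain a standard LeRobot dataset (a recursive search must find `meta/info.json`), with **`action_config`** segments in `episodes`. Under **each camera key**, provide latents aligned with `episodes`: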
+
+`latents/chunk-XXX/<obs_cam_key>/episode_XXXXXX_start_end.pth`
+
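+(The fields must match `ArmsLatentDataset` and the existing extraction script: `latent`, `frame_ids`, `text_emb`, `latent_num_frames`, `latent_height`, `latent_width`. This repository currently ships only `scripts/extract_arms_latents.py`; for LIBERO's two cameras at 128 resolution you need to **run your own aligned extraction** following `va_libero_cfg.obs_cam_keys` and the LeRobot video paths, or find a ready-made LeRobot+latents dataset upstream or in the community.)
+
+3. **`empty_emb.pt`**: place it at the `empty_emb_path` you specify (matching `va_libero_train_cfg`). Its shape must equal that of `text_emb`; you can generate it with `zeros_like` on the `text_emb` from any latent file.
+
+4. **Edit the config**: in `va_libero_train_cfg.py`, set `dataset_path`, `empty_emb_path`, and `wan22_pretrained_model_name_or_path` (the same value used for arms works).
+
+5. **Start training**: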
+
+```bash
+export TORCHDYNAMO_DISABLE=1
+python -m wan_va.train --config-name libero_train --save-root ./train_out_libero
+```
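+
+To make the latent layout from step 2 concrete, the sketch below builds one path of the form `latents/chunk-XXX/<obs_cam_key>/episode_XXXXXX_start_end.pth`. It rests on assumptions that are not stated in this README: the helper `latent_path` is hypothetical, the chunk index is taken to be `episode_index // chunks_size` with LeRobot's usual `chunks_size` of 1000, and `start`/`end` are taken to be the frame bounds of one `action_config` segment.
+
+```python
+from pathlib import Path
+
+
+def latent_path(dataset_root, cam_key, episode_index, start, end, chunks_size=1000):
+    """Build the expected latent file path for one episode segment and camera."""
+    chunk = episode_index // chunks_size  # assumption: LeRobot-style chunking
+    name = f"episode_{episode_index:06d}_{start}_{end}.pth"
+    return Path(dataset_root) / "latents" / f"chunk-{chunk:03d}" / cam_key / name
+
+
+if __name__ == "__main__":
+    # One file per camera key is expected for every segment.
+    for cam in ("agentview_rgb", "eye_in_hand_rgb"):
+        print(latent_path("./libero_lerobot", cam, 12, 0, 150))
+```
+
+Checking that each of these paths exists before launching training can surface a missing camera or a misnamed segment early.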
+
+---
+
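+## Today's problems and fixes (in the order they occurred)
+
+This section records the errors you hit today while running the workflow on the server, what went wrong at each point, and which code we changed to get it running.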
+
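+### A. LIBERO / recording to disk
+
+- **`ModuleNotFoundError: No module named 'wan_va'` (when running the client)**
+  - **Trigger**: running the file directly with `python evaluation/libero/client.py`, so Python does not put the repository root on the module search path.
+  - **Correct usage**: from the repository root, run `python -m evaluation.libero.client ...`, or set `PYTHONPATH=.` for that command.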
+
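+- **`pip install libero` fails with "inconsistent version"**
+  - **Cause**: the PyPI package of the same name has inconsistent metadata (the filename says 0.1.1, the metadata says 0.1.0).
+  - **Correct installation**: install from the official LIBERO source repository (`pip install -e ~/LIBERO`); see section 3 of `ROCM_LIBERO_SETUP.md`.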
+
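+- **`AttributeError: 'NoneType' object has no attribute 'eglQueryString'` (robosuite/mujoco)**
+  - **Cause**: headless EGL rendering is not set up (missing system EGL/Mesa dependencies or environment variables).
+  - **Fix**: install the EGL/Mesa dependencies plus `PyOpenGL-accelerate`, and run `export MUJOCO_GL=egl` and `export PYOPENGL_PLATFORM=egl` before launching.
+  - **Fallback**: if EGL really is unavailable, use `--mujoco-gl osmesa` (slower, but it works).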
+
+- **"The remote machine cannot write mp4 / ffmpeg backend"**
+  - **Fix**: `pip install "imageio[ffmpeg]"`; the PNG keyframes are also kept, so the mp4 can be assembled locally with `ffmpeg`.
+
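+- **"I want to save mp4/png/npz (including action + joint)"**
+  - **Code we changed**: `evaluation/libero/client.py`
+    - **Added**: keyframe PNGs, a trajectory `.npz` (actions + joint/EEF/gripper + policy_chunks), and an mp4 video (falls back to gif on failure).
+  - **Output location**: under the directory given by `--out-dir` (default `outputs/libero/...`); see section 7.3 of `ROCM_LIBERO_SETUP.md`.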
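+
+As an illustration of the EGL/osmesa settings above, the two variables can be set together before any rendering library is imported. This is a minimal sketch: the helper `configure_headless_gl` is hypothetical and is not claimed to be how the client's `--mujoco-gl` option is implemented.
+
+```python
+import os
+
+
+def configure_headless_gl(backend="egl", env=None):
+    """Set MUJOCO_GL and PYOPENGL_PLATFORM to the same headless backend."""
+    if backend not in ("egl", "osmesa"):
+        raise ValueError(f"unsupported headless backend: {backend!r}")
+    env = os.environ if env is None else env
+    env["MUJOCO_GL"] = backend
+    env["PYOPENGL_PLATFORM"] = backend
+    return env
+
+
+if __name__ == "__main__":
+    # Call this before importing mujoco or robosuite, which read the
+    # variables when they are first imported.
+    configure_headless_gl("egl")
+    print(os.environ["MUJOCO_GL"], os.environ["PYOPENGL_PLATFORM"])
+```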
+
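+### B. Training on the arms dataset (dual-arm, single-camera)
+
+- **Why extract latents before training?**
+  - **Reason**: the training input is not raw RGB but latents encoded by the Wan2.2 VAE (this saves GPU memory, speeds up training, and matches pretraining).
+  - **Script**: `scripts/extract_arms_latents.py`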
+
+- **`ModuleNotFoundError: No module named 'wan_va'` (latent extraction script)**
+  - **Cause**: when the script is run directly, the import path does not include the repo root.
+  - **Fix**: add `sys.path.insert(0, repo_root)` in the script (already done in `scripts/extract_arms_latents.py`).
+
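+- **VAE temporal shape errors (e.g. conv3d kernel > input, or a mismatched T)**
+  - **Cause**: the Wan VAE is sensitive to alignment along the time dimension; chunking, stride, and even-length clips can all trigger the error.
+  - **Fix strategy (implemented)**: try the non-streaming `vae.encode(x)` first; on failure, retry automatically (`::2` downsampling, dropping or padding one frame, and so on).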
+
+- **`ValueError: environment variable MASTER_ADDR expected, but not set` (single-GPU training)**
+  - **Cause**: the training entry point initializes distributed mode unconditionally.
+  - **Fix**: `wan_va/distributed/util.py`: skip `dist.init_process_group` when `world_size<=1`.
+
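+- **`KeyError` / startup failure caused by missing wandb environment variables**
+  - **Fix**: `wan_va/train.py`: when `WANDB_*` variables are missing, wandb is turned off automatically (even if the config sets it to True).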
+
+- **`FileNotFoundError: ./prepared_arms/empty_emb.pt`**
+  - **Fix**: `wan_va/dataset/arms_latent_dataset.py`: infer the `text_emb` shape from an existing latent file and generate `empty_emb.pt` automatically.
+
+- **`Cannot re-initialize CUDA in forked subprocess` (DataLoader worker)**
+  - **Cause**: the latent `.pth` files may contain CUDA tensors, or loading may map them onto CUDA; a forked worker then triggers CUDA re-initialization.
+  - **Fix**:
+    - `scripts/extract_arms_latents.py`: call `.cpu()` before saving;
+    - `wan_va/dataset/arms_latent_dataset.py`: `torch.load(..., map_location="cpu")`;
+    - `wan_va/configs/va_arms_train_cfg.py`: default `load_worker=0`.
+
+- **`flex_attention` `block_mask` length mismatch**
+  - **Fix**: `wan_va/modules/model.py`: call `_adjust(q_len, kv_len)` on the mask to crop it to the matching size.
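+
+The retry strategy for the VAE temporal shape errors above can be sketched as an ordered list of fallbacks. This is an illustration under stated assumptions, not the code in `scripts/extract_arms_latents.py`: `encode_with_fallbacks` is a hypothetical helper, `frames` is a plain Python list standing in for the frame tensor, and `encode` is any callable that raises `RuntimeError` or `ValueError` on a bad clip length.
+
+```python
+def encode_with_fallbacks(frames, encode):
+    """Try the full clip first, then progressively adjusted variants."""
+    candidates = [
+        ("full clip", frames),
+        ("every second frame", frames[::2]),
+        ("last frame dropped", frames[:-1]),
+        ("last frame repeated", frames + frames[-1:]),
+    ]
+    errors = []
+    for label, clip in candidates:
+        if not clip:
+            continue  # nothing left to encode for this variant
+        try:
+            return label, encode(clip)
+        except (RuntimeError, ValueError) as exc:
+            errors.append(f"{label}: {exc}")
+    raise RuntimeError("all encode attempts failed: " + "; ".join(errors))
+
+
+if __name__ == "__main__":
+    def toy_encode(clip):
+        # Stand-in encoder: accepts only lengths of the form 4k + 1.
+        if len(clip) % 4 != 1:
+            raise ValueError(f"unsupported length {len(clip)}")
+        return len(clip)
+
+    print(encode_with_fallbacks(list(range(10)), toy_encode))
+```
+
+Returning the label alongside the result makes it possible to log which variant succeeded, which matters because downsampled or trimmed clips change the `frame_ids` that belong to the latent.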
+
 
 ## ⚠️ Important: `attn_mode` Configuration