unilabsim · TATP-233 · Jun 29, 2026
@@ -1,8 +1,10 @@
 # Multi-GPU
 
-The currently validated multi-GPU training path is SAC in replay-buffer mode.
-Use the unified CLI as usual, and enable multiple GPUs with the shared
-off-policy field `training.num_gpus`.
+The currently validated multi-GPU training paths are SAC/FastSAC and FlashSAC
+in replay-buffer mode. Use the unified CLI as usual, and enable multiple GPUs
+with the shared off-policy field `training.num_gpus`. The multi-GPU runner is a
+generic off-policy orchestration layer, but each learner must explicitly opt
+into the distributed learner contract.
 
 The multi-GPU runner keeps algorithm code separate from IPC: a collector fills
 the CPU replay buffer on the host, the runner packs batches for each learner
@@ -27,7 +29,8 @@ this avoids extending communication to AdamW momentum state.
 When `algo.obs_normalization=true`, each learner rank updates its observation
 normalizer from cross-rank global batch moments; rank 0 publishes the matching
 mean/std to the CPU collector at the same synchronization point as actor
-weights.
+weights. For FlashSAC, reward normalization is updated in replay order on rank 0
+only, and the normalizer state is broadcast to the other ranks before learner updates.
 
 For strict per-update gradient averaging, set
 `training.multi_gpu_sync_mode=sync_sgd`. That mode is closer to single-GPU
@@ -53,7 +56,9 @@ single-GPU `algo.batch_size=8192` corresponds to two-GPU
 
 ## Preconditions
 
-- SAC only: `training.num_gpus > 1` rejects TD3, FlashSAC, PPO, MLX PPO, and APPO.
+- FastSAC and FlashSAC learners support this path; `training.num_gpus > 1`
+  rejects TD3, PPO, MLX PPO, APPO, and custom SAC runtimes until their learners
+  declare support.
 - CUDA is required; select physical cards with `CUDA_VISIBLE_DEVICES`.
 - SAC symmetry augmentation is not supported in multi-GPU mode. If the task
   owner enables it by default, set `algo.use_symmetry=false`.
@@ -95,6 +100,16 @@ CUDA_VISIBLE_DEVICES=0,7 uv run train --algo sac --task g1_walk_flat --sim mujoc
 
 Logs still use SAC's default directory: `logs/fast_sac/<TaskName>/`.
 
+FlashSAC uses the same knobs:
+
+```bash
+uv run train --algo flashsac --task g1_walk_flat --sim mujoco \
+  training.num_gpus=2 \
+  training.multi_gpu_sync_mode=local_sgd
+```
+
+FlashSAC logs still use `logs/flash_sac/<TaskName>/`.
+
 ## Performance Checks
 
 Multi-GPU mainly targets learner update bottlenecks. The collector is still one
@@ -116,7 +131,8 @@ compare steady-state `perf/iter_ms`, `timing/learner_train_ms`,
 
 ## Common Errors
 
-- `Only SAC supports training.num_gpus > 1`: only SAC is validated right now.
+- `<Learner> does not support training.num_gpus > 1`: that learner has not
+  declared and validated the multi-GPU contract yet.
 - `SAC multi-GPU training requires a CUDA device`: CUDA is unavailable, or
   `training.device` was set to CPU.
 - `set training.num_gpus=1 or algo.use_symmetry=false`: multi-GPU SAC does not

@@ -11,10 +11,12 @@ The off-policy runner decouples CPU simulation from GPU learning through shared
 memory: a collector subprocess fills a CPU-resident replay buffer while the
 learner trains on the GPU.
 
-SAC is also the currently validated replay-buffer multi-GPU algorithm. Enable it
-with `training.num_gpus > 1`; the host side packs and distributes batches in
-parallel, while the GPU learners default to delayed parameter averaging via
-`training.multi_gpu_sync_mode=local_sgd`. See
+The default FastSAC learner is also the currently validated replay-buffer
+multi-GPU SAC implementation. Enable it with `training.num_gpus > 1`; the host
+side packs and distributes batches in parallel, while the GPU learners default
+to delayed parameter averaging via `training.multi_gpu_sync_mode=local_sgd`.
+Custom SAC runtimes must explicitly declare the distributed learner contract
+before they can use this path. See
 {doc}`../1-training/4-multi_gpu` for the full command, strict-sync fallback, and
 constraints.
 

@@ -30,7 +30,17 @@ playback video. See {doc}`/en/1-getting_started/3-evaluation_and_playback`.
 - `algo.algo_params.actor_num_blocks=2`
 - `algo.algo_params.critic_num_blocks=2`
 
-`scripts/train_offpolicy.py` rejects `training.num_gpus > 1` for FlashSAC, so
-keep the default single-GPU path unless the implementation changes.
+FlashSAC supports the shared off-policy multi-GPU runner. Enable it with:
+
+```bash
+uv run train --algo flashsac --task g1_walk_flat --sim mujoco \
+  training.num_gpus=2 \
+  training.multi_gpu_sync_mode=local_sgd
+```
+
+Multi-GPU FlashSAC requires CUDA and synchronized collection. The learner owns
+its distributed synchronization hooks: gradients are averaged in `sync_sgd`,
+parameters and persistent normalization buffers are averaged in `local_sgd`, and
+reward normalizer state is updated on rank 0 then broadcast to the other ranks.
 
 The log root is `logs/flash_sac/<task>/`.
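
The three synchronization behaviors described in the FlashSAC hunk above can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: ranks are modeled as plain Python lists, and `all_reduce_mean`, `broadcast_from_rank0`, and `RunningStat` are hypothetical names invented for illustration. None of them come from the unilab code base, and the real learner presumably uses `torch.distributed` collectives instead.

```python
from dataclasses import dataclass, replace


@dataclass
class RunningStat:
    """Toy running mean/variance tracker (Welford's algorithm)."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)


def all_reduce_mean(per_rank: list[list[float]]) -> list[list[float]]:
    """Give every simulated rank the element-wise mean over all ranks."""
    n = len(per_rank)
    mean = [sum(column) / n for column in zip(*per_rank)]
    return [list(mean) for _ in per_rank]


def broadcast_from_rank0(per_rank: list[RunningStat]) -> list[RunningStat]:
    """Give every simulated rank an independent copy of rank 0's state."""
    return [replace(per_rank[0]) for _ in per_rank]


# sync_sgd: gradients are averaged across ranks on every update.
grads = [[1.0, 2.0], [3.0, 6.0]]
synced_grads = all_reduce_mean(grads)

# local_sgd: each rank steps on its own, then parameters are averaged.
params = [[0.0, 1.0], [2.0, 3.0]]
synced_params = all_reduce_mean(params)

# Reward normalizer: only rank 0 consumes rewards (in replay order),
# then its state is copied to the other ranks before they train.
stats = [RunningStat(), RunningStat()]
for reward in (1.0, 2.0, 3.0):
    stats[0].update(reward)
stats = broadcast_from_rank0(stats)
```

Keeping the reward normalizer update on a single rank means every rank normalizes with the same statistics, instead of each rank drifting on its own shard of rewards.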
@@ -1,8 +1,9 @@
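 # Multi-GPU
 
-The currently validated multi-GPU training path is SAC in replay-buffer mode. The entry point is
-still the unified CLI, `uv run train --algo sac ...`, and multiple GPUs are enabled with the
-shared off-policy config field `training.num_gpus`.
+The currently validated multi-GPU training paths are SAC/FastSAC and FlashSAC in replay-buffer
+mode. The entry point is still the unified CLI, and multiple GPUs are enabled with the shared
+off-policy config field `training.num_gpus`. The multi-GPU runner is a generic off-policy
+orchestration layer, but each learner must explicitly declare support through the distributed learner contract.
 
 The multi-GPU runner keeps algorithm code isolated from IPC: the collector fills the CPU replay buffer on the host,
 the runner packs batches as requested by each learner rank and distributes them in parallel through a pinned-memory pipeline
@@ -22,7 +23,8 @@ learner iterations sync once; increasing this value can further reduce 4-GPU and 8-GPU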
 
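 When `algo.obs_normalization=true`, each learner rank updates its observation normalizer from
 global batch moments aggregated across ranks; rank 0 publishes the matching mean/std to the CPU
-collector at the same synchronization point as the actor weights.
+collector at the same synchronization point as the actor weights. FlashSAC reward normalization is still
+updated by rank 0 in replay write order, and the normalizer state is broadcast to the other ranks before learner updates.
 
 For strict per-update gradient averaging, explicitly set
 `training.multi_gpu_sync_mode=sync_sgd`. That mode is closer to single-GPU global-batch synchronous
@@ -44,7 +46,8 @@ batch**, not a global batch across all GPUs. With `training.num_gpus=N`, each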
 
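 ## Preconditions
 
-- SAC only: `training.num_gpus > 1` rejects TD3, FlashSAC, PPO, MLX PPO, and APPO.
+- FastSAC and FlashSAC learners support this path; `training.num_gpus > 1` rejects TD3, PPO,
+  MLX PPO, APPO, and custom SAC runtimes that have not yet declared this capability.
 - A CUDA device is required; select physical cards with `CUDA_VISIBLE_DEVICES`.
 - SAC symmetry augmentation is not currently supported in multi-GPU mode; if the task owner enables it by default, set
   `algo.use_symmetry=false`.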
@@ -85,6 +88,16 @@ CUDA_VISIBLE_DEVICES=0,7 uv run train --algo sac --task g1_walk_flat --sim mujoc
 
 Logs are still written to SAC's default directory: `logs/fast_sac/<TaskName>/`.
 
+FlashSAC uses the same set of multi-GPU parameters:
+
+```bash
+uv run train --algo flashsac --task g1_walk_flat --sim mujoco \
+  training.num_gpus=2 \
+  training.multi_gpu_sync_mode=local_sgd
+```
+
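+FlashSAC logs are still written to `logs/flash_sac/<TaskName>/`.
+
 ## Performance Checks
 
 Multi-GPU mainly relieves learner update bottlenecks. The collector is still a single CPU process, so
@@ -103,7 +116,8 @@ ring-buffer window, letting the CPU random gather overlap with the next env step, while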
 
 ## Common Errors
 
-- `Only SAC supports training.num_gpus > 1`: only SAC is validated right now.
+- `<Learner> does not support training.num_gpus > 1`: that learner has not yet declared and
+  validated the multi-GPU contract.
 - `SAC multi-GPU training requires a CUDA device`: no CUDA device is available, or
   `training.device` was set to CPU.
 - `set training.num_gpus=1 or algo.use_symmetry=false`: multi-GPU SAC does not yet support symmetry augmentation

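@@ -9,10 +9,11 @@ SAC is selected through the shared off-policy entry point `scripts/train_offpolicy.py`, TD3
 The off-policy runner decouples CPU simulation from GPU learning through shared memory: a collector subprocess
 fills a CPU-resident replay buffer while the learner trains on the GPU.
 
-SAC is also the currently validated replay-buffer multi-GPU training algorithm. Multi-GPU mode is enabled with
+The default FastSAC learner is also the currently validated replay-buffer multi-GPU SAC implementation. Multi-GPU mode is enabled with
 `training.num_gpus > 1`; the host side packs and distributes batches in parallel, while the learners on the GPUs
-default to delayed-sync parameter averaging via `training.multi_gpu_sync_mode=local_sgd`. For the full
-command, strict-sync fallback, and constraints, see {doc}`../1-training/4-multi_gpu`.
+default to delayed-sync parameter averaging via `training.multi_gpu_sync_mode=local_sgd`. Custom
+SAC runtimes must explicitly declare the distributed learner contract before they can use this path. For the full command,
+strict-sync fallback, and constraints, see {doc}`../1-training/4-multi_gpu`.
 
 ## Quick Start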
 

@@ -28,7 +28,16 @@ uv run train --algo flashsac --task go2_joystick_flat --sim mujoco training.no_p
 - `algo.algo_params.actor_num_blocks=2`
 - `algo.algo_params.critic_num_blocks=2`
 
-`scripts/train_offpolicy.py` rejects `training.num_gpus > 1` for FlashSAC, so unless the
-implementation changes, keep the default single-GPU path.
+FlashSAC supports the shared off-policy multi-GPU runner. Usage:
+
+```bash
+uv run train --algo flashsac --task g1_walk_flat --sim mujoco \
+  training.num_gpus=2 \
+  training.multi_gpu_sync_mode=local_sgd
+```
+
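+Multi-GPU FlashSAC requires CUDA and synchronized collection. The learner owns its distributed synchronization hooks:
+gradients are synchronized in `sync_sgd`, parameters and persistent normalization buffers are averaged in `local_sgd`, and
+the reward normalizer is updated by rank 0 in replay write order and then broadcast to the other ranks.
 
 The log root is `logs/flash_sac/<task>/`.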
@@ -178,8 +178,6 @@ def build_runner(algo_name: str, cfg: DictConfig):
     num_gpus = int(getattr(cfg.training, "num_gpus", 1))
     multi_gpu_sync_mode = str(getattr(cfg.training, "multi_gpu_sync_mode", "local_sgd"))
     multi_gpu_sync_interval = int(getattr(cfg.training, "multi_gpu_sync_interval", 1))
-    if num_gpus > 1 and algo_name != "sac":
-        raise ValueError("Only SAC supports training.num_gpus > 1 in this validation round")
 
     sync_collection = not bool(cfg.training.no_sync_collection)
 
@@ -270,15 +268,25 @@ def build_runner(algo_name: str, cfg: DictConfig):
             "critic_obs_dim": _critic_dim,
             **_learner_extra_kwargs,
         }
-        _learner = _learner_cls(device=_device, **_learner_kwargs)
 
         if num_gpus > 1:
+            from unilab.algos.torch.offpolicy.distributed import (
+                validate_distributed_learner_capability,
+            )
             from unilab.algos.torch.offpolicy.multi_gpu_runner import MultiGPUOffPolicyRunner
 
             if not str(_device).startswith("cuda"):
-                raise ValueError("SAC multi-GPU training requires a CUDA device")
+                raise ValueError(f"{_algo_type} multi-GPU training requires a CUDA device")
             if not sync_collection:
                 raise ValueError("Multi-GPU off-policy replay requires synchronized collection")
+            validate_distributed_learner_capability(
+                algo_type=_algo_type,
+                learner_cls=_learner_cls,
+                learner_kwargs=_learner_kwargs,
+                num_gpus=num_gpus,
+                sync_mode=multi_gpu_sync_mode,
+            )
+            _learner = _learner_cls(device=_device, **_learner_kwargs)
             return MultiGPUOffPolicyRunner(
                 learner=_learner,
                 env_name=cfg.training.task_name,
@@ -312,6 +320,7 @@ def build_runner(algo_name: str, cfg: DictConfig):
                 nan_guard_cfg=_nan_guard_cfg,
             )
 
+        _learner = _learner_cls(device=_device, **_learner_kwargs)
         return DoubleBufferOffPolicyRunner(
             learner=_learner,
             env_name=cfg.training.task_name,
@@ -344,11 +353,21 @@ def build_runner(algo_name: str, cfg: DictConfig):
     if algo_name == "td3":
         from unilab.algos.torch.common.device import get_env_dims
         from unilab.algos.torch.fast_td3.learner import FastTD3Learner
+        from unilab.algos.torch.offpolicy.distributed import (
+            validate_distributed_learner_capability,
+        )
         from unilab.algos.torch.offpolicy.double_buffer_runner import (
             DoubleBufferOffPolicyRunner,
         )
         from unilab.utils.device import get_default_device
 
+        validate_distributed_learner_capability(
+            learner_cls=FastTD3Learner,
+            algo_type="td3",
+            learner_kwargs={},
+            num_gpus=num_gpus,
+            sync_mode=multi_gpu_sync_mode,
+        )
         _device = cfg.training.device or get_default_device()
         _obs_dim, _action_dim, _critic_dim = get_env_dims(
             cfg.training.task_name,

@@ -378,6 +378,10 @@ class FastSACLearner:
     - Distributional critic (C51, num_atoms=101)
     """
 
+    supports_multi_gpu = True
+    supports_multi_gpu_symmetry = False
+    supported_multi_gpu_sync_modes = frozenset({"sync_sgd", "local_sgd"})
+
     def __init__(
         self,
         obs_dim: int,

@@ -134,6 +134,10 @@ class FastTD3Learner:
     - Observation normalization
     """
 
+    supports_multi_gpu = False
+    supports_multi_gpu_symmetry = False
+    supported_multi_gpu_sync_modes: frozenset[str] = frozenset()
+
     def __init__(
         self,
         obs_dim: int,
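
The body of `validate_distributed_learner_capability` is not part of this diff, so the following is only a guess at the kind of check the new class attributes enable. It is a minimal sketch, assuming the helper reads `supports_multi_gpu`, `supported_multi_gpu_sync_modes`, and `supports_multi_gpu_symmetry` from the learner class and raises `ValueError`. The function name `check_multi_gpu_capability`, the `use_symmetry` kwarg lookup, the sync-mode message, and the stand-in learner classes are all hypothetical; the other two messages are copied from the "Common Errors" docs in this PR.

```python
from typing import Any, Optional


def check_multi_gpu_capability(
    *,
    algo_type: str,
    learner_cls: type,
    learner_kwargs: dict[str, Any],
    num_gpus: int,
    sync_mode: str,
) -> None:
    """Raise ValueError if learner_cls has not opted into this multi-GPU setup."""
    if num_gpus <= 1:
        return  # single-GPU runs need no distributed capability
    name = learner_cls.__name__
    if not getattr(learner_cls, "supports_multi_gpu", False):
        raise ValueError(f"{name} does not support training.num_gpus > 1")
    modes = getattr(learner_cls, "supported_multi_gpu_sync_modes", frozenset())
    if sync_mode not in modes:
        raise ValueError(
            f"{algo_type} multi-GPU does not support sync mode {sync_mode!r}; "
            f"expected one of {sorted(modes)}"
        )
    # Assumption: symmetry is requested through a `use_symmetry` learner kwarg.
    wants_symmetry = bool(learner_kwargs.get("use_symmetry", False))
    if wants_symmetry and not getattr(learner_cls, "supports_multi_gpu_symmetry", False):
        raise ValueError("set training.num_gpus=1 or algo.use_symmetry=false")


class SacLikeLearner:
    """Stand-in mirroring the flags this diff adds to FastSACLearner."""

    supports_multi_gpu = True
    supports_multi_gpu_symmetry = False
    supported_multi_gpu_sync_modes = frozenset({"sync_sgd", "local_sgd"})


class Td3LikeLearner:
    """Stand-in mirroring the flags this diff adds to FastTD3Learner."""

    supports_multi_gpu = False
    supports_multi_gpu_symmetry = False
    supported_multi_gpu_sync_modes: frozenset[str] = frozenset()


def rejection_reason(learner_cls: type, **overrides: Any) -> Optional[str]:
    """Return the error message for a 2-GPU local_sgd request, or None if accepted."""
    args: dict[str, Any] = dict(
        algo_type="demo",
        learner_cls=learner_cls,
        learner_kwargs={},
        num_gpus=2,
        sync_mode="local_sgd",
    )
    args.update(overrides)
    try:
        check_multi_gpu_capability(**args)
    except ValueError as exc:
        return str(exc)
    return None
```

Because the check only reads class attributes, it can run before the learner is constructed, which matches the ordering in `build_runner` above where validation precedes `_learner_cls(device=_device, **_learner_kwargs)`.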