Merged
28 changes: 22 additions & 6 deletions docs/sphinx/source/en/2-user_guide/1-training/4-multi_gpu.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
# Multi-GPU

The currently validated multi-GPU training path is SAC in replay-buffer mode.
Use the unified CLI as usual, and enable multiple GPUs with the shared
off-policy field `training.num_gpus`.
The currently validated multi-GPU training paths are SAC/FastSAC and FlashSAC
in replay-buffer mode. Use the unified CLI as usual, and enable multiple GPUs
with the shared off-policy field `training.num_gpus`. The multi-GPU runner is a
generic off-policy orchestration layer, but each learner must explicitly opt
into the distributed learner contract.

The multi-GPU runner keeps algorithm code separate from IPC: a collector fills
the CPU replay buffer on the host, the runner packs batches for each learner
Expand All @@ -27,7 +29,8 @@ this avoids extending communication to AdamW momentum state.
When `algo.obs_normalization=true`, each learner rank updates its observation
normalizer from cross-rank global batch moments; rank 0 publishes the matching
mean/std to the CPU collector at the same synchronization point as actor
weights.
weights. FlashSAC reward normalization keeps the replay-order update on rank 0
and broadcasts the normalizer state to other ranks before learner updates.
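
The cross-rank moment aggregation described above can be illustrated with a small, framework-free sketch. The diff does not show how the learners actually combine their statistics, so everything here is an assumption: `local_sums`, `all_reduce_sum`, and `global_moments` are hypothetical names, and the collective is simulated with a plain Python sum rather than a real distributed call. The idea is that each rank contributes count, sum, and sum of squares, and every rank derives the same global mean and variance from the reduced totals.

```python
from typing import List, Sequence, Tuple

Sums = Tuple[int, float, float]


def local_sums(batch: Sequence[float]) -> Sums:
    """Per-rank sufficient statistics: count, sum, and sum of squares."""
    return len(batch), sum(batch), sum(x * x for x in batch)


def all_reduce_sum(per_rank: List[Sums]) -> Sums:
    """Stand-in for a collective sum (hypothetical; no real IPC happens here).

    In a real run every rank would receive this same reduced tuple.
    """
    count = sum(s[0] for s in per_rank)
    total = sum(s[1] for s in per_rank)
    total_sq = sum(s[2] for s in per_rank)
    return count, total, total_sq


def global_moments(per_rank_batches: List[Sequence[float]]) -> Tuple[float, float]:
    """Mean and population variance of the concatenated global batch."""
    count, total, total_sq = all_reduce_sum([local_sums(b) for b in per_rank_batches])
    mean = total / count
    # Clamp at zero: E[x^2] - E[x]^2 can go slightly negative in floating point.
    var = max(total_sq / count - mean * mean, 0.0)
    return mean, var


# Two ranks with different local batch sizes.
mean, var = global_moments([[1.0, 2.0, 3.0], [10.0, 20.0]])
```

Because every rank sees the same reduced totals, the resulting normalizer statistics match across ranks, which is what allows rank 0 alone to publish them to the collector. The sum-of-squares form is simple but can lose precision when the variance is small relative to the mean; a pairwise merge of per-rank means and variances is a more robust alternative.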

For strict per-update gradient averaging, set
`training.multi_gpu_sync_mode=sync_sgd`. That mode is closer to single-GPU
Expand All @@ -53,7 +56,9 @@ single-GPU `algo.batch_size=8192` corresponds to two-GPU

## Preconditions

- SAC only: `training.num_gpus > 1` rejects TD3, FlashSAC, PPO, MLX PPO, and APPO.
- FastSAC and FlashSAC learners support this path; `training.num_gpus > 1`
rejects TD3, PPO, MLX PPO, APPO, and custom SAC runtimes until their learners
declare support.
- CUDA is required; select physical cards with `CUDA_VISIBLE_DEVICES`.
- SAC symmetry augmentation is not supported in multi-GPU mode. If the task
owner enables it by default, set `algo.use_symmetry=false`.
Expand Down Expand Up @@ -95,6 +100,16 @@ CUDA_VISIBLE_DEVICES=0,7 uv run train --algo sac --task g1_walk_flat --sim mujoc

Logs still use SAC's default directory: `logs/fast_sac/<TaskName>/`.

FlashSAC uses the same knobs:

```bash
uv run train --algo flashsac --task g1_walk_flat --sim mujoco \
training.num_gpus=2 \
training.multi_gpu_sync_mode=local_sgd
```

FlashSAC logs still use `logs/flash_sac/<TaskName>/`.

## Performance Checks

Multi-GPU mainly targets learner update bottlenecks. The collector is still one
Expand All @@ -116,7 +131,8 @@ compare steady-state `perf/iter_ms`, `timing/learner_train_ms`,

## Common Errors

- `Only SAC supports training.num_gpus > 1`: only SAC is validated right now.
- `<Learner> does not support training.num_gpus > 1`: that learner has not
declared and validated the multi-GPU contract yet.
- `SAC multi-GPU training requires a CUDA device`: CUDA is unavailable, or
`training.device` was set to CPU.
- `set training.num_gpus=1 or algo.use_symmetry=false`: multi-GPU SAC does not
Expand Down
10 changes: 6 additions & 4 deletions docs/sphinx/source/en/2-user_guide/2-algorithms/3-sac.md
Expand Up @@ -11,10 +11,12 @@ The off-policy runner decouples CPU simulation from GPU learning through shared
memory: a collector subprocess fills a CPU-resident replay buffer while the
learner trains on the GPU.

SAC is also the currently validated replay-buffer multi-GPU algorithm. Enable it
with `training.num_gpus > 1`; the host side packs and distributes batches in
parallel, while the GPU learners default to delayed parameter averaging via
`training.multi_gpu_sync_mode=local_sgd`. See
The default FastSAC learner is also the currently validated replay-buffer
multi-GPU SAC implementation. Enable it with `training.num_gpus > 1`; the host
side packs and distributes batches in parallel, while the GPU learners default
to delayed parameter averaging via `training.multi_gpu_sync_mode=local_sgd`.
Custom SAC runtimes must explicitly declare the distributed learner contract
before they can use this path. See
{doc}`../1-training/4-multi_gpu` for the full command, strict-sync fallback, and
constraints.

Expand Down
14 changes: 12 additions & 2 deletions docs/sphinx/source/en/2-user_guide/2-algorithms/5-flash_sac.md
Expand Up @@ -30,7 +30,17 @@ playback video. See {doc}`/en/1-getting_started/3-evaluation_and_playback`.
- `algo.algo_params.actor_num_blocks=2`
- `algo.algo_params.critic_num_blocks=2`

`scripts/train_offpolicy.py` rejects `training.num_gpus > 1` for FlashSAC, so
keep the default single-GPU path unless the implementation changes.
FlashSAC supports the shared off-policy multi-GPU runner. Enable it with:

```bash
uv run train --algo flashsac --task g1_walk_flat --sim mujoco \
training.num_gpus=2 \
training.multi_gpu_sync_mode=local_sgd
```

Multi-GPU FlashSAC requires CUDA and synchronized collection. The learner owns
its distributed synchronization hooks: gradients are averaged in `sync_sgd`,
parameters and persistent normalization buffers are averaged in `local_sgd`, and
reward normalizer state is updated on rank 0 then broadcast to the other ranks.
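
The difference between the two sync modes can be shown with a toy simulation. This is a sketch under stated assumptions, not the learner's code: each "rank" holds one scalar parameter and minimizes `0.5 * (w - target)^2` on its own data, the function names are hypothetical, and the `sync_interval` argument only mirrors the role of `training.multi_gpu_sync_interval`. In `sync_sgd` the gradients are averaged on every update; in `local_sgd` each rank steps independently and the parameters are averaged every `sync_interval` iterations.

```python
from typing import List


def grad(w: float, target: float) -> float:
    """Gradient of the toy loss 0.5 * (w - target)^2."""
    return w - target


def run_sync_sgd(targets: List[float], steps: int, lr: float = 0.1) -> List[float]:
    """Average gradients across ranks before every update."""
    weights = [0.0] * len(targets)
    for _ in range(steps):
        mean_grad = sum(grad(w, t) for w, t in zip(weights, targets)) / len(targets)
        weights = [w - lr * mean_grad for w in weights]
    return weights


def run_local_sgd(
    targets: List[float], steps: int, sync_interval: int, lr: float = 0.1
) -> List[float]:
    """Step locally, then average parameters every `sync_interval` iterations."""
    weights = [0.0] * len(targets)
    for step in range(1, steps + 1):
        weights = [w - lr * grad(w, t) for w, t in zip(weights, targets)]
        if step % sync_interval == 0:
            mean_w = sum(weights) / len(weights)
            weights = [mean_w] * len(weights)
    return weights


sync_weights = run_sync_sgd([1.0, 3.0], steps=200)
local_weights = run_local_sgd([1.0, 3.0], steps=200, sync_interval=4)
```

In this linear toy problem both modes converge to the mean of the per-rank targets, and the ranks agree exactly right after each synchronization. With nonlinear models and stateful optimizers the two modes generally diverge between sync points, which is the trade-off the documentation describes: `local_sgd` communicates less often, while `sync_sgd` stays closer to single-GPU behavior.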

The log root is `logs/flash_sac/<task>/`.
26 changes: 20 additions & 6 deletions docs/sphinx/source/zh_CN/2-user_guide/1-training/4-multi_gpu.md
@@ -1,8 +1,9 @@
# 多 GPU

当前已验证的多 GPU 训练路径是 SAC 的 replay-buffer 模式。入口仍然是统一 CLI:
`uv run train --algo sac ...`,多卡由共享 off-policy 配置字段
`training.num_gpus` 打开。
当前已验证的多 GPU 训练路径是 SAC/FastSAC 和 FlashSAC 的 replay-buffer 模式。入口
仍然是统一 CLI,多卡由共享 off-policy 配置字段 `training.num_gpus` 打开。多 GPU
runner 是通用的 off-policy 编排层,但 learner 必须通过分布式 learner contract 显式声
明支持。

多 GPU runner 保持算法与 IPC 隔离:collector 在 host 侧填充 CPU replay buffer,
runner 根据各 learner rank 的请求打包 batch,并通过 pinned-memory pipeline 并行分
Expand All @@ -22,7 +23,8 @@ learner iteration 同步一次;增大该值可以进一步减少 4 卡、8 卡

开启 `algo.obs_normalization=true` 时,每个 learner rank 使用跨 rank 聚合后的全局
batch moments 更新 observation normalizer;rank0 在发布 actor 权重给 CPU
collector 的同一同步点发布对应 mean/std。
collector 的同一同步点发布对应 mean/std。FlashSAC 的 reward normalization 保持由
rank0 按 replay 写入顺序更新,并在 learner update 前广播 normalizer 状态给其它 rank。

如需严格的每次 update 梯度平均,可显式设置
`training.multi_gpu_sync_mode=sync_sgd`。该模式更接近单卡 global batch 的同步
Expand All @@ -44,7 +46,8 @@ batch**,不是跨所有 GPU 的 global batch。`training.num_gpus=N` 时,每

## 前置条件

- 只支持 SAC:`training.num_gpus > 1` 会拒绝 TD3、FlashSAC、PPO、MLX PPO 和 APPO。
- FastSAC 与 FlashSAC learner 支持该路径;`training.num_gpus > 1` 会拒绝尚未声明该
能力的 TD3、PPO、MLX PPO、APPO 和 custom SAC runtime。
- 必须使用 CUDA 设备;用 `CUDA_VISIBLE_DEVICES` 选择物理卡。
- SAC 的对称增强当前不支持多卡;若任务 owner 默认开启,需要设置
`algo.use_symmetry=false`。
Expand Down Expand Up @@ -85,6 +88,16 @@ CUDA_VISIBLE_DEVICES=0,7 uv run train --algo sac --task g1_walk_flat --sim mujoc

日志仍写入 SAC 的默认目录:`logs/fast_sac/<TaskName>/`。

FlashSAC 使用同一组多卡参数:

```bash
uv run train --algo flashsac --task g1_walk_flat --sim mujoco \
training.num_gpus=2 \
training.multi_gpu_sync_mode=local_sgd
```

FlashSAC 日志仍写入 `logs/flash_sac/<TaskName>/`。

## 性能检查

多 GPU 主要减少 learner 更新瓶颈。collector 仍是单个 CPU 进程,所以
Expand All @@ -103,7 +116,8 @@ ring-buffer 窗口,让 CPU 随机 gather 与下一次 env step 重叠,同时

## 常见错误

- `Only SAC supports training.num_gpus > 1`:当前只验证 SAC。
- `<Learner> does not support training.num_gpus > 1`:该 learner 尚未声明并验证多
GPU contract。
- `SAC multi-GPU training requires a CUDA device`:没有可用 CUDA,或
`training.device` 被设成了 CPU。
- `set training.num_gpus=1 or algo.use_symmetry=false`:多卡 SAC 暂不支持对称增
Expand Down
7 changes: 4 additions & 3 deletions docs/sphinx/source/zh_CN/2-user_guide/2-algorithms/3-sac.md
Expand Up @@ -9,10 +9,11 @@ SAC 通过共享的 off-policy 入口 `scripts/train_offpolicy.py` 选择,TD3
off-policy runner 通过 shared memory 把 CPU 仿真与 GPU 学习解耦:collector 子进程
填充驻留在 CPU 上的 replay buffer,learner 在 GPU 上训练。

SAC 也是当前已验证的 replay-buffer 多 GPU 训练算法。多卡模式通过
默认 FastSAC learner 也是当前已验证的 replay-buffer 多 GPU SAC 实现。多卡模式通过
`training.num_gpus > 1` 打开,host 侧并行打包并分发 batch,多张 GPU 上的 learner
默认使用 `training.multi_gpu_sync_mode=local_sgd` 做 delayed-sync 参数平均。完整命
令、严格同步回退和限制见 {doc}`../1-training/4-multi_gpu`。
默认使用 `training.multi_gpu_sync_mode=local_sgd` 做 delayed-sync 参数平均。custom
SAC runtime 必须显式声明 distributed learner contract 后才能使用这条路径。完整命令、
严格同步回退和限制见 {doc}`../1-training/4-multi_gpu`。

## 快速开始

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,16 @@ uv run train --algo flashsac --task go2_joystick_flat --sim mujoco training.no_p
- `algo.algo_params.actor_num_blocks=2`
- `algo.algo_params.critic_num_blocks=2`

`scripts/train_offpolicy.py` 会拒绝 FlashSAC 的 `training.num_gpus > 1`,因此除非实
现发生变化,否则请保持默认的单 GPU 路径。
FlashSAC 支持共享的 off-policy 多 GPU runner。使用方式:

```bash
uv run train --algo flashsac --task g1_walk_flat --sim mujoco \
training.num_gpus=2 \
training.multi_gpu_sync_mode=local_sgd
```

多 GPU FlashSAC 要求 CUDA 和同步采集。learner 自己拥有分布式同步 hook:
`sync_sgd` 下同步梯度,`local_sgd` 下平均参数和 persistent normalization buffer,
reward normalizer 由 rank0 按 replay 写入顺序更新后广播给其它 rank。

日志根目录为 `logs/flash_sac/<task>/`。
27 changes: 23 additions & 4 deletions scripts/train_offpolicy.py
Expand Up @@ -178,8 +178,6 @@ def build_runner(algo_name: str, cfg: DictConfig):
num_gpus = int(getattr(cfg.training, "num_gpus", 1))
multi_gpu_sync_mode = str(getattr(cfg.training, "multi_gpu_sync_mode", "local_sgd"))
multi_gpu_sync_interval = int(getattr(cfg.training, "multi_gpu_sync_interval", 1))
if num_gpus > 1 and algo_name != "sac":
raise ValueError("Only SAC supports training.num_gpus > 1 in this validation round")

sync_collection = not bool(cfg.training.no_sync_collection)

Expand Down Expand Up @@ -270,15 +268,25 @@ def build_runner(algo_name: str, cfg: DictConfig):
"critic_obs_dim": _critic_dim,
**_learner_extra_kwargs,
}
_learner = _learner_cls(device=_device, **_learner_kwargs)

if num_gpus > 1:
from unilab.algos.torch.offpolicy.distributed import (
validate_distributed_learner_capability,
)
from unilab.algos.torch.offpolicy.multi_gpu_runner import MultiGPUOffPolicyRunner

if not str(_device).startswith("cuda"):
raise ValueError("SAC multi-GPU training requires a CUDA device")
raise ValueError(f"{_algo_type} multi-GPU training requires a CUDA device")
if not sync_collection:
raise ValueError("Multi-GPU off-policy replay requires synchronized collection")
validate_distributed_learner_capability(
algo_type=_algo_type,
learner_cls=_learner_cls,
learner_kwargs=_learner_kwargs,
num_gpus=num_gpus,
sync_mode=multi_gpu_sync_mode,
)
_learner = _learner_cls(device=_device, **_learner_kwargs)
return MultiGPUOffPolicyRunner(
learner=_learner,
env_name=cfg.training.task_name,
Expand Down Expand Up @@ -312,6 +320,7 @@ def build_runner(algo_name: str, cfg: DictConfig):
nan_guard_cfg=_nan_guard_cfg,
)

_learner = _learner_cls(device=_device, **_learner_kwargs)
return DoubleBufferOffPolicyRunner(
learner=_learner,
env_name=cfg.training.task_name,
Expand Down Expand Up @@ -344,11 +353,21 @@ def build_runner(algo_name: str, cfg: DictConfig):
if algo_name == "td3":
from unilab.algos.torch.common.device import get_env_dims
from unilab.algos.torch.fast_td3.learner import FastTD3Learner
from unilab.algos.torch.offpolicy.distributed import (
validate_distributed_learner_capability,
)
from unilab.algos.torch.offpolicy.double_buffer_runner import (
DoubleBufferOffPolicyRunner,
)
from unilab.utils.device import get_default_device

validate_distributed_learner_capability(
learner_cls=FastTD3Learner,
algo_type="td3",
learner_kwargs={},
num_gpus=num_gpus,
sync_mode=multi_gpu_sync_mode,
)
_device = cfg.training.device or get_default_device()
_obs_dim, _action_dim, _critic_dim = get_env_dims(
cfg.training.task_name,
Expand Down
4 changes: 4 additions & 0 deletions src/unilab/algos/torch/fast_sac/learner.py
Expand Up @@ -378,6 +378,10 @@ class FastSACLearner:
- Distributional critic (C51, num_atoms=101)
"""

supports_multi_gpu = True
supports_multi_gpu_symmetry = False
supported_multi_gpu_sync_modes = frozenset({"sync_sgd", "local_sgd"})

def __init__(
self,
obs_dim: int,
Expand Down
4 changes: 4 additions & 0 deletions src/unilab/algos/torch/fast_td3/learner.py
Expand Up @@ -134,6 +134,10 @@ class FastTD3Learner:
- Observation normalization
"""

supports_multi_gpu = False
supports_multi_gpu_symmetry = False
supported_multi_gpu_sync_modes: frozenset[str] = frozenset()

def __init__(
self,
obs_dim: int,
Expand Down
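
The body of `validate_distributed_learner_capability` is not part of the visible diff, so the following is only a guess at how a check driven by the three class attributes above might look. The function name `check_multi_gpu_capability`, the demo learner classes, and the `use_symmetry` key in `learner_kwargs` are all hypothetical; only the attribute names and the two error messages quoted in the docs come from the PR.

```python
from typing import Any, Mapping


def check_multi_gpu_capability(
    *,
    algo_type: str,
    learner_cls: type,
    learner_kwargs: Mapping[str, Any],
    num_gpus: int,
    sync_mode: str,
) -> None:
    """Hypothetical stand-in for the real validator; raises ValueError on mismatch."""
    if num_gpus <= 1:
        return  # Single-GPU runs need no distributed capability.

    name = learner_cls.__name__
    # Learners that never declared the attribute are treated as unsupported.
    if not getattr(learner_cls, "supports_multi_gpu", False):
        raise ValueError(f"{name} does not support training.num_gpus > 1")

    modes = getattr(learner_cls, "supported_multi_gpu_sync_modes", frozenset())
    if sync_mode not in modes:
        raise ValueError(
            f"{name} ({algo_type}) does not support "
            f"training.multi_gpu_sync_mode={sync_mode}"
        )

    # Assumption: symmetry augmentation reaches the learner as a kwarg.
    wants_symmetry = bool(learner_kwargs.get("use_symmetry", False))
    if wants_symmetry and not getattr(learner_cls, "supports_multi_gpu_symmetry", False):
        raise ValueError("set training.num_gpus=1 or algo.use_symmetry=false")


class DemoSupportedLearner:
    """Mirrors the attributes declared on FastSACLearner in this PR."""

    supports_multi_gpu = True
    supports_multi_gpu_symmetry = False
    supported_multi_gpu_sync_modes = frozenset({"sync_sgd", "local_sgd"})


class DemoUnsupportedLearner:
    """Mirrors the attributes declared on FastTD3Learner in this PR."""

    supports_multi_gpu = False
    supports_multi_gpu_symmetry = False
    supported_multi_gpu_sync_modes: frozenset = frozenset()
```

Checking class attributes rather than instances would let the caller reject an unsupported configuration before constructing the learner, which is consistent with the diff moving `_learner_cls(...)` to after the validation call.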