v0.3.0 Release (#68)
* .

* .

* .

* .

* fix tests and grouped_gemm installation

* .
garrett4wade authored Sep 5, 2024
1 parent 4fb09bf commit f54a9f1
Showing 31 changed files with 209 additions and 238 deletions.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -61,7 +61,7 @@ RUN pip3 install -r /requirements.txt && rm /requirements.txt
RUN pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8 --no-deps --no-build-isolation
RUN pip3 install flash-attn==2.4.2 --no-build-isolation
# Install grouped_gemm for MoE acceleration
RUN pip3 install grouped_gemm
RUN pip3 install git+https://github.com/tgale96/grouped_gemm@v0.1.4 --no-build-isolation --no-deps

COPY . /realhf
RUN REAL_CUDA=1 pip3 install -e /realhf --no-build-isolation
25 changes: 17 additions & 8 deletions README.md
@@ -17,30 +17,39 @@

***ReaL*** (short for *<ins>ReaL</ins>location*) is a distributed system designed for efficient RLHF training with LLMs. This is the library used to run experiments for the ICML 2024 Oral Paper [Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study](https://arxiv.org/pdf/2404.10719).

ReaL introduces a novel approach called *parameter reallocation*, which dynamically redistributes LLM parameters across the cluster and adapts parallelization strategies during training. By optimizing allocations and parallelism for each computation workload, ReaL minimizes redundant communication while maximizing GPU utilization.

ReaL achieves significantly higher PPO training throughput compared to state-of-the-art open-source systems.
ReaL introduces a novel approach called *parameter reallocation*, which dynamically redistributes LLM parameters across the cluster and adapts parallelization strategies during training. By optimizing allocations and parallelism for each computation workload, ReaL achieves significantly higher PPO training throughput compared to state-of-the-art open-source systems.

(In the following figure, as the number of GPUs increases, the model size scales up from LLaMA 7B through LLaMA 13B and CodeLLaMA 34B to LLaMA 70B.)

![Throughput Comparison](docs/source/images/vws.svg)

## News 📢

- **[2024/09/05]** Releasing ReaL v0.3.0 - MoE RLHF, CUDAGraph generation, mini-batched execution, and more customized algorithms.

## Features

- Large-scale and high-throughput SFT/reward modeling/DPO/PPO/generation.
- MoE model training and generation.
- PPO tricks, e.g., GAE, advantage/value normalization, and reference EMA.
- State-of-the-art RLHF algorithms, e.g., [GRPO](https://github.com/openpsi-project/ReaLHF/tree/main/examples/new_algorithms/grpo).

## Highlights

### Efficiency
### 🚀 Efficiency

- Achieves state-of-the-art training throughput for RLHF using **parameter reallocation**.
- Supports large-scale training with 3D parallelism, ZeRO optimization, and sequence parallelism.
- Supports high-throughput generation with CUDAGraph and large-scale training with 3D parallelism.
- Enables memory-efficient training with parameter and optimizer offloading.

### Ease of Use

- Seamlessly integrates with HuggingFace checkpoints and inference frameworks like vLLM. No checkpoint conversion required.
- Allows launching local or distributed experiments via [Ray](https://docs.ray.io/en/latest/index.html) or [SLURM](https://slurm.schedmd.com/documentation.html) with a single command.

Check out our [tutorial](https://openpsi-project.github.io/ReaLHF/quickstart.html) to reproduce the full RLHF procedure (SFT/RW/PPO) with 4×LLaMA-7B in just **30 minutes**.
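
For a concrete starting point, a quickstart experiment is launched with a single command. The sketch below assumes the `experiment_name`/`trial_name` overrides shown in the tutorial; all model-, dataset-, and allocation-specific options are omitted:

```bash
# A minimal sketch of launching PPO through the quickstart entry point.
# Only the naming overrides are shown; see the tutorial for the required
# model, dataset, and allocation options for a real run.
python3 -m realhf.apps.quickstart ppo \
    experiment_name=quickstart-ppo \
    trial_name=release
```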

### Flexibility
### 🎯 Flexibility

- Offers versatile configuration customization with Hydra structured config.
- Supports many commonly used RLHF algorithms, including DPO, PPO, RAFT, and more.
@@ -61,7 +70,7 @@ export MAX_JOBS=8
# GPU dependencies, not required on the launcher node.
pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8 --no-deps --no-build-isolation
pip install flash_attn==2.4.2 --no-build-isolation
pip install grouped_gemm # For MoE
pip3 install git+https://github.com/tgale96/grouped_gemm@v0.1.4 --no-build-isolation --no-deps # For MoE

REAL_CUDA=1 pip install -e . --no-build-isolation
```
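
As a quick sanity check (assuming the editable install above succeeded), the package should import cleanly:

```bash
# Importing realhf should succeed without compilation errors.
python3 -c "import realhf"
```
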
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -16,7 +16,7 @@
project = "ReaL"
copyright = "2024, Wei Fu & Zhiyu Mei"
author = "Wei Fu & Zhiyu Mei"
release = "0.1.0"
release = "0.3.0"

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
11 changes: 11 additions & 0 deletions docs/source/contributing.rst
@@ -77,3 +77,14 @@ The GitHub Pages will be updated automatically after the PR is merged.
pytest -m "not distributed"
# On a node with multiple GPUs, run all tests
pytest
************************
Building Docker Images
************************

.. code:: bash

   # Build the GPU image
   docker build -t real-gpu:24.03-0.3.0 -f Dockerfile --target gpu --build-arg REAL_GPU_BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3 --build-arg REAL_CPU_BASE_IMAGE=ubuntu:22.04 .
   # Build the CPU image
   docker build -t real-cpu:22.04-0.3.0 -f Dockerfile --target cpu --build-arg REAL_GPU_BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3 --build-arg REAL_CPU_BASE_IMAGE=ubuntu:22.04 .
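
Once built, the GPU image can be smoke-tested locally. For example (``--gpus all`` assumes the NVIDIA Container Toolkit is installed on the host):

.. code:: bash

   # Start a shell in the freshly built GPU image to verify it runs.
   docker run -it --rm --gpus all real-gpu:24.03-0.3.0 bash
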
43 changes: 12 additions & 31 deletions docs/source/install.rst
@@ -15,51 +15,32 @@ To pull the images, run:

.. code:: console
$ docker pull docker.io/garrett4wade/real-cpu:22.04-${REAL_VERSION}
$ docker pull docker.io/garrett4wade/real-gpu:23.10-py3-${REAL_VERSION}
$ docker pull docker.io/garrett4wade/real-cpu:22.04-0.3.0
$ docker pull docker.io/garrett4wade/real-gpu:24.03-py3-0.3.0
The CPU image is built from "ubuntu:22.04" and the GPU image is built
from "nvcr.io/nvidia/pytorch:23.10-py3". You can check the latest
package version `here
<https://github.com/openpsi-project/ReaLHF/releases>`_.
from "nvcr.io/nvidia/pytorch:24.03-py3". You can check the latest docker
image version `here
<https://hub.docker.com/r/garrett4wade/real-gpu/tags>`_.

After pulling the Docker images, run your Docker container locally on a
GPU node with the following command:

.. code:: console
$ docker run -it --rm --gpus all garrett4wade/real-gpu:23.10-py3-${REAL_VERSION} bash
$ docker run -it --rm --gpus all --mount type=bind,src=/path/outside/container,dst=/realhf garrett4wade/real-gpu:24.03-py3-0.3.0 bash
The source code is available at ``/realhf`` inside the container. This
is an editable installation, so you can modify the code or run
experiments directly.

If you want to develop the code outside a Docker container, you should
mount the code directory to the container, e.g.,

.. code:: console
$ docker run -it --rm --gpus all --mount type=bind,src=/path/outside/container,dst=/realhf garrett4wade/real-gpu:23.10-py3-${REAL_VERSION} bash
If your destination path is not ``/realhf``, remember to rerun the
editable installation command after mounting:

.. code:: console
$ REAL_CUDA=1 pip install -e /your/mounted/code/path --no-build-isolation
.. note::

The ``REAL_CUDA`` environment variable is used to install the CUDA
extension.
There is an editable installation at ``/realhf`` inside the container,
so your changes to the code outside the container should automatically
take effect.
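
As a concrete example, a development loop can mount the working tree and run the non-distributed tests inside the container (a sketch; the image tag follows the pull command above, and the pytest marker is the one used in the contributing guide):

.. code:: console

   $ docker run -it --rm --gpus all --mount type=bind,src=$PWD,dst=/realhf garrett4wade/real-gpu:24.03-py3-0.3.0 bash
   $ # Inside the container:
   $ cd /realhf && pytest -m "not distributed"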

*****************************
Install From PyPI or Source
*****************************

If you prefer not to use the provided Docker image, you can also start
with an image provided by NVIDIA (e.g.,
``nvcr.io/nvidia/pytorch:23.10-py3``) and install ReaL from PyPI or from
``nvcr.io/nvidia/pytorch:24.03-py3``) and install ReaL from PyPI or from
the source.

.. note::
@@ -89,9 +70,9 @@ On a GPU machine, also install the required runtime packages:
.. code:: console
$ export MAX_JOBS=8 # Set the number of parallel jobs for compilation.
$ pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.4 --no-deps --no-build-isolation
$ pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8 --no-deps --no-build-isolation
$ pip install flash_attn==2.4.2 --no-build-isolation
$ pip install grouped_gemm # For MoE
$ pip3 install git+https://github.com/tgale96/grouped_gemm@v0.1.4 --no-build-isolation --no-deps # For MoE
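
A quick import check can confirm that the runtime dependencies compiled correctly (the module names below are the ones each package installs):

.. code:: console

   $ # Each import should succeed once the extensions are built.
   $ python3 -c "import transformer_engine, flash_attn, grouped_gemm"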
.. note::
