Commit f54a9f1 (parent 4fb09bf): fix tests and grouped_gemm installation

Showing 31 changed files with 209 additions and 238 deletions.
@@ -61,7 +61,7 @@ RUN pip3 install -r /requirements.txt && rm /requirements.txt
 RUN pip3 install git+https://github.com/NVIDIA/[email protected] --no-deps --no-build-isolation
 RUN pip3 install flash-attn==2.4.2 --no-build-isolation
 # Install grouped_gemm for MoE acceleration
-RUN pip3 install grouped_gemm
+RUN pip3 install git+https://github.com/tgale96/grouped_gemm[email protected] --no-build-isolation --no-deps

 COPY . /realhf
 RUN REAL_CUDA=1 pip3 install -e /realhf --no-build-isolation
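The change above swaps the unpinned PyPI `grouped_gemm` package for a pinned install straight from the git repository (the exact tag is elided in this capture). As a hedged sketch of the same pinning pattern expressed as a PEP 508 direct-reference requirement string, with the placeholder `vX.Y.Z` standing in for the real tag:

```python
# Sketch: compose a pinned VCS requirement string (PEP 508 direct reference).
# "vX.Y.Z" is a placeholder, NOT the actual grouped_gemm tag.
def vcs_requirement(name: str, repo: str, tag: str) -> str:
    return f"{name} @ git+{repo}@{tag}"

print(vcs_requirement("grouped_gemm",
                      "https://github.com/tgale96/grouped_gemm.git",
                      "vX.Y.Z"))
# grouped_gemm @ git+https://github.com/tgale96/grouped_gemm.git@vX.Y.Z
```

Pinning to a tag this way makes the Docker build reproducible, which is the point of the change.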
@@ -17,30 +17,39 @@
 ***ReaL*** (short for *<ins>ReaL</ins>location*) is a distributed system designed for efficient RLHF training with LLMs. This is the library used to run experiments for the ICML 2024 Oral Paper [Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study](https://arxiv.org/pdf/2404.10719).

-ReaL introduces a novel approach called *parameter reallocation*, which dynamically redistributes LLM parameters across the cluster and adapts parallelization strategies during training. By optimizing allocations and parallelism for each computation workload, ReaL minimizes redundant communication while maximizing GPU utilization.
-
-ReaL achieves significantly higher PPO training throughput compared to state-of-the-art open-source systems.
+ReaL introduces a novel approach called *parameter reallocation*, which dynamically redistributes LLM parameters across the cluster and adapts parallelization strategies during training. By optimizing allocations and parallelism for each computation workload, ReaL achieves significantly higher PPO training throughput compared to state-of-the-art open-source systems.

 (In the following figure, as the number of GPUs increases, the model size scales up from LLaMA 7B, LLaMA 13B, and CodeLLaMA 34B, to the largest LLaMA 70B.)

 ![Throughput Comparison](docs/source/images/vws.svg)

+## News 📢
+
+- **[2024/09/05]** Releasing ReaL v0.3.0 - MoE RLHF, CUDAGraph generation, mini-batched execution, and more customized algorithms.
+
+## Features
+
+- Large-scale and high-throughput SFT/reward modeling/DPO/PPO/generation.
+- MoE model training and generation.
+- PPO tricks, e.g. GAE, advantage/value normalization, and reference EMA.
+- State-of-the-art RLHF algorithms, e.g., [GRPO](https://github.com/openpsi-project/ReaLHF/tree/main/examples/new_algorithms/grpo).
+
 ## Highlights

-### Efficiency
+### 🚀 Efficiency

 - Achieves state-of-the-art training throughput for RLHF using **parameter reallocation**.
-- Supports large-scale training with 3D parallelism, ZeRO optimization, and sequence parallelism.
+- Supports high-throughput generation with CUDAGraph and large-scale training with 3D parallelism.
 - Enables memory-efficient training with parameter and optimizer offloading.

-### Ease of Use
+### ✨ Ease of Use

 - Seamlessly integrates with HuggingFace checkpoints and inference frameworks like vLLM. No checkpoint conversion required.
 - Allows launching local or distributed experiments via [Ray](https://docs.ray.io/en/latest/index.html) or [SLURM](https://slurm.schedmd.com/documentation.html) with a single command.

 Check out our [tutorial](https://openpsi-project.github.io/ReaLHF/quickstart.html) to reproduce the full RLHF procedure (SFT/RW/PPO) with 4×LLaMA-7B in just **30 minutes**.

-### Flexibility
+### 🎯 Flexibility

 - Offers versatile configuration customization with Hydra structured config.
 - Supports many commonly used RLHF algorithms, including DPO, PPO, RAFT, and more.
@@ -61,7 +70,7 @@ export MAX_JOBS=8
 # GPU dependencies, not required on the launcher node.
 pip install git+https://github.com/NVIDIA/[email protected] --no-deps --no-build-isolation
 pip install flash_attn==2.4.2 --no-build-isolation
-pip install grouped_gemm # For MoE
+pip3 install git+https://github.com/tgale96/grouped_gemm[email protected] --no-build-isolation --no-deps # For MoE

 REAL_CUDA=1 pip install -e . --no-build-isolation
 ```
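The `REAL_CUDA=1` prefix above is the documented way to enable ReaL's CUDA extension at install time. As a hedged sketch of how such an environment flag is typically consumed during a build (not ReaL's actual `setup.py`; the helper name is hypothetical):

```python
# Hypothetical sketch: decide whether to build a CUDA extension
# based on an environment flag, as "REAL_CUDA=1 pip install -e ." implies.
def cuda_ext_enabled(env: dict) -> bool:
    # Only an explicit "1" turns the extension build on.
    return env.get("REAL_CUDA", "0") == "1"

print(cuda_ext_enabled({"REAL_CUDA": "1"}))  # True
print(cuda_ext_enabled({}))                  # False
```

Passing the flag inline (`REAL_CUDA=1 pip install ...`) sets it only for that one command, so CPU-only launcher nodes are unaffected.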
@@ -15,51 +15,32 @@ To pull the images, run:

 .. code:: console

-   $ docker pull docker.io/garrett4wade/real-cpu:22.04-${REAL_VERSION}
-   $ docker pull docker.io/garrett4wade/real-gpu:23.10-py3-${REAL_VERSION}
+   $ docker pull docker.io/garrett4wade/real-cpu:22.04-0.3.0
+   $ docker pull docker.io/garrett4wade/real-gpu:24.03-py3-0.3.0

 The CPU image is built from "ubuntu:22.04" and the GPU image is built
-from "nvcr.io/nvidia/pytorch:23.10-py3". You can check the latest
-package version `here
-<https://github.com/openpsi-project/ReaLHF/releases>`_.
+from "nvcr.io/nvidia/pytorch:24.03-py3". You can check the latest docker
+image version `here
+<https://hub.docker.com/r/garrett4wade/real-gpu/tags>`_.

 After pulling the Docker images, run your Docker container locally on a
 GPU node with the following command:

 .. code:: console

-   $ docker run -it --rm --gpus all garrett4wade/real-gpu:23.10-py3-${REAL_VERSION} bash
+   $ docker run -it --rm --gpus all --mount type=bind,src=/path/outside/container,dst=/realhf garrett4wade/real-gpu:24.03-py3-0.3.0 bash

-The source code is available at ``/realhf`` inside the container. This
-is an editable installation, so you can modify the code or run
-experiments directly.
-
-If you want to develop the code outside a Docker container, you should
-mount the code directory to the container, e.g.,
-
-.. code:: console
-
-   $ docker run -it --rm --gpus all --mount type=bind,src=/path/outside/container,dst=/realhf garrett4wade/real-gpu:23.10-py3-${REAL_VERSION} bash
-
-If your destination path is not ``/realhf``, remember to rerun the
-editable installation command after mounting:
-
-.. code:: console
-
-   $ REAL_CUDA=1 pip install -e /your/mounted/code/path --no-build-isolation
-
 .. note::

-   The ``REAL_CUDA`` environment variable is used to install the CUDA
-   extension.
+   There is an editable installation at ``/realhf`` inside the container,
+   so your changes to the code outside the container should automatically
+   take effect.

 *****************************
 Install From PyPI or Source
 *****************************

 If you prefer not to use the provided Docker image, you can also start
 with an image provided by NVIDIA (e.g.,
-``nvcr.io/nvidia/pytorch:23.10-py3``) and install ReaL from PyPI or from
+``nvcr.io/nvidia/pytorch:24.03-py3``) and install ReaL from PyPI or from
 the source.

 .. note::
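The image tags above pair a base-image tag with a ReaL release number (e.g. `24.03-py3` plus `0.3.0` gives `24.03-py3-0.3.0`). A minimal sketch of that naming convention, assuming it holds for other releases; the helper name and dictionary are illustrative, not part of ReaL:

```python
# Illustrative helper: compose ReaL Docker image names from the
# base-image tag and the ReaL release version, as in the docs above.
BASE_TAGS = {"cpu": "22.04", "gpu": "24.03-py3"}

def real_image(kind: str, version: str) -> str:
    return f"garrett4wade/real-{kind}:{BASE_TAGS[kind]}-{version}"

print(real_image("gpu", "0.3.0"))  # garrett4wade/real-gpu:24.03-py3-0.3.0
print(real_image("cpu", "0.3.0"))  # garrett4wade/real-cpu:22.04-0.3.0
```

This also explains the tag bump in the diff: moving the base image from `23.10-py3` to `24.03-py3` changes every GPU image tag.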
@@ -89,9 +70,9 @@ On a GPU machine, also install the required runtime packages:

 .. code:: console

    $ export MAX_JOBS=8 # Set the number of parallel jobs for compilation.
-   $ pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.4 --no-deps --no-build-isolation
+   $ pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8 --no-deps --no-build-isolation
    $ pip install flash_attn==2.4.2 --no-build-isolation
-   $ pip install grouped_gemm # For MoE
+   $ pip3 install git+https://github.com/tgale96/grouped_gemm[email protected] --no-build-isolation --no-deps # For MoE

 .. note::
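`export MAX_JOBS=8` above caps how many compile jobs run in parallel while flash-attn and TransformerEngine build their CUDA kernels, which keeps memory use bounded on smaller machines. A minimal sketch of the usual convention (fall back to the CPU count when the variable is unset); the helper is illustrative, not part of any of these packages:

```python
import os

def max_jobs(env: dict) -> int:
    # Cap parallel compile jobs; fall back to the CPU count when unset.
    default = os.cpu_count() or 1
    return int(env.get("MAX_JOBS", default))

print(max_jobs({"MAX_JOBS": "8"}))  # 8
```

Setting it too high during `flash_attn` compilation is a common cause of out-of-memory failures, which is why the docs pin it explicitly.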