v0.3.0 Release (#68)
* .

* .

* .

* .

* fix tests and grouped_gemm installation

* .
garrett4wade authored Sep 5, 2024
1 parent 4fb09bf commit f54a9f1
Showing 31 changed files with 209 additions and 238 deletions.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -61,7 +61,7 @@ RUN pip3 install -r /requirements.txt && rm /requirements.txt
RUN pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8 --no-deps --no-build-isolation
RUN pip3 install flash-attn==2.4.2 --no-build-isolation
# Install grouped_gemm for MoE acceleration
RUN pip3 install grouped_gemm
RUN pip3 install git+https://github.com/tgale96/grouped_gemm@v0.1.4 --no-build-isolation --no-deps

COPY . /realhf
RUN REAL_CUDA=1 pip3 install -e /realhf --no-build-isolation
25 changes: 17 additions & 8 deletions README.md
@@ -17,30 +17,39 @@

***ReaL*** (short for *<ins>ReaL</ins>location*) is a distributed system designed for efficient RLHF training with LLMs. This is the library used to run experiments for the ICML 2024 Oral Paper [Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study](https://arxiv.org/pdf/2404.10719).

ReaL introduces a novel approach called *parameter reallocation*, which dynamically redistributes LLM parameters across the cluster and adapts parallelization strategies during training. By optimizing allocations and parallelism for each computation workload, ReaL minimizes redundant communication while maximizing GPU utilization.

ReaL achieves significantly higher PPO training throughput compared to state-of-the-art open-source systems.
ReaL introduces a novel approach called *parameter reallocation*, which dynamically redistributes LLM parameters across the cluster and adapts parallelization strategies during training. By optimizing allocations and parallelism for each computation workload, ReaL achieves significantly higher PPO training throughput compared to state-of-the-art open-source systems.

(In the following figure, as the number of GPUs increases, the model size scales up from LLaMA 7B through LLaMA 13B and CodeLLaMA 34B to LLaMA 70B.)

![Throughput Comparison](docs/source/images/vws.svg)

## News 📢

- **[2024/09/05]** Releasing ReaL v0.3.0 - MoE RLHF, CUDAGraph generation, mini-batched execution, and more customized algorithms.

## Features

- Large-scale and high-throughput SFT/reward modeling/DPO/PPO/generation.
- MoE model training and generation.
- PPO tricks, e.g., GAE, advantage/value normalization, and reference EMA.
- State-of-the-art RLHF algorithms, e.g., [GRPO](https://github.com/openpsi-project/ReaLHF/tree/main/examples/new_algorithms/grpo).

## Highlights

### Efficiency
### 🚀 Efficiency

- Achieves state-of-the-art training throughput for RLHF using **parameter reallocation**.
- Supports large-scale training with 3D parallelism, ZeRO optimization, and sequence parallelism.
- Supports high-throughput generation with CUDAGraph and large-scale training with 3D parallelism.
- Enables memory-efficient training with parameter and optimizer offloading.

### Ease of Use

- Seamlessly integrates with HuggingFace checkpoints and inference frameworks like vLLM. No checkpoint conversion required.
- Allows launching local or distributed experiments via [Ray](https://docs.ray.io/en/latest/index.html) or [SLURM](https://slurm.schedmd.com/documentation.html) with a single command.

Check out our [tutorial](https://openpsi-project.github.io/ReaLHF/quickstart.html) to reproduce the full RLHF procedure (SFT/RW/PPO) with 4×LLaMA-7B in just **30 minutes**.
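
For a concrete starting point, a quickstart experiment is launched with a single command. The sketch below assumes the `experiment_name`/`trial_name` overrides shown in the tutorial; all model-, dataset-, and allocation-specific options are omitted:

```bash
# A minimal sketch of launching PPO through the quickstart entry point.
# Only the naming overrides are shown; see the tutorial for the required
# model, dataset, and allocation options for a real run.
python3 -m realhf.apps.quickstart ppo \
    experiment_name=quickstart-ppo \
    trial_name=release
```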

### Flexibility
### 🎯 Flexibility

- Offers versatile configuration customization with Hydra structured config.
- Supports many commonly used RLHF algorithms, including DPO, PPO, RAFT, and more.
@@ -61,7 +70,7 @@ export MAX_JOBS=8
# GPU dependencies, not required on the launcher node.
pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8 --no-deps --no-build-isolation
pip install flash_attn==2.4.2 --no-build-isolation
pip install grouped_gemm # For MoE
pip3 install git+https://github.com/tgale96/grouped_gemm@v0.1.4 --no-build-isolation --no-deps # For MoE

REAL_CUDA=1 pip install -e . --no-build-isolation
```
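
As a quick sanity check (assuming the editable install above succeeded), the package should import cleanly:

```bash
# Importing realhf should succeed without compilation errors.
python3 -c "import realhf"
```
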
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -16,7 +16,7 @@
project = "ReaL"
copyright = "2024, Wei Fu & Zhiyu Mei"
author = "Wei Fu & Zhiyu Mei"
release = "0.1.0"
release = "0.3.0"

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
11 changes: 11 additions & 0 deletions docs/source/contributing.rst
@@ -77,3 +77,14 @@ The GitHub Pages will be updated automatically after the PR is merged.
pytest -m "not distributed"
# On a node with multiple GPUs, run all tests
pytest
************************
Building Docker Images
************************

.. code:: bash

   # Build the GPU image
   docker build -t real-gpu:24.03-0.3.0 -f Dockerfile --target gpu --build-arg REAL_GPU_BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3 --build-arg REAL_CPU_BASE_IMAGE=ubuntu:22.04 .
   # Build the CPU image
   docker build -t real-cpu:22.04-0.3.0 -f Dockerfile --target cpu --build-arg REAL_GPU_BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3 --build-arg REAL_CPU_BASE_IMAGE=ubuntu:22.04 .
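
Once built, the GPU image can be smoke-tested locally. For example (``--gpus all`` assumes the NVIDIA Container Toolkit is installed on the host):

.. code:: bash

   # Start a shell in the freshly built GPU image to verify it runs.
   docker run -it --rm --gpus all real-gpu:24.03-0.3.0 bash
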
43 changes: 12 additions & 31 deletions docs/source/install.rst
@@ -15,51 +15,32 @@ To pull the images, run:

.. code:: console
$ docker pull docker.io/garrett4wade/real-cpu:22.04-${REAL_VERSION}
$ docker pull docker.io/garrett4wade/real-gpu:23.10-py3-${REAL_VERSION}
$ docker pull docker.io/garrett4wade/real-cpu:22.04-0.3.0
$ docker pull docker.io/garrett4wade/real-gpu:24.03-py3-0.3.0
The CPU image is built from "ubuntu:22.04" and the GPU image is built
from "nvcr.io/nvidia/pytorch:23.10-py3". You can check the latest
package version `here
<https://github.com/openpsi-project/ReaLHF/releases>`_.
from "nvcr.io/nvidia/pytorch:24.03-py3". You can check the latest docker
image version `here
<https://hub.docker.com/r/garrett4wade/real-gpu/tags>`_.

After pulling the Docker images, run your Docker container locally on a
GPU node with the following command:

.. code:: console
$ docker run -it --rm --gpus all garrett4wade/real-gpu:23.10-py3-${REAL_VERSION} bash
$ docker run -it --rm --gpus all --mount type=bind,src=/path/outside/container,dst=/realhf garrett4wade/real-gpu:24.03-py3-0.3.0 bash
The source code is available at ``/realhf`` inside the container. This
is an editable installation, so you can modify the code or run
experiments directly.

If you want to develop the code outside a Docker container, you should
mount the code directory to the container, e.g.,

.. code:: console
$ docker run -it --rm --gpus all --mount type=bind,src=/path/outside/container,dst=/realhf garrett4wade/real-gpu:23.10-py3-${REAL_VERSION} bash
If your destination path is not ``/realhf``, remember to rerun the
editable installation command after mounting:

.. code:: console
$ REAL_CUDA=1 pip install -e /your/mounted/code/path --no-build-isolation
.. note::

The ``REAL_CUDA`` environment variable is used to install the CUDA
extension.
There is an editable installation at ``/realhf`` inside the container,
so your changes to the code outside the container should automatically
take effect.
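
As a concrete example, a development loop can mount the working tree and run the non-distributed tests inside the container (a sketch; the image tag follows the pull command above, and the pytest marker is the one used in the contributing guide):

.. code:: console

   $ docker run -it --rm --gpus all --mount type=bind,src=$PWD,dst=/realhf garrett4wade/real-gpu:24.03-py3-0.3.0 bash
   $ # Inside the container:
   $ cd /realhf && pytest -m "not distributed"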

*****************************
Install From PyPI or Source
*****************************

If you prefer not to use the provided Docker image, you can also start
with an image provided by NVIDIA (e.g.,
``nvcr.io/nvidia/pytorch:23.10-py3``) and install ReaL from PyPI or from
``nvcr.io/nvidia/pytorch:24.03-py3``) and install ReaL from PyPI or from
the source.

.. note::
@@ -89,9 +70,9 @@ On a GPU machine, also install the required runtime packages:
.. code:: console
$ export MAX_JOBS=8 # Set the number of parallel jobs for compilation.
$ pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.4 --no-deps --no-build-isolation
$ pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8 --no-deps --no-build-isolation
$ pip install flash_attn==2.4.2 --no-build-isolation
$ pip install grouped_gemm # For MoE
$ pip3 install git+https://github.com/tgale96/grouped_gemm@v0.1.4 --no-build-isolation --no-deps # For MoE
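
A quick import check can confirm that the runtime dependencies compiled correctly (the module names below are the ones each package installs):

.. code:: console

   $ # Each import should succeed once the extensions are built.
   $ python3 -c "import transformer_engine, flash_attn, grouped_gemm"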
.. note::
