Add DeepSpeed Example with Pytorch Operator (#2235)

Syulin7 · web-flow · commit 2d58b49071d1 · 2024-10-17T17:10:19.000Z
Signed-off-by: Syulin7 <735122171@qq.com>
diff --git a/.github/workflows/publish-example-images.yaml b/.github/workflows/publish-example-images.yaml
@@ -73,3 +73,7 @@ jobs:
             platforms: linux/amd64,linux/arm64
             dockerfile: examples/jax/cpu-demo/Dockerfile
             context: examples/jax/cpu-demo
+          - component-name: pytorch-deepspeed-demo
+            platforms: linux/amd64
+            dockerfile: examples/pytorch/deepspeed-demo/Dockerfile
+            context: examples/pytorch/deepspeed-demo
diff --git a/examples/pytorch/deepspeed-demo/Dockerfile b/examples/pytorch/deepspeed-demo/Dockerfile
@@ -0,0 +1,11 @@
+FROM deepspeed/deepspeed:v072_torch112_cu117
+
+RUN apt update
+RUN apt install -y ninja-build
+
+WORKDIR /
+COPY requirements.txt .
+COPY train_bert_ds.py .
+
+RUN pip install -r requirements.txt
+RUN mkdir -p /root/deepspeed_data
diff --git a/examples/pytorch/deepspeed-demo/README.md b/examples/pytorch/deepspeed-demo/README.md
@@ -0,0 +1,37 @@
+## Training a Masked Language Model with PyTorch and DeepSpeed
+
+This folder contains an example of training a Masked Language Model with PyTorch and DeepSpeed.
+
+`train_bert_ds.py` is the Python script used to train BERT with PyTorch and DeepSpeed. For more information, refer to the [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md) README.
+
+DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate.
+See [deepspeed](https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#deployment).
+
+This guide shows how to deploy DeepSpeed with the `torchrun` launcher.
+The simplest way to reproduce it is to check out the following DeepSpeedExamples commit:
+```shell
+git clone https://github.com/microsoft/DeepSpeedExamples.git
+cd DeepSpeedExamples
+git checkout efacebb
+```
+
+The script `train_bert_ds.py` is located in the `DeepSpeedExamples/HelloDeepSpeed/` directory.
+Because the script is launched with `torchrun` rather than the deepspeed launcher, it must read `local_rank` from the `LOCAL_RANK` environment variable.
+The following line was added at line 670:
+```
+local_rank = int(os.getenv('LOCAL_RANK', '-1'))
+```
+
+### Build Image
+
+The default image name and tag are `kubeflow/pytorch-deepspeed-demo:latest`.
+
+```shell
+docker build -f Dockerfile -t kubeflow/pytorch-deepspeed-demo:latest ./
+```
+
+### Create the PyTorchJob with DeepSpeed example
+
+```shell
+kubectl create -f pytorch_deepspeed_demo.yaml
+```
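Note on the README above: it says `train_bert_ds.py` reads `local_rank` from the environment because it is started with `torchrun`. The pattern can be sketched roughly as follows. The helper name `read_torchrun_env` is hypothetical and not part of this commit. The sketch assumes the `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` variables that `torchrun` exports to each worker process, and it keeps the README's fallback of `-1` when `LOCAL_RANK` is unset.

```python
import os


def read_torchrun_env(environ=None):
    """Return (local_rank, rank, world_size) from torchrun-style variables.

    Hypothetical helper for illustration only; the commit itself adds a
    single line that reads LOCAL_RANK.
    """
    env = os.environ if environ is None else environ
    # Same fallback as the README snippet: -1 means "not launched by torchrun".
    local_rank = int(env.get("LOCAL_RANK", "-1"))
    # Assumed defaults for a single-process run outside torchrun.
    rank = int(env.get("RANK", "-1"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return local_rank, rank, world_size


if __name__ == "__main__":
    local_rank, rank, world_size = read_torchrun_env()
    print(f"local_rank={local_rank} rank={rank} world_size={world_size}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without launching a distributed job.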
diff --git a/examples/pytorch/deepspeed-demo/pytorch_deepspeed_demo.yaml b/examples/pytorch/deepspeed-demo/pytorch_deepspeed_demo.yaml
@@ -0,0 +1,38 @@
+apiVersion: "kubeflow.org/v1"
+kind: PyTorchJob
+metadata:
+  name: pytorch-deepspeed-demo
+spec:
+  pytorchReplicaSpecs:
+    Master:
+      replicas: 1
+      restartPolicy: OnFailure
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: kubeflow/pytorch-deepspeed-demo:latest
+              command:
+                - torchrun
+                - /train_bert_ds.py
+                - --checkpoint_dir
+                - /root/deepspeed_data
+              resources:
+                limits:
+                  nvidia.com/gpu: 1
+    Worker:
+      replicas: 1
+      restartPolicy: OnFailure
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: kubeflow/pytorch-deepspeed-demo:latest
+              command:
+                - torchrun
+                - /train_bert_ds.py
+                - --checkpoint_dir
+                - /root/deepspeed_data
+              resources:
+                limits:
+                  nvidia.com/gpu: 1
diff --git a/examples/pytorch/deepspeed-demo/requirements.txt b/examples/pytorch/deepspeed-demo/requirements.txt
@@ -0,0 +1,8 @@
+datasets==1.13.3
+transformers==4.5.1
+fire==0.4.0
+pytz==2021.1
+loguru==0.5.3
+sh==1.14.2
+pytest==6.2.5
+tqdm==4.62.3
diff --git a/examples/pytorch/deepspeed-demo/train_bert_ds.py b/examples/pytorch/deepspeed-demo/train_bert_ds.py