# VIT training example using GKE GCS and Filestore CSI driver
### Summary
This is an adaptation of the HuggingFace image classification training example (https://huggingface.co/docs/transformers/tasks/image_classification) which showcases how to deploy the notebook on GKE and leverage storage such as Filestore to accelerate the training.


### Pre-requisites
1. Quota for GPUs (https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#gpu-quota). The default example deploys one node with an L4 GPU; a sketch for checking the quota from the command line follows this list.
2. A minimum 10 TiB quota (`HighScaleSSDStorageGibPerRegion`) for the Filestore high scale tier (https://cloud.google.com/filestore/docs/service-tiers).
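
The GPU quota can be checked from the command line; the Filestore quota is easiest to verify on the Cloud console quotas page. A minimal sketch, assuming the `us-central1` region used elsewhere in this example:
```
# List compute quotas for the region and pick out the L4 GPU metric
gcloud compute regions describe us-central1 \
  --flatten="quotas[]" \
  --format="table(quotas.metric,quotas.limit,quotas.usage)" | grep NVIDIA_L4
```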

## Prepare GKE Infra
For an easy one-click deploy, a terraform template has been provided which deploys a GKE cluster with a GPU node pool. The template also sets up the [GKE Filestore](https://cloud.google.com/filestore/docs/csi-driver) and [GKE GCS Fuse](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) CSI drivers.

> **NOTE:**
> 1. Update the "project_id" in [variables.tf](tf/platform/variables.tf) to your project ID
```
$ cd tf/platform
$ terraform init
$ terraform apply
```

> **NOTE:** After the terraform apply steps complete successfully, set up the kubeconfig credentials:
```
$ gcloud container clusters get-credentials ml-vit-cluster --zone=us-central1-c
```
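
To confirm the cluster and GPU node pool came up as expected, a quick check (a sketch; the GPU column stays empty until the NVIDIA device plugin has registered the L4 device):
```
kubectl get nodes
# Show allocatable GPUs per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```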

## Prepare GCS specific infra
This step is needed only to deploy the Jupyter [pod spec](yamls/spec-gcs.yaml), which mounts GCS buckets using the [GKE GCS Fuse](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) CSI driver. If you want to try only the Filestore-based [example](yamls/spec-filestore.yaml), this section can be skipped.

1. Set up GCS-specific env variables for ease of use
```
GCS_BUCKET_NAME=<your-bucket-name>
GCS_GCP_SA=<your-gcp-sa-name> # the GCP service account used to prepare the Workload Identity bindings (https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) so that the k8s pod can access the bucket
GCSFUSE_KSA=<your-k8s-sa-name> # the k8s service account which binds to the GCP SA
gcloud storage buckets create gs://$GCS_BUCKET_NAME
```

2. Ensure the same names are set for the variables "service_account" (=GCS_GCP_SA), "k8s_service_account" (=GCSFUSE_KSA), and "gcs_bucket" (=GCS_BUCKET_NAME) in [user/variables.tf](tf/user/variables.tf)
```
$ cd tf/user
$ terraform init
$ terraform apply
```
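
Alternatively, the values can be passed on the command line instead of editing the file. A sketch, assuming the variable names listed above:
```
cd tf/user
terraform init
terraform apply \
  -var="service_account=$GCS_GCP_SA" \
  -var="k8s_service_account=$GCSFUSE_KSA" \
  -var="gcs_bucket=$GCS_BUCKET_NAME"
```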

After the terraform apply steps complete successfully, the desired service account bindings are created. The terraform automates the steps documented [here](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#authentication).
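
For reference, a sketch of the manual equivalent of those bindings (assuming `PROJECT_ID` holds your project ID and `example` is the workload namespace created later):
```
PROJECT_ID=$(gcloud config get-value project)
gcloud iam service-accounts create $GCS_GCP_SA
# Let the GCP SA read/write the bucket
gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
  --member "serviceAccount:${GCS_GCP_SA}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role "roles/storage.objectAdmin"
# Allow the k8s SA to impersonate the GCP SA via Workload Identity
gcloud iam service-accounts add-iam-policy-binding \
  ${GCS_GCP_SA}@${PROJECT_ID}.iam.gserviceaccount.com \
  --role "roles/iam.workloadIdentityUser" \
  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[example/${GCSFUSE_KSA}]"
```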

## Deploy workloads

### Deploy Jupyter Pod using GCS Buckets

1. Replace the GCS_BUCKET_NAME and GCSFUSE_KSA placeholders in the [pod spec](yamls/spec-gcs.yaml)
```
sed -i "s/<GCS_BUCKET_NAME>/$GCS_BUCKET_NAME/g" yamls/spec-gcs.yaml
sed -i "s/<GCSFUSE_KSA>/$GCSFUSE_KSA/g" yamls/spec-gcs.yaml
```
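
A quick sanity check that no placeholders remain (a sketch; `grep` exits non-zero when nothing matches):
```
grep -nE '<GCS_BUCKET_NAME>|<GCSFUSE_KSA>' yamls/spec-gcs.yaml \
  && echo "placeholders still present" \
  || echo "all placeholders replaced"
```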

2. Deploy the [podspec](yamls/spec-gcs.yaml)
```
kubectl apply -f yamls/spec-gcs.yaml
```

3. Set the kubectl context to the namespace (this is the namespace created by terraform, based on the "namespace" variable in [user/variables.tf](tf/user/variables.tf))

```
kubectl config set-context --current --namespace example
```
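
To confirm the current context now points at the namespace (sketch):
```
kubectl config view --minify --output 'jsonpath={..namespace}'
```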

4. Verify the Jupyter pod is up and running, and fetch the load balancer IP and the login token
```
$ kubectl get all
NAME               READY   STATUS    RESTARTS   AGE
pod/tensorflow-0   2/2     Running   0          77m

NAME                         TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
service/tensorflow           ClusterIP      None          <none>          8888/TCP       77m
service/tensorflow-jupyter   LoadBalancer   10.8.17.169   35.224.15.129   80:32731/TCP   77m

NAME                          READY   AGE
statefulset.apps/tensorflow   1/1     77m

$ kubectl exec --tty -i tensorflow-0 -c tensorflow-container -n example -- jupyter notebook list
Currently running servers:
http://0.0.0.0:8888/?token=<hash> :: /tf
```

5. In your web browser, open the external IP of the tensorflow-jupyter service and log in to the notebook using the token
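
As a convenience, the URL can also be assembled from the command line. A minimal sketch, assuming the service and pod names shown above:
```
EXTERNAL_IP=$(kubectl get svc tensorflow-jupyter \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
TOKEN=$(kubectl exec tensorflow-0 -c tensorflow-container -- \
  jupyter notebook list | grep -oE 'token=[a-f0-9]+' | head -n1)
echo "http://${EXTERNAL_IP}/?${TOKEN}"
```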


### Deploy Jupyter Pod using Filestore

1. Deploy the [podspec](yamls/spec-filestore.yaml)
```
kubectl apply -f yamls/spec-filestore.yaml
```
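
Dynamic provisioning of the Filestore instance can take several minutes. A sketch for watching progress, assuming the pod spec uses a Filestore-backed PersistentVolumeClaim in the `example` namespace:
```
kubectl get pvc -n example
# Once the PVC is Bound, the backing instance is also visible via gcloud
gcloud filestore instances list
```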

2. Set the kubectl context to the namespace (this is the namespace created by terraform, based on the "namespace" variable in [user/variables.tf](tf/user/variables.tf))

```
kubectl config set-context --current --namespace example
```

3. Verify the Jupyter pod is up and running, and fetch the load balancer IP and the login token
```
$ kubectl get all
NAME               READY   STATUS    RESTARTS   AGE
pod/tensorflow-0   2/2     Running   0          77m

NAME                         TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
service/tensorflow           ClusterIP      None          <none>          8888/TCP       77m
service/tensorflow-jupyter   LoadBalancer   10.8.17.169   35.224.15.129   80:32731/TCP   77m

NAME                          READY   AGE
statefulset.apps/tensorflow   1/1     77m

$ kubectl exec --tty -i tensorflow-0 -c tensorflow-container -n example -- jupyter notebook list
Currently running servers:
http://0.0.0.0:8888/?token=<hash> :: /tf
```

4. In your web browser, open the external IP of the tensorflow-jupyter service and log in to the notebook using the token

### Run the notebook
The notebook that runs the training can be found [here](notebooks/ViTClassfication-v1.ipynb).
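
If the notebook is not already present in the container image, it can be copied into the Jupyter working directory (`/tf`, per the server listing above). A sketch, assuming `tar` is available in the image as `kubectl cp` requires:
```
kubectl cp notebooks/ViTClassfication-v1.ipynb \
  example/tensorflow-0:/tf/ViTClassfication-v1.ipynb -c tensorflow-container
```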

## Teardown

1. Teardown the SA and bindings
```
$ cd tf/user
$ terraform destroy
```

2. Teardown the cluster
```
$ cd tf/platform
$ terraform destroy
```
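
The GCS bucket was created with gcloud rather than terraform, so destroy does not remove it. A cleanup sketch (deletes the bucket and all of its contents):
```
gcloud storage rm --recursive gs://$GCS_BUCKET_NAME
```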