diff --git a/demos/vit-training-example/README.md b/demos/vit-training-example/README.md new file mode 100644 index 0000000000..d2ea853222 --- /dev/null +++ b/demos/vit-training-example/README.md @@ -0,0 +1,140 @@ +### Summary +This is an adaptation of the [HuggingFace image classification training example](https://huggingface.co/docs/transformers/tasks/image_classification). It shows how to deploy the notebook on GKE and use storage such as Filestore to accelerate training. + + +### Pre-requisites +1. Quota for GPUs (https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#gpu-quota). The default example deploys 1 node with an L4 GPU. +2. Minimum 10 TiB quota (`HighScaleSSDStorageGibPerRegion`) for the Filestore high scale tier (https://cloud.google.com/filestore/docs/service-tiers) + +## Prepare GKE Infra + For a one-click deployment, a Terraform template is provided that deploys a GKE cluster with a GPU node pool. The Terraform also enables the [GKE Filestore](https://cloud.google.com/filestore/docs/csi-driver) and [GKE GCS Fuse](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) CSI drivers. + +> **NOTE:** +> 1. Update the "project_id" in [variables.tf](tf/platform/variables.tf) to your project ID + +``` + $ cd tf/platform + $ terraform init + $ terraform apply +``` + + > **NOTE:** After `terraform apply` completes successfully, set up the kubeconfig credentials + `gcloud container clusters get-credentials ml-vit-cluster --zone=us-central1-c` + +## Prepare GCS specific infra + This step is needed only to deploy the Jupyter [pod spec](yamls/spec-gcs.yaml), which mounts GCS buckets using the [GKE GCS Fuse](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) CSI driver. To try only the Filestore-based [example](yamls/spec-filestore.yaml), skip this section. + + 1. 
Set up GCS specific env variables for ease of use + ``` + GCS_BUCKET_NAME= + + GCS_GCP_SA= # this is the GCP service account used to prepare the [WorkloadIdentity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) bindings so that the k8s pod can access the bucket + + GCSFUSE_KSA= # this is the k8s service account which binds to the GCP SA + + gcloud storage buckets create gs://$GCS_BUCKET_NAME + + ``` + + 2. Ensure the same names are set for the variables "service_account" (=GCS_GCP_SA), "k8s_service_account" (=GCSFUSE_KSA), "gcs_bucket" (=GCS_BUCKET_NAME) in [user/variables.tf](tf/user/variables.tf) + ``` + $ cd tf/user + $ terraform init + $ terraform apply + ``` + +After `terraform apply` completes successfully, the desired service account bindings are created. The Terraform automates the steps documented [here](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#authentication) + +## Deploy workloads + +### Deploy Jupyter Pod using GCS Buckets + +1. Replace the GCS_BUCKET_NAME and GCSFUSE_KSA variables +``` +sed -i "s//$GCS_BUCKET_NAME/g" yamls/spec-gcs.yaml +sed -i "s//$GCSFUSE_KSA/g" yamls/spec-gcs.yaml +``` + +2. Deploy the [podspec](yamls/spec-gcs.yaml) +``` +kubectl apply -f yamls/spec-gcs.yaml +``` + +3. Set up the context for the namespace (this is the namespace created by Terraform based on the "namespace" variable in [user/variables.tf](tf/user/variables.tf)) + +``` +kubectl config set-context --current --namespace example +``` + +4. 
Verify that the Jupyter pod is up and running, then fetch the load balancer IP and the login token +``` +$ kubectl get all +NAME READY STATUS RESTARTS AGE +pod/tensorflow-0 2/2 Running 0 77m + +NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE +service/tensorflow ClusterIP None 8888/TCP 77m +service/tensorflow-jupyter LoadBalancer 10.8.17.169 35.224.15.129 80:32731/TCP 77m + +NAME READY AGE +statefulset.apps/tensorflow 1/1 77m + + +$ kubectl exec --tty -i tensorflow-0 -c tensorflow-container -n example -- jupyter notebook list +Currently running servers: +http://0.0.0.0:8888/?token= :: /tf +``` + +5. In your web browser, open the external IP of the tensorflow-jupyter service and log in to the notebook using the token + + +### Deploy Jupyter Pod using Filestore + +1. Deploy the [podspec](yamls/spec-filestore.yaml) +``` +kubectl apply -f yamls/spec-filestore.yaml +``` + +2. Set up the context for the namespace (this is the namespace created by Terraform based on the "namespace" variable in [user/variables.tf](tf/user/variables.tf)) + +``` +kubectl config set-context --current --namespace example +``` + +3. Verify that the Jupyter pod is up and running, then fetch the load balancer IP and the login token +``` +$ kubectl get all +NAME READY STATUS RESTARTS AGE +pod/tensorflow-0 2/2 Running 0 77m + +NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE +service/tensorflow ClusterIP None 8888/TCP 77m +service/tensorflow-jupyter LoadBalancer 10.8.17.169 35.224.15.129 80:32731/TCP 77m + +NAME READY AGE +statefulset.apps/tensorflow 1/1 77m + + +$ kubectl exec --tty -i tensorflow-0 -c tensorflow-container -n example -- jupyter notebook list +Currently running servers: +http://0.0.0.0:8888/?token= :: /tf +``` + +4. In your web browser, open the external IP of the tensorflow-jupyter service and log in to the notebook using the token + +### Run the notebook +The notebook that runs the training can be found [here](notebooks/ViTClassfication-v1.ipynb) + +## Teardown + +1. 
Tear down the service accounts and bindings +``` +$ cd tf/user +$ terraform destroy +``` + +2. Tear down the cluster +``` +$ cd tf/platform +$ terraform destroy +``` \ No newline at end of file diff --git a/demos/vit-training-example/notebooks/ViTClassfication-v1.ipynb b/demos/vit-training-example/notebooks/ViTClassfication-v1.ipynb new file mode 100644 index 0000000000..b5bb201ca6 --- /dev/null +++ b/demos/vit-training-example/notebooks/ViTClassfication-v1.ipynb @@ -0,0 +1,300 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "8d1b5b77", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install transformers datasets evaluate\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10e89d62", + "metadata": {}, + "outputs": [], + "source": [ + "from huggingface_hub import notebook_login\n", + "\n", + "notebook_login()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c0e38bb", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "CACHE_DIR = os.getenv('CACHE_DIR')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ce6603f", + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "food = load_dataset(\"food101\", split=\"train\", cache_dir=CACHE_DIR)\n", + "food" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2936aa74", + "metadata": {}, + "outputs": [], + "source": [ + "food = food.train_test_split(test_size=0.001)\n", + "food" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b89161d0", + "metadata": {}, + "outputs": [], + "source": [ + "labels = food[\"train\"].features[\"label\"].names\n", + "label2id, id2label = dict(), dict()\n", + "for i, label in enumerate(labels):\n", + " label2id[label] = str(i)\n", + " id2label[str(i)] = label" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3699b52", + "metadata": {}, + "outputs": [], + "source": [ + 
"id2label[str(79)]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b6793fde", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoImageProcessor\n", + "\n", + "checkpoint = \"google/vit-base-patch16-224-in21k\"\n", + "image_processor = AutoImageProcessor.from_pretrained(checkpoint)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3318524", + "metadata": {}, + "outputs": [], + "source": [ + "from tensorflow import keras\n", + "from tensorflow.keras import layers\n", + "\n", + "size = (image_processor.size[\"height\"], image_processor.size[\"width\"])\n", + "\n", + "train_data_augmentation = keras.Sequential(\n", + " [\n", + " layers.RandomCrop(size[0], size[1]),\n", + " layers.Rescaling(scale=1.0 / 127.5, offset=-1),\n", + " layers.RandomFlip(\"horizontal\"),\n", + " layers.RandomRotation(factor=0.02),\n", + " layers.RandomZoom(height_factor=0.2, width_factor=0.2),\n", + " ],\n", + " name=\"train_data_augmentation\",\n", + ")\n", + "\n", + "val_data_augmentation = keras.Sequential(\n", + " [\n", + " layers.CenterCrop(size[0], size[1]),\n", + " layers.Rescaling(scale=1.0 / 127.5, offset=-1),\n", + " ],\n", + " name=\"val_data_augmentation\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20f22425", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import tensorflow as tf\n", + "from PIL import Image\n", + "\n", + "\n", + "def convert_to_tf_tensor(image: Image):\n", + " np_image = np.array(image)\n", + " tf_image = tf.convert_to_tensor(np_image)\n", + " # `expand_dims()` is used to add a batch dimension since\n", + " # the TF augmentation layers operates on batched inputs.\n", + " return tf.expand_dims(tf_image, 0)\n", + "\n", + "\n", + "def preprocess_train(example_batch):\n", + " \"\"\"Apply train_transforms across a batch.\"\"\"\n", + " images = [\n", + " train_data_augmentation(convert_to_tf_tensor(image.convert(\"RGB\"))) 
for image in example_batch[\"image\"]\n", + " ]\n", + " example_batch[\"pixel_values\"] = [tf.transpose(tf.squeeze(image)) for image in images]\n", + " return example_batch\n", + "\n", + "\n", + "def preprocess_val(example_batch):\n", + " \"\"\"Apply val_transforms across a batch.\"\"\"\n", + " images = [\n", + " val_data_augmentation(convert_to_tf_tensor(image.convert(\"RGB\"))) for image in example_batch[\"image\"]\n", + " ]\n", + " example_batch[\"pixel_values\"] = [tf.transpose(tf.squeeze(image)) for image in images]\n", + " return example_batch" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e66591d", + "metadata": {}, + "outputs": [], + "source": [ + "food[\"train\"].set_transform(preprocess_train)\n", + "food[\"test\"].set_transform(preprocess_val)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89843967", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import DefaultDataCollator\n", + "\n", + "data_collator = DefaultDataCollator(return_tensors=\"tf\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6fa86c2c", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import create_optimizer\n", + "\n", + "batch_size = 16\n", + "num_epochs = 2\n", + "num_train_steps = len(food[\"train\"]) * num_epochs\n", + "learning_rate = 3e-5\n", + "weight_decay_rate = 0.01\n", + "\n", + "optimizer, lr_schedule = create_optimizer(\n", + " init_lr=learning_rate,\n", + " num_train_steps=num_train_steps,\n", + " weight_decay_rate=weight_decay_rate,\n", + " num_warmup_steps=0,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a1eb2cc4", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import TFAutoModelForImageClassification\n", + "\n", + "model = TFAutoModelForImageClassification.from_pretrained(\n", + " checkpoint,\n", + " id2label=id2label,\n", + " label2id=label2id,\n", + ")" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "id": "4ab33bfb", + "metadata": {}, + "outputs": [], + "source": [ + "# converting our train dataset to tf.data.Dataset\n", + "tf_train_dataset = food[\"train\"].to_tf_dataset(\n", + " columns=\"pixel_values\", label_cols=\"label\", shuffle=True, batch_size=batch_size, collate_fn=data_collator\n", + ")\n", + "\n", + "# converting our test dataset to tf.data.Dataset\n", + "tf_eval_dataset = food[\"test\"].to_tf_dataset(\n", + " columns=\"pixel_values\", label_cols=\"label\", shuffle=True, batch_size=batch_size, collate_fn=data_collator\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6594267f", + "metadata": {}, + "outputs": [], + "source": [ + "from tensorflow.keras.losses import SparseCategoricalCrossentropy\n", + "\n", + "loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n", + "model.compile(optimizer=optimizer, loss=loss)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "291534b2", + "metadata": {}, + "outputs": [], + "source": [ + "model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a570d575", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demos/vit-training-example/tf/platform/main.tf b/demos/vit-training-example/tf/platform/main.tf new file mode 100644 index 0000000000..4304fa9dae --- /dev/null +++ b/demos/vit-training-example/tf/platform/main.tf @@ -0,0 +1,72 @@ +# Copyright 2024 Google LLC +# +# Licensed under 
the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +data "google_client_config" "provider" {} + +data "google_container_cluster" "ml_cluster" { + name = var.cluster_name + location = var.zone + depends_on = [module.gke_standard] +} + +provider "google" { + project = var.project_id + region = var.region + zone = var.zone +} + +provider "kubernetes" { + host = data.google_container_cluster.ml_cluster.endpoint + token = data.google_client_config.provider.access_token + cluster_ca_certificate = base64decode( + data.google_container_cluster.ml_cluster.master_auth[0].cluster_ca_certificate + ) +} + +provider "kubectl" { + host = data.google_container_cluster.ml_cluster.endpoint + token = data.google_client_config.provider.access_token + cluster_ca_certificate = base64decode( + data.google_container_cluster.ml_cluster.master_auth[0].cluster_ca_certificate + ) +} + +provider "helm" { + kubernetes { + ##config_path = pathexpand("~/.kube/config") + host = data.google_container_cluster.ml_cluster.endpoint + token = data.google_client_config.provider.access_token + cluster_ca_certificate = base64decode( + data.google_container_cluster.ml_cluster.master_auth[0].cluster_ca_certificate + ) + } +} + +module "gke_standard" { + source = "./modules/gke_standard" + + project_id = var.project_id + region = var.region + zone = var.zone + num_gpu_nodes = var.num_gpu_nodes_in_cluster + cluster_name = var.cluster_name +} + +module "kubernetes" { + source = "./modules/kubernetes" + + depends_on = 
[module.gke_standard] + region = var.region + cluster_name = var.cluster_name +} diff --git a/demos/vit-training-example/tf/platform/modules/gke_standard/main.tf b/demos/vit-training-example/tf/platform/modules/gke_standard/main.tf new file mode 100644 index 0000000000..33ae191f4b --- /dev/null +++ b/demos/vit-training-example/tf/platform/modules/gke_standard/main.tf @@ -0,0 +1,88 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +provider "google" { + project = var.project_id + region = var.region + zone = var.zone +} + + +# GKE cluster +resource "google_container_cluster" "ml_cluster" { + name = var.cluster_name + location = var.zone + min_master_version = 1.27 + count = 1 + remove_default_node_pool = true + initial_node_count = 1 + + logging_config { + enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"] + } + + workload_identity_config { + workload_pool = "${var.project_id}.svc.id.goog" + } + + addons_config { + gcs_fuse_csi_driver_config { + enabled = true + } + gcp_filestore_csi_driver_config { + enabled = true + } + } + deletion_protection = false +} + +resource "google_container_node_pool" "gpu_pool" { + name = "gpu-pool" + location = var.zone + cluster = google_container_cluster.ml_cluster[0].name + node_count = var.num_gpu_nodes + + management { + auto_repair = "true" + auto_upgrade = "true" + } + + node_config { + oauth_scopes = [ + "https://www.googleapis.com/auth/logging.write", + 
"https://www.googleapis.com/auth/monitoring", + "https://www.googleapis.com/auth/devstorage.read_only", + "https://www.googleapis.com/auth/trace.append", + "https://www.googleapis.com/auth/service.management.readonly", + "https://www.googleapis.com/auth/servicecontrol", + ] + + image_type = "cos_containerd" + machine_type = "g2-standard-16" + tags = ["gke-node", "${var.project_id}-gke"] + + disk_size_gb = "100" + disk_type = "pd-ssd" + metadata = { + disable-legacy-endpoints = "true" + } + workload_metadata_config { + mode = "GKE_METADATA" + } + guest_accelerator { + type = "nvidia-l4" + count = 1 + } + } +} diff --git a/demos/vit-training-example/tf/platform/modules/gke_standard/output.tf b/demos/vit-training-example/tf/platform/modules/gke_standard/output.tf new file mode 100644 index 0000000000..4c8fb02582 --- /dev/null +++ b/demos/vit-training-example/tf/platform/modules/gke_standard/output.tf @@ -0,0 +1,39 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +output "project_id" { + description = "GCP project id" + value = resource.google_container_cluster.ml_cluster[0].project +} + +output "region" { + description = "GCP region" + value = resource.google_container_cluster.ml_cluster[0].location +} + +output "cluster_name" { + description = "The name of the GKE cluster" + value = resource.google_container_cluster.ml_cluster[0].name +} + +output "kubernetes_host" { + description = "Kubernetes cluster host" + value = resource.google_container_cluster.ml_cluster[0].endpoint +} + +output "cluster_certicicate" { + description = "Kubernetes cluster ca certificate" + value = base64decode(resource.google_container_cluster.ml_cluster[0].master_auth[0].cluster_ca_certificate) + sensitive = true +} diff --git a/demos/vit-training-example/tf/platform/modules/gke_standard/variables.tf b/demos/vit-training-example/tf/platform/modules/gke_standard/variables.tf new file mode 100644 index 0000000000..28bfce8da6 --- /dev/null +++ b/demos/vit-training-example/tf/platform/modules/gke_standard/variables.tf @@ -0,0 +1,47 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +variable "project_id" { + type = string + description = "GCP project id" +} + +variable "region" { + type = string + description = "GCP project region or zone" + default = "us-central1" +} + +variable "zone" { + type = string + description = "GCP project region or zone" + default = "us-central1-c" +} + +variable "cluster_name" { + type = string + description = "GKE cluster name" + default = "ml-cluster" +} + +variable "namespace" { + type = string + description = "Kubernetes namespace where resources are deployed" + default = "example" +} + +variable "num_gpu_nodes" { + description = "Number of GPU nodes in the cluster" + default = 1 +} \ No newline at end of file diff --git a/demos/vit-training-example/tf/platform/modules/gke_standard/versions.tf b/demos/vit-training-example/tf/platform/modules/gke_standard/versions.tf new file mode 100644 index 0000000000..ce2bfd7ee2 --- /dev/null +++ b/demos/vit-training-example/tf/platform/modules/gke_standard/versions.tf @@ -0,0 +1,18 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +terraform { + required_providers { + } +} diff --git a/demos/vit-training-example/tf/platform/modules/kubernetes/kubernetes.tf b/demos/vit-training-example/tf/platform/modules/kubernetes/kubernetes.tf new file mode 100644 index 0000000000..b7c8248aa8 --- /dev/null +++ b/demos/vit-training-example/tf/platform/modules/kubernetes/kubernetes.tf @@ -0,0 +1,22 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +data "http" "nvidia_driver_installer_manifest" { + url = "https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml" +} + +resource "kubectl_manifest" "nvidia_driver_installer" { + yaml_body = data.http.nvidia_driver_installer_manifest.response_body + count = 1 +} diff --git a/demos/vit-training-example/tf/platform/modules/kubernetes/variables.tf b/demos/vit-training-example/tf/platform/modules/kubernetes/variables.tf new file mode 100644 index 0000000000..c05d3709dd --- /dev/null +++ b/demos/vit-training-example/tf/platform/modules/kubernetes/variables.tf @@ -0,0 +1,31 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +variable "region" { + type = string + description = "GCP project region or zone" + default = "us-central1" +} + +variable "cluster_name" { + type = string + description = "Kubernetes cluster name" + default = "ml-cluster" +} + +variable "namespace" { + type = string + description = "Kubernetes namespace where resources are deployed" + default = "example" +} \ No newline at end of file diff --git a/demos/vit-training-example/tf/platform/modules/kubernetes/versions.tf b/demos/vit-training-example/tf/platform/modules/kubernetes/versions.tf new file mode 100644 index 0000000000..7940faa054 --- /dev/null +++ b/demos/vit-training-example/tf/platform/modules/kubernetes/versions.tf @@ -0,0 +1,30 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +terraform { + required_providers { + helm = { + source = "hashicorp/helm" + version = "~> 2.8.0" + } + kubernetes = { + source = "hashicorp/kubernetes" + version = "2.18.1" + } + kubectl = { + source = "alekc/kubectl" + version = "2.0.1" + } + } +} diff --git a/demos/vit-training-example/tf/platform/variables.tf b/demos/vit-training-example/tf/platform/variables.tf new file mode 100644 index 0000000000..e65e1c44f8 --- /dev/null +++ b/demos/vit-training-example/tf/platform/variables.tf @@ -0,0 +1,42 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +variable "project_id" { + type = string + description = "GCP project id" + default = "" +} + +variable "region" { + type = string + description = "GCP project region or zone" + default = "us-central1" +} + +variable "zone" { + type = string + description = "GCP project region or zone" + default = "us-central1-c" +} + +variable "cluster_name" { + type = string + description = "GKE cluster name" + default = "ml-vit-cluster" +} + +variable "num_gpu_nodes_in_cluster" { + description = "Number of GPU nodes in the cluster" + default = 1 +} diff --git a/demos/vit-training-example/tf/platform/versions.tf b/demos/vit-training-example/tf/platform/versions.tf new file mode 100644 index 0000000000..2877998169 --- /dev/null +++ b/demos/vit-training-example/tf/platform/versions.tf @@ -0,0 +1,33 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +terraform { + required_providers { + google = { + source = "hashicorp/google" + } + helm = { + source = "hashicorp/helm" + version = "~> 2.8.0" + } + kubernetes = { + source = "hashicorp/kubernetes" + version = "2.18.1" + } + kubectl = { + source = "alekc/kubectl" + version = "2.0.1" + } + } +} diff --git a/demos/vit-training-example/tf/user/main.tf b/demos/vit-training-example/tf/user/main.tf new file mode 100644 index 0000000000..0d94a24090 --- /dev/null +++ b/demos/vit-training-example/tf/user/main.tf @@ -0,0 +1,45 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +data "google_client_config" "provider" {} + +provider "kubernetes" { + config_path = pathexpand("~/.kube/config") +} + +provider "kubectl" { + config_path = pathexpand("~/.kube/config") +} + +provider "helm" { + kubernetes { + config_path = pathexpand("~/.kube/config") + } +} + +module "kubernetes" { + source = "./modules/kubernetes" + + namespace = var.namespace +} + +module "service_accounts" { + source = "./modules/service_accounts" + + depends_on = [module.kubernetes] + project_id = var.project_id + namespace = var.namespace + service_account = var.service_account + gcs_bucket = var.gcs_bucket +} \ No newline at end of file diff --git a/demos/vit-training-example/tf/user/modules/kubernetes/kubernetes.tf b/demos/vit-training-example/tf/user/modules/kubernetes/kubernetes.tf new file mode 100644 index 0000000000..472e80d418 --- /dev/null +++ b/demos/vit-training-example/tf/user/modules/kubernetes/kubernetes.tf @@ -0,0 +1,20 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +resource "kubernetes_namespace" "ml" { + metadata { + name = var.namespace + } +} + diff --git a/demos/vit-training-example/tf/user/modules/kubernetes/variables.tf b/demos/vit-training-example/tf/user/modules/kubernetes/variables.tf new file mode 100644 index 0000000000..991e581923 --- /dev/null +++ b/demos/vit-training-example/tf/user/modules/kubernetes/variables.tf @@ -0,0 +1,19 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +variable "namespace" { + type = string + description = "Kubernetes namespace where resources are deployed" + default = "example" +} \ No newline at end of file diff --git a/demos/vit-training-example/tf/user/modules/kubernetes/versions.tf b/demos/vit-training-example/tf/user/modules/kubernetes/versions.tf new file mode 100644 index 0000000000..7940faa054 --- /dev/null +++ b/demos/vit-training-example/tf/user/modules/kubernetes/versions.tf @@ -0,0 +1,30 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + required_providers { + helm = { + source = "hashicorp/helm" + version = "~> 2.8.0" + } + kubernetes = { + source = "hashicorp/kubernetes" + version = "2.18.1" + } + kubectl = { + source = "alekc/kubectl" + version = "2.0.1" + } + } +} diff --git a/demos/vit-training-example/tf/user/modules/service_accounts/service_accounts.tf b/demos/vit-training-example/tf/user/modules/service_accounts/service_accounts.tf new file mode 100644 index 0000000000..0cab8d06af --- /dev/null +++ b/demos/vit-training-example/tf/user/modules/service_accounts/service_accounts.tf @@ -0,0 +1,61 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +resource "google_service_account" "sa" { + project = var.project_id + account_id = var.service_account + display_name = "GCP SA for GCSFuseCSI" +} + +resource "google_service_account_iam_binding" "workload-identity-user" { + service_account_id = google_service_account.sa.name + role = "roles/iam.workloadIdentityUser" + + members = [ + "serviceAccount:${var.project_id}.svc.id.goog[${var.namespace}/${var.k8s_service_account}]", + ] + depends_on = [ google_service_account.sa ] +} + +resource "google_storage_bucket_iam_binding" "gcs-bucket-iam" { + bucket = var.gcs_bucket + role = "roles/storage.objectAdmin" + members = [ + "serviceAccount:${google_service_account.sa.account_id}@${var.project_id}.iam.gserviceaccount.com", + ] +} + +resource "kubernetes_service_account" "ksa" { + metadata { + name = var.k8s_service_account + namespace = var.namespace + } + automount_service_account_token = true +} + +resource "kubernetes_annotations" "ksa-annotations" { + api_version = "v1" + kind = "ServiceAccount" + metadata { + name = var.k8s_service_account + namespace = var.namespace + } + annotations = { + "iam.gke.io/gcp-service-account" = "${google_service_account.sa.account_id}@${var.project_id}.iam.gserviceaccount.com" + } + depends_on = [ + kubernetes_service_account.ksa, + google_service_account.sa + ] +} diff --git a/demos/vit-training-example/tf/user/modules/service_accounts/variables.tf b/demos/vit-training-example/tf/user/modules/service_accounts/variables.tf new file mode 100644 index 0000000000..7a77dbfe28 --- /dev/null +++ b/demos/vit-training-example/tf/user/modules/service_accounts/variables.tf @@ -0,0 +1,42 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +variable "project_id" { + type = string + description = "GCP project id" +} + +variable "namespace" { + type = string + description = "Kubernetes namespace where resources are deployed" + default = "example" +} + +variable "service_account" { + type = string + description = "Google Cloud IAM service account for authenticating with GCP services" + default = "gcsfuse-sa" +} + +variable "k8s_service_account" { + type = string + description = "k8s service account" + default = "gcsfuse-ksa" +} + +variable "gcs_bucket" { + type = string + description = "GCS Bucket name" + default = "test-gcsfuse" +} \ No newline at end of file diff --git a/demos/vit-training-example/tf/user/modules/service_accounts/versions.tf b/demos/vit-training-example/tf/user/modules/service_accounts/versions.tf new file mode 100644 index 0000000000..9f32b1d4d6 --- /dev/null +++ b/demos/vit-training-example/tf/user/modules/service_accounts/versions.tf @@ -0,0 +1,26 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +terraform { + required_providers { + google = { + source = "hashicorp/google" + version = "4.56.0" + } + kubernetes = { + source = "hashicorp/kubernetes" + version = "2.18.1" + } + } +} diff --git a/demos/vit-training-example/tf/user/variables.tf b/demos/vit-training-example/tf/user/variables.tf new file mode 100644 index 0000000000..6f7ef1546d --- /dev/null +++ b/demos/vit-training-example/tf/user/variables.tf @@ -0,0 +1,43 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +variable "namespace" { + type = string + description = "Kubernetes namespace where resources are deployed" + default = "example" +} + +variable "project_id" { + type = string + description = "GCP project id" + default = "" +} + +variable "service_account" { + type = string + description = "Google Cloud IAM service account for authenticating with GCP services" + default = "gcsfuse-sa" +} + +variable "k8s_service_account" { + type = string + description = "k8s service account" + default = "gcsfuse-ksa" +} + +variable "gcs_bucket" { + type = string + description = "GCS Bucket name" + default = "" +} diff --git a/demos/vit-training-example/tf/user/versions.tf b/demos/vit-training-example/tf/user/versions.tf new file mode 100644 index 0000000000..2877998169 --- /dev/null +++ b/demos/vit-training-example/tf/user/versions.tf @@ -0,0 +1,33 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +terraform { + required_providers { + google = { + source = "hashicorp/google" + } + helm = { + source = "hashicorp/helm" + version = "~> 2.8.0" + } + kubernetes = { + source = "hashicorp/kubernetes" + version = "2.18.1" + } + kubectl = { + source = "alekc/kubectl" + version = "2.0.1" + } + } +} diff --git a/demos/vit-training-example/yamls/spec-filestore.yaml b/demos/vit-training-example/yamls/spec-filestore.yaml new file mode 100644 index 0000000000..16ccf810f8 --- /dev/null +++ b/demos/vit-training-example/yamls/spec-filestore.yaml @@ -0,0 +1,97 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: tensorflow + namespace: example +spec: + selector: + matchLabels: + pod: tensorflow-pod + serviceName: tensorflow + replicas: 1 + template: + metadata: + labels: + pod: tensorflow-pod + spec: + nodeSelector: + cloud.google.com/gke-accelerator: nvidia-l4 + terminationGracePeriodSeconds: 30 + containers: + - name: tensorflow-container + image: tensorflow/tensorflow:2.13.0-gpu-jupyter + env: + - name: CACHE_DIR + value: "/tf/cache_dir" + volumeMounts: + - name: tensorflow-pvc + mountPath: /tf/cache_dir + resources: + limits: + nvidia.com/gpu: "1" + memory: 30Gi + requests: + nvidia.com/gpu: "1" + memory: 30Gi + volumes: + - name: tensorflow-pvc + persistentVolumeClaim: + claimName: tensorflow-pvc +## Optional: override and set your own token +# env: +# - name: JUPYTER_TOKEN +# value: "jupyter" +--- +kind: PersistentVolumeClaim +apiVersion: v1 +metadata: + name: tensorflow-pvc + namespace: example +spec: + accessModes: + - ReadWriteMany + storageClassName: zonal-rwx + resources: + requests: + storage: 10Ti +--- +# Headless service for the above StatefulSet +apiVersion: v1 +kind: Service +metadata: + name: tensorflow + namespace: example +spec: + ports: + - port: 8888 + clusterIP: None + selector: + pod: tensorflow-pod +--- +# External service +apiVersion: "v1" +kind: "Service" +metadata: + name: tensorflow-jupyter + namespace: example +spec: + ports: + - protocol: "TCP" + port: 80 + targetPort: 8888 + selector: + pod: tensorflow-pod + type: LoadBalancer diff --git a/demos/vit-training-example/yamls/spec-gcs.yaml b/demos/vit-training-example/yamls/spec-gcs.yaml new file mode 100644 index 0000000000..73991a1434 --- /dev/null +++ b/demos/vit-training-example/yamls/spec-gcs.yaml @@ -0,0 +1,94 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: tensorflow + namespace: example +spec: + selector: + matchLabels: + pod: tensorflow-pod + serviceName: tensorflow + replicas: 1 + template: + metadata: + annotations: + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: 500m + gke-gcsfuse/memory-limit: 10Gi + gke-gcsfuse/ephemeral-storage-limit: 10Gi + labels: + pod: tensorflow-pod + spec: + serviceAccountName: + nodeSelector: + cloud.google.com/gke-accelerator: nvidia-l4 + terminationGracePeriodSeconds: 30 + containers: + - name: tensorflow-container + env: + - name: CACHE_DIR + value: "/tf/gcsbucket" + securityContext: + privileged: true + image: tensorflow/tensorflow:2.13.0-gpu-jupyter + volumeMounts: + - name: tensorflow-pvc + mountPath: /tf/gcsbucket + resources: + limits: + nvidia.com/gpu: "1" + memory: 30Gi + requests: + nvidia.com/gpu: "1" + memory: 30Gi + volumes: + - name: tensorflow-pvc + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: +## Optional: override and set your own token +# env: +# - name: JUPYTER_TOKEN +# value: "jupyter" +--- +# Headless service for the above StatefulSet +apiVersion: v1 +kind: Service +metadata: + name: tensorflow + namespace: example +spec: + ports: + - port: 8888 + clusterIP: None + selector: + pod: tensorflow-pod +--- +# External service +apiVersion: "v1" +kind: "Service" +metadata: + name: tensorflow-jupyter + namespace: example +spec: + ports: + - protocol: "TCP" + port: 80 + targetPort: 8888 + selector: + pod: tensorflow-pod + type: LoadBalancer