From Cloud Shell, set the default compute zone and create a GKE cluster:
gcloud config set compute/zone us-central1-f
PROJECT_ID=$(gcloud config get-value project)
CLUSTER_NAME=cluster-1
gcloud container clusters create $CLUSTER_NAME \
--project=$PROJECT_ID \
--release-channel=stable \
--machine-type=n1-standard-4 \
--scopes compute-rw,gke-default,storage-rw \
--num-nodes=3
gcloud container clusters get-credentials $CLUSTER_NAME
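Optionally, confirm that the cluster's nodes are up and in the Ready state before proceeding.
kubectl get nodes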
TFJob is a component of Kubeflow. It is usually deployed as part of a full Kubeflow installation but can also be used in a standalone configuration. In this lab, you will install TFJob as a standalone component.
TFJob consists of two parts: a Kubernetes custom resource and an operator that implements the job management logic. The Kubernetes manifests for both the custom resource definition and the operator are maintained in the Kubeflow GitHub repository.
Instead of cloning the whole repository, you will retrieve only the TFJob manifests using kpt, an open-source tool that is pre-installed in Cloud Shell.
cd
SRC_REPO=https://github.com/kubeflow/manifests
kpt pkg get $SRC_REPO/tf-training@v1.1.0 tf-training
kubectl create namespace kubeflow
kubectl apply --kustomize tf-training/tf-job-crds/base
kubectl apply --kustomize tf-training/tf-job-operator/base
Verify the installation:
kubectl get deployments -n kubeflow
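You can also confirm that the TFJob custom resource definition was registered; the CRD name below assumes the standard Kubeflow naming used by these manifests.
kubectl get crd tfjobs.kubeflow.org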
Your distributed training environment is ready and you can now prepare and submit distributed training jobs.
The TensorFlow training code and the TFJob manifest template used in the lab can be retrieved from GitHub.
cd
SRC_REPO=https://github.com/GoogleCloudPlatform/mlops-on-gcp
kpt pkg get $SRC_REPO/workshops/mlep-qwiklabs/distributed-training-gke lab-files
cd lab-files
The training module is in the mnist folder. The model.py file contains a function that creates a simple convolutional network. The main.py file contains data preprocessing routines and a distributed training loop. Review the files. Notice how you can use a tf.distribute.experimental.MultiWorkerMirroredStrategy() object to retrieve information about the topology of the distributed cluster running a job.
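To make the multi-worker pattern concrete, the snippet below is a minimal sketch of how such a training script typically wires things together. It is illustrative only; the actual code in main.py may be organized differently, and the model layers and hyperparameter values shown here are assumptions.
# Illustrative sketch of multi-worker training; the real main.py may differ.
import json
import os
import tensorflow as tf

# TFJob injects the cluster topology into each worker pod through TF_CONFIG.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
num_workers = len(tf_config.get('cluster', {}).get('worker', ['localhost']))

# The strategy reads TF_CONFIG to coordinate synchronous training across workers.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Scale the global batch size with the number of workers and create the model
# inside the strategy scope so its variables are mirrored on every replica.
per_worker_batch = 64
global_batch_size = per_worker_batch * num_workers
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'])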
Before submitting the job, the training code must be packaged in a Docker image and pushed to your project's Container Registry. You can find the Dockerfile that creates the image in the lab-files folder. You do not need to modify the Dockerfile.
To build the image and push it to Container Registry, execute the following commands:
IMAGE_NAME=mnist-train
docker build -t gcr.io/${PROJECT_ID}/${IMAGE_NAME} .
docker push gcr.io/${PROJECT_ID}/${IMAGE_NAME}
The TFJob manifest template, tfjob.yaml, in the lab-files folder looks like the following sample:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: mnist
            args:
            - --epochs=5
            - --steps_per_epoch=100
            - --per_worker_batch=64
            - --saved_model_path=gs://bucket/saved_model_dir
            - --checkpoint_path=gs://bucket/checkpoints
As noted in the lab overview, you have a lot of flexibility in defining the job's process topology and allocating hardware resources. Refer to the TFJob guide for more information.
The key field in the TFJob manifest is tfReplicaSpecs, which defines the number and the types of replicas (pods) created by a job. In our case, the job will start 3 workers using the container image defined in the image field and the command line arguments defined in the args field.
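For example, a worker's container spec can request specific hardware through the standard Kubernetes resources field. The fragment below is an illustrative sketch only and is not required for this lab; the resource values are assumptions, not values taken from the lab files.
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/<YOUR_PROJECT_ID>/mnist-train
            # Hypothetical resource limits; adjust to your node machine type.
            resources:
              limits:
                cpu: "2"
                memory: 4Gi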
Before submitting a job, you need to update the image and args fields with values that reflect your environment.
Use your preferred command line editor or the Cloud Shell Editor to update the image field with the full name of the image you created and pushed to your Container Registry in the previous step. You can retrieve the image name using the following command:
gcloud container images list
The image name should have the following format: gcr.io/<YOUR_PROJECT_ID>/mnist-train
Update the manifest with your project ID, as in the example below:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/qwiklabs-gcp-01-93af833e6576/mnist-train
            args:
            - --epochs=5
            - --steps_per_epoch=100
            - --per_worker_batch=64
            - --saved_model_path=gs://qwiklabs-gcp-01-93af833e6576-bucket/saved_model_dir
            - --checkpoint_path=gs://qwiklabs-gcp-01-93af833e6576-bucket/checkpoints
Submit the job:
kubectl apply -f tfjob.yaml
Recall that the job name was specified in the job manifest.
To retrieve the logs generated by the training code, you can use the kubectl logs command. Start by listing all the pods created by the job and checking their status:
kubectl get pods
To check the overall status of the job, describe the TFJob resource:
JOB_NAME=multi-worker
kubectl describe tfjob $JOB_NAME
Check the logs of the first worker (drop the --follow flag if you do not want to stream them):
kubectl logs --follow ${JOB_NAME}-worker-0
Clean up by deleting the TFJob, which also removes the pods it created:
kubectl delete tfjob $JOB_NAME
As described in the lab overview, the distributed training script stores training checkpoints and the trained model (in the SavedModel format) in the storage location passed as one of the script's arguments. You will use a Cloud Storage bucket as shared persistent storage.
Since storage buckets are a global resource in Google Cloud, you have to use a unique bucket name. For the purposes of this lab, you can use your project ID as a name prefix.
export TFJOB_BUCKET=${PROJECT_ID}-bucket
gsutil mb gs://${TFJOB_BUCKET}
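After the training job completes, you can optionally confirm that the checkpoints and the exported SavedModel were written to the bucket; the paths below match the arguments used in the manifest above.
gsutil ls gs://${TFJOB_BUCKET}/checkpoints
gsutil ls gs://${TFJOB_BUCKET}/saved_model_dir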