
Commit 635e145

rongou authored and k8s-ci-robot committed
update dockerfile and examples to v1alpha2 (#130)
1 parent 6f627a8 commit 635e145

13 files changed: +174, -89 lines changed


Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
-FROM golang:1.12.6-alpine3.10 AS build
+FROM golang:1.12.7-alpine3.10 AS build
 
 WORKDIR /go/src/github.com/kubeflow/mpi-operator/
 COPY . /go/src/github.com/kubeflow/mpi-operator/
-RUN go build -o /bin/mpi-operator github.com/kubeflow/mpi-operator/cmd/mpi-operator.v1alpha1
+RUN go build -o /bin/mpi-operator github.com/kubeflow/mpi-operator/cmd/mpi-operator.v1alpha2
 
 FROM alpine:3.10
 COPY --from=build /bin/mpi-operator /bin/mpi-operator
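Building the operator image from this Dockerfile is a plain two-stage `docker build`; a minimal sketch, assuming the `mpioperator/mpi-operator:latest` tag used by the deploy manifests below (the tag itself is not dictated by this commit):

```shell
# From the repository root: stage one compiles the v1alpha2 binary with Go
# 1.12.7, stage two copies it into a small alpine:3.10 runtime image.
docker build -t mpioperator/mpi-operator:latest .
```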

README.md

Lines changed: 100 additions & 42 deletions
@@ -39,66 +39,127 @@ ks apply ${ENVIRONMENT} -c mpi-operator
 Alternatively, you can deploy the operator with default settings without using ksonnet by running the following from the repo:
 
 ```shell
+kubectl create -f deploy/crd/crd-v1alpha2.yaml
 kubectl create -f deploy/
 ```
 
 ## Creating an MPI Job
 
-You can create an MPI job by defining an `MPIJob` config file. See [Tensorflow benchmark example](https://github.com/kubeflow/mpi-operator/blob/master/examples/tensorflow-benchmarks.yaml) config file for launching a multi-node TensorFlow benchmark training job. You may change the config file based on your requirements.
+You can create an MPI job by defining an `MPIJob` config file. See [Tensorflow benchmark example](https://github.com/kubeflow/mpi-operator/blob/master/examples/v1alpha2/tensorflow-benchmarks.yaml) config file for launching a multi-node TensorFlow benchmark training job. You may change the config file based on your requirements.
 
 ```
-cat examples/tensorflow-benchmarks.yaml
+cat examples/v1alpha2/tensorflow-benchmarks.yaml
 ```
 Deploy the `MPIJob` resource to start training:
 
 ```
-kubectl create -f examples/tensorflow-benchmarks.yaml
+kubectl create -f examples/v1alpha2/tensorflow-benchmarks.yaml
 ```
 
 ## Monitoring an MPI Job
 
 Once the `MPIJob` resource is created, you should now be able to see the created pods matching the specified number of GPUs. You can also monitor the job status from the status section. Here is sample output when the job is successfully completed.
 
 ```
-kubectl get -o yaml mpijobs tensorflow-benchmarks-16
+kubectl get -o yaml mpijobs tensorflow-benchmarks
 ```
 
 ```
-apiVersion: kubeflow.org/v1alpha1
+apiVersion: kubeflow.org/v1alpha2
 kind: MPIJob
 metadata:
-  clusterName: ""
-  creationTimestamp: 2019-01-07T20:32:12Z
+  creationTimestamp: "2019-07-09T22:15:51Z"
   generation: 1
-  name: tensorflow-benchmarks-16
+  name: tensorflow-benchmarks
   namespace: default
-  resourceVersion: "185051397"
-  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/mpijobs/tensorflow-benchmarks-16
-  uid: 8dc8c044-127d-11e9-a419-02420bbe29f3
+  resourceVersion: "5645868"
+  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/mpijobs/tensorflow-benchmarks
+  uid: 1c5b470f-a297-11e9-964d-88d7f67c6e6d
 spec:
-  gpus: 16
-  template:
-    metadata:
-      creationTimestamp: null
-    spec:
-      containers:
-      - image: mpioperator/tensorflow-benchmarks:latest
-        name: tensorflow-benchmarks
-        resources: {}
+  cleanPodPolicy: Running
+  mpiReplicaSpecs:
+    Launcher:
+      replicas: 1
+      template:
+        spec:
+          containers:
+          - command:
+            - mpirun
+            - --allow-run-as-root
+            - -np
+            - "2"
+            - -bind-to
+            - none
+            - -map-by
+            - slot
+            - -x
+            - NCCL_DEBUG=INFO
+            - -x
+            - LD_LIBRARY_PATH
+            - -x
+            - PATH
+            - -mca
+            - pml
+            - ob1
+            - -mca
+            - btl
+            - ^openib
+            - python
+            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
+            - --model=resnet101
+            - --batch_size=64
+            - --variable_update=horovod
+            image: mpioperator/tensorflow-benchmarks:latest
+            name: tensorflow-benchmarks
+    Worker:
+      replicas: 1
+      template:
+        spec:
+          containers:
+          - image: mpioperator/tensorflow-benchmarks:latest
+            name: tensorflow-benchmarks
+            resources:
+              limits:
+                nvidia.com/gpu: 2
+  slotsPerWorker: 2
 status:
-  launcherStatus: Succeeded
+  completionTime: "2019-07-09T22:17:06Z"
+  conditions:
+  - lastTransitionTime: "2019-07-09T22:15:51Z"
+    lastUpdateTime: "2019-07-09T22:15:51Z"
+    message: MPIJob default/tensorflow-benchmarks is created.
+    reason: MPIJobCreated
+    status: "True"
+    type: Created
+  - lastTransitionTime: "2019-07-09T22:15:54Z"
+    lastUpdateTime: "2019-07-09T22:15:54Z"
+    message: MPIJob default/tensorflow-benchmarks is running.
+    reason: MPIJobRunning
+    status: "False"
+    type: Running
+  - lastTransitionTime: "2019-07-09T22:17:06Z"
+    lastUpdateTime: "2019-07-09T22:17:06Z"
+    message: MPIJob default/tensorflow-benchmarks successfully completed.
+    reason: MPIJobSucceeded
+    status: "True"
+    type: Succeeded
+  replicaStatuses:
+    Launcher:
+      succeeded: 1
+    Worker: {}
+  startTime: "2019-07-09T22:15:51Z"
 ```
 
 
 Training should run for 100 steps and takes a few minutes on a GPU cluster. You can inspect the logs to see the training progress. When the job starts, access the logs from the `launcher` pod:
 
 ```
-PODNAME=$(kubectl get pods -l mpi_job_name=tensorflow-benchmarks-16,mpi_role_type=launcher -o name)
+PODNAME=$(kubectl get pods -l mpi_job_name=tensorflow-benchmarks,mpi_role_type=launcher -o name)
 kubectl logs -f ${PODNAME}
 ```
 
 ```
-TensorFlow: 1.10
+TensorFlow: 1.14
 Model: resnet101
 Dataset: imagenet (synthetic)
 Mode: training
@@ -108,32 +169,29 @@ Batch size: 128 global
 Num batches: 100
 Num epochs: 0.01
 Devices: ['horovod/gpu:0', 'horovod/gpu:1']
+NUMA bind: False
 Data format: NCHW
 Optimizer: sgd
 Variables: horovod
 
 ...
 
-40 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.146
-40 images/sec: 132.1 +/- 0.0 (jitter = 0.1) 9.182
-50 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.071
-50 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.210
-60 images/sec: 132.2 +/- 0.0 (jitter = 0.2) 9.180
-60 images/sec: 132.2 +/- 0.0 (jitter = 0.2) 9.055
-70 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.005
-70 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.096
-80 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.231
-80 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.197
-90 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.201
-90 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.089
-100 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.183
-----------------------------------------------------------------
-total images/sec: 264.26
-----------------------------------------------------------------
-100 images/sec: 132.1 +/- 0.0 (jitter = 0.2) 9.044
-----------------------------------------------------------------
-total images/sec: 264.26
+40 images/sec: 154.4 +/- 0.7 (jitter = 4.0) 8.280
+40 images/sec: 154.4 +/- 0.7 (jitter = 4.1) 8.482
+50 images/sec: 154.8 +/- 0.6 (jitter = 4.0) 8.397
+50 images/sec: 154.8 +/- 0.6 (jitter = 4.2) 8.450
+60 images/sec: 154.5 +/- 0.5 (jitter = 4.1) 8.321
+60 images/sec: 154.5 +/- 0.5 (jitter = 4.4) 8.349
+70 images/sec: 154.5 +/- 0.5 (jitter = 4.0) 8.433
+70 images/sec: 154.5 +/- 0.5 (jitter = 4.4) 8.430
+80 images/sec: 154.8 +/- 0.4 (jitter = 3.6) 8.199
+80 images/sec: 154.8 +/- 0.4 (jitter = 3.8) 8.404
+90 images/sec: 154.6 +/- 0.4 (jitter = 3.7) 8.418
+90 images/sec: 154.6 +/- 0.4 (jitter = 3.6) 8.459
+100 images/sec: 154.2 +/- 0.4 (jitter = 4.0) 8.372
+100 images/sec: 154.2 +/- 0.4 (jitter = 4.0) 8.542
 ----------------------------------------------------------------
+total images/sec: 308.27
 ```
 
 # Docker Images
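Since v1alpha2 reports progress through `status.conditions` instead of the old `launcherStatus` field, the outcome of the sample job above can also be read directly from the condition list; a small sketch (the JSONPath query is illustrative, not part of this commit):

```shell
# Print the status of the Succeeded condition for the sample job above;
# this should output "True" once the launcher pod completes.
kubectl get mpijobs tensorflow-benchmarks \
  -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'
```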

cmd/kubectl-delivery/Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -1,13 +1,13 @@
-FROM alpine:3.8 AS build
+FROM alpine:3.10 AS build
 
 # Install kubectl.
-ENV K8S_VERSION v1.13.2
+ENV K8S_VERSION v1.15.0
 RUN apk add --no-cache wget
 RUN wget -q https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl
 RUN chmod +x ./kubectl
 RUN mv ./kubectl /bin/kubectl
 
-FROM alpine:3.8
+FROM alpine:3.10
 COPY --from=build /bin/kubectl /bin/kubectl
 COPY deliver_kubectl.sh .
 ENTRYPOINT ["./deliver_kubectl.sh"]
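The bump to alpine:3.10 and kubectl v1.15.0 can be sanity-checked by running the built image with its entrypoint overridden; a rough sketch, assuming the image is tagged `mpioperator/kubectl-delivery:latest` as in the deploy manifest below:

```shell
# Report the client version of the kubectl binary baked into the image;
# after this change it should print v1.15.0.
docker run --rm --entrypoint /bin/kubectl \
  mpioperator/kubectl-delivery:latest version --client
```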

cmd/mpi-operator.v1alpha2/app/options/options.go

Lines changed: 2 additions & 2 deletions
@@ -50,8 +50,8 @@ func (s *ServerOption) AddFlags(fs *flag.FlagSet) {
 		"The container image used to deliver the kubectl binary.")
 
 	fs.StringVar(&s.Namespace, "namespace", v1.NamespaceAll,
-		`The namespace to monitor tfjobs. If unset, it monitors all namespaces cluster-wide.
-		If set, it only monitors tfjobs in the given namespace.`)
+		`The namespace to monitor mpijobs. If unset, it monitors all namespaces cluster-wide.
+		If set, it only monitors mpijobs in the given namespace.`)
 
 	fs.IntVar(&s.Threadiness, "threadiness", 2,
 		`How many threads to process the main logic`)

deploy/2-rbac.yaml

Lines changed: 9 additions & 0 deletions
@@ -26,6 +26,14 @@ rules:
   - pods/exec
   verbs:
   - create
+- apiGroups:
+  - ""
+  resources:
+  - endpoints
+  verbs:
+  - create
+  - get
+  - update
 - apiGroups:
   - ""
   resources:
@@ -80,6 +88,7 @@ rules:
   - kubeflow.org
   resources:
   - mpijobs
+  - mpijobs/status
   verbs:
   - "*"
 ---
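One way to confirm the widened RBAC rules after applying the manifest is `kubectl auth can-i` impersonating the operator's service account; a sketch in which the `default:mpi-operator` service-account reference is an assumption, since the account name is not shown in this diff:

```shell
# Verify the operator's service account may now create Endpoints objects;
# the service-account namespace/name below are assumed for illustration only.
kubectl auth can-i create endpoints \
  --as=system:serviceaccount:default:mpi-operator
```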

deploy/3-mpi-operator.yaml

Lines changed: 0 additions & 1 deletion
@@ -21,7 +21,6 @@ spec:
         image: mpioperator/mpi-operator:latest
         args: [
           "-alsologtostderr",
-          "--gpus-per-node", "8",
           "--kubectl-delivery-image",
           "mpioperator/kubectl-delivery:latest"
         ]
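With `--gpus-per-node` removed, the flags that remain line up with the options registered in `cmd/mpi-operator.v1alpha2/app/options/options.go` above; a hedged sketch of invoking the binary with the optional `-namespace` flag from that file (the `default` value is only an example):

```shell
# Run the operator restricted to a single namespace instead of cluster-wide;
# -namespace and -kubectl-delivery-image are the flags shown in this commit.
/bin/mpi-operator -alsologtostderr \
  -namespace default \
  -kubectl-delivery-image mpioperator/kubectl-delivery:latest
```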
Lines changed: 3 additions & 11 deletions
@@ -1,16 +1,8 @@
-FROM uber/horovod:0.15.2-tf1.12.0-torch1.0.0-py2.7
-
-# Temporary fix until Horovod pushes out a new release.
-# See https://github.com/uber/horovod/pull/700
-RUN sed -i '/^NCCL_SOCKET_IFNAME.*/d' /etc/nccl.conf
+FROM horovod/horovod:0.16.4-tf1.14.0-torch1.1.0-mxnet1.4.1-py3.6
 
 RUN mkdir /tensorflow
 WORKDIR "/tensorflow"
-RUN git clone -b cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks
+RUN git clone https://github.com/tensorflow/benchmarks
 WORKDIR "/tensorflow/benchmarks"
 
-CMD mpirun \
-    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
-    --model resnet101 \
-    --batch_size 64 \
-    --variable_update horovod
+CMD ["/bin/bash"]
File renamed without changes.
