Open
Description
The MPIJob EFA example here doesn't apply cleanly; it fails with the following error:
Error from server (BadRequest): error when creating "mpijob.yaml": MPIJob in version "v2beta1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.mpiReplicaSpecs.launcher.template.spec.imagePullPolicy", unknown field "spec.mpiReplicaSpecs.worker.template.spec.imagePullPolicy"
The issue is that imagePullPolicy must be specified on the container, not on the pod spec. Changing the launcher so it reads like this (and doing the same for the worker) allows the manifest to apply:
mpiReplicaSpecs:
  Launcher:
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        #- image: <account>.dkr.ecr.us-west-2.amazonaws.com/cuda-efa-nccl-tests:ubuntu18.04
        - image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11-ubuntu18.04
          imagePullPolicy: IfNotPresent

Edit: actually, even with this fix, I'm unable to get it running. The connection from the launcher is refused by the worker: Connection reset by 172.17.5.245 port 22.
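For completeness, the matching worker change mentioned above might look roughly like the sketch below. Only the image and the position of imagePullPolicy come from this report; the replica count and the container name are placeholders I made up for illustration, and the rest of the original worker spec (command, resource limits, and so on) is omitted rather than guessed.

```yaml
# Sketch only: shows where imagePullPolicy belongs for the worker.
mpiReplicaSpecs:
  Worker:
    replicas: 2                      # assumed value, not taken from the original example
    template:
      spec:
        # imagePullPolicy is NOT valid here (pod spec level)
        containers:
        - name: nccl-test-worker     # hypothetical name, for illustration only
          image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11-ubuntu18.04
          imagePullPolicy: IfNotPresent   # valid here (container level)
```

This only addresses the strict decoding error; it does not change the SSH connection failure described in the edit above.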