MPIJob EFA example doesn't apply #19

@kwohlfahrt

The MPIJob EFA example here doesn't apply cleanly. It fails with the following error:

Error from server (BadRequest): error when creating "mpijob.yaml": MPIJob in version "v2beta1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.mpiReplicaSpecs.launcher.template.spec.imagePullPolicy", unknown field "spec.mpiReplicaSpecs.worker.template.spec.imagePullPolicy"

The issue is that imagePullPolicy must be specified on each container, not on the pod spec. Changing the launcher so it reads like this (and doing the same for the worker) allows the manifest to apply:

  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          #- image: <account>.dkr.ecr.us-west-2.amazonaws.com/cuda-efa-nccl-tests:ubuntu18.04
          - image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11-ubuntu18.04
            imagePullPolicy: IfNotPresent
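
For anyone who would rather script the same change, here is a minimal sketch in Python. It assumes the manifest has already been loaded into a plain dict (for example with a YAML parser). The function name `move_image_pull_policy` and the trimmed-down `broken` manifest are hypothetical and only for illustration. They are not part of the example repository or the MPI operator.

```python
import copy


def move_image_pull_policy(manifest):
    """Return a copy of an MPIJob manifest dict with imagePullPolicy moved
    from each replica's pod spec onto that pod's containers."""
    fixed = copy.deepcopy(manifest)
    replica_specs = fixed.get("spec", {}).get("mpiReplicaSpecs", {})
    for replica in replica_specs.values():  # e.g. Launcher, Worker
        pod_spec = replica.get("template", {}).get("spec", {})
        # imagePullPolicy is rejected at the pod-spec level, so remove it there.
        policy = pod_spec.pop("imagePullPolicy", None)
        if policy is None:
            continue
        for container in pod_spec.get("containers", []):
            # Do not overwrite a policy the container already sets itself.
            container.setdefault("imagePullPolicy", policy)
    return fixed


# Hypothetical, trimmed-down manifest mirroring the broken launcher section.
broken = {
    "spec": {
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "restartPolicy": "OnFailure",
                        "imagePullPolicy": "IfNotPresent",
                        "containers": [
                            {"image": "public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11-ubuntu18.04"}
                        ],
                    }
                },
            }
        }
    }
}

fixed = move_image_pull_policy(broken)
```

The helper works on a deep copy, so the original dict is left untouched and the two can be compared before writing the result back out.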

Edit: actually, even with this fix, I'm unable to get it running. The launcher's SSH connection to the worker fails: Connection reset by 172.17.5.245 port 22.
