
Implement MPI Plugin for OpenMPI #2493

Merged
merged 9 commits into from
Mar 13, 2025

Conversation

tenzen-y
Member

@tenzen-y tenzen-y commented Mar 10, 2025

What this PR does / why we need it:

I have done the following things to support OpenMPI:

  • Added MPI policy propagation and a resource construction mechanism to the MPI plugin.
  • Added a mechanism to obtain the runtime template (.spec.template) from runtime.Info in each plugin.
  • Added a PodNetworkPlugin to the KF Pipeline Framework to identify Pod endpoints like test-job-trainer-node-0-0.test-job.
  • Introduced a PodSet internal data structure in the runtime package to represent arbitrary types of Jobs (Initializer, Launcher) as opposed to Trainer; see the sketch after this list. However, it is used only by the MPI plugin for now. Other plugins keep using the previous Trainer data structure until it is removed completely (Migrate Trainer to PodSet and RuntimePolicy in runtime package (InternalAPI) #2495)
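For illustration, here is a minimal sketch of that PodSet idea. The field set is simplified from the snippets quoted in the review threads below (Name, CountForNonTrainer, and the endpoint iterator); the actual definition in the runtime package may differ.

package runtime

import "iter"

// PodSet is a simplified sketch of the internal structure introduced in this PR.
// The fields mirror the snippets quoted in the review threads below.
type PodSet struct {
	// Name of the role, e.g. "trainer-node", "launcher", or an initializer role.
	Name string
	// CountForNonTrainer mirrors batch/v1 Job parallelism for non-Trainer roles.
	// For the Trainer role it is nil, and the count comes from MLPolicy.NumNodes.
	CountForNonTrainer *int32
	// Endpoints lazily yields Pod endpoints such as
	// "<trainJob>-<replicatedJob>-<jobIdx>-<podIdx>.<subdomain>".
	Endpoints iter.Seq[string]
}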

Furthermore, I quickly verified the OpenMPI workload behavior with the following TrainJob and manifests/base/runtimes/mpi_distributed.yaml.

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: mpirun
  namespace: default
spec:
  runtimeRef:
    name: mpi-distributed
$ kubectl get pod
NAME                            READY   STATUS    RESTARTS     AGE
mpirun-launcher-0-0-wvmpl       1/1     Running   1 (1s ago)   2s
mpirun-trainer-node-0-0-52jxr   1/1     Running   0            1s
mpirun-trainer-node-0-1-czdb9   1/1     Running   0            1s
$
$ kubectl logs mpirun-launcher-0-0-wvmpl 
Warning: Permanently added '[mpirun-trainer-node-0-0.mpirun]:2222' (ECDSA) to the list of known hosts.
Warning: Permanently added '[mpirun-trainer-node-0-1.mpirun]:2222' (ECDSA) to the list of known hosts.
Workers: 2
Rank 0 on host mpirun-trainer-node-0-0
Rank 1 on host mpirun-trainer-node-0-1
pi is approximately 3.1410376000000002

Additionally, we will work on the following items in follow-up PRs:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@tenzen-y tenzen-y force-pushed the implement-mpi-plugin branch 5 times, most recently from 261cded to 590f258 Compare March 10, 2025 13:20
@tenzen-y tenzen-y marked this pull request as ready for review March 10, 2025 15:21
@tenzen-y
Member Author

/hold
I have implemented OpenMPI workload support. Please take another look. Thanks.
@kubeflow/wg-training-leads @astefanutti

@tenzen-y tenzen-y force-pushed the implement-mpi-plugin branch 5 times, most recently from 43cbefe to 8860550 Compare March 11, 2025 11:14
@tenzen-y tenzen-y force-pushed the implement-mpi-plugin branch from 8860550 to 8282e74 Compare March 11, 2025 11:17
@@ -103,26 +108,74 @@ func (r *TrainingRuntime) buildObjects(
runtime.WithAnnotations(propagationAnnotations),
runtime.WithMLPolicy(mlPolicy),
runtime.WithPodGroupPolicy(podGroupPolicy),
runtime.WithTemplateSpecObjApply[jobsetv1alpha2ac.JobSetSpecApplyConfiguration](&jobsetv1alpha2.JobSet{
Contributor

I wonder if loading the TrainingRuntime as apply configuration directly would simplify things and remove the need for converting things like Volume to VolumeApplyConfiguration.

Member Author

Yeah, that is a good point. Actually, I was thinking about it a bit.
However, we will add another CRD as runtime like SingleRoleTrainingRuntime based on batch/v1 Job.

Member Author
@tenzen-y tenzen-y Mar 11, 2025

Ah, that means we can pass .spec.template (JobSetTemplateSpec) as an ApplyConfiguration to WithTemplateSpecObjApply, right?

Member Author
@tenzen-y tenzen-y Mar 11, 2025

In that case, we can remove the parser from the WithTemplateSpecObjApply function. That makes sense. Thanks.

Member

However, we will add another CRD as runtime like SingleRoleTrainingRuntime based on batch/v1 Job.

I am not sure if that is needed, until we really hear use-cases when JobSet won't work for users.

Member Author

However, we will add another CRD as runtime like SingleRoleTrainingRuntime based on batch/v1 Job.

I am not sure if that is needed, until we really hear use-cases when JobSet won't work for users.

As I mentioned above, the unstructured parse mechanism will be removed. The SingleRoleTrainingRuntime was just an example.

Member Author

Done.

Member
@andreyvelich andreyvelich left a comment

Thank you for this great work @tenzen-y!
I left my initial comments, will take a look again in a few hours.

@@ -7,39 +7,51 @@ metadata:
trainer.kubeflow.org/phase: pre-training
spec:
mlPolicy:
numNodes: 1
numNodes: 2
Member

Should we keep 1 Node by default ?

Suggested change
numNodes: 2
numNodes: 1

Member Author

Yeah, sure.

Member Author

Done.

mpi:
numProcPerNode: 1
mpiImplementation: OpenMPI
sshAuthMountPath: /root/.ssh
sshAuthMountPath: /home/mpiuser/.ssh
Member

Should we keep the default path ?

Suggested change
sshAuthMountPath: /home/mpiuser/.ssh
sshAuthMountPath: /root/.ssh

Member Author

Currently, we support only non-root users. Root user support requires additional enhancement.

Member

In that case, do we need to change the default mount path for now to /home/mpiuser/.ssh ?
Also, does it mean that having /home/mpiuser directory is a requirement to use MPI runtime with Kubeflow Trainer?

Member Author
@tenzen-y tenzen-y Mar 11, 2025

/home/mpiuser

They can use an arbitrary USER instead of mpiuser. The mpiuser is specific to the pi container image.

- name: trainer-node
dependsOn:
Member

Why did you remove dependsOn in the trainer, where the trainer should start after the launcher is ready?

Member Author

This causes a deadlock because the launcher keeps crashing and restarting until the trainer-node endpoints are healthy (≠ PodReady).

Member

I see, do we see any problems when trainer node Job starts before the launcher ?
I thought one of the goals for StartupPolicy/DependsOn API was the MPI use-case.

Member Author

I see, do we see any problems when trainer node Job starts before the launcher ?

This is implemented as an optional parameter (WaitForWorkersReady) for MPIJob v2beta1:

https://github.com/kubeflow/mpi-operator/blob/7f94988ab1d27fb46c69994e538543ef0e115589/pkg/apis/kubeflow/v2beta1/types.go#L194-L197

The Gang Scheduling problem is the reason why we use AtStartup as a default.

Member

I see, so with PodGroupPolicy we don't really need it.
In the current implementation, won't launcher fail to run mpirun if some of the workers are not Ready ?
Since by default we don't use PodGroupPolicy and Gang Scheduling.

Member Author

In the current implementation, won't launcher fail to run mpirun if some of the workers are not Ready ?
Since by default we don't use PodGroupPolicy and Gang Scheduling.

That is expected behavior. The launcher keeps restarting until the Node (=Pod) endpoints are ready.

OpenMPIEnvKeepFQDNHostNames string = "OMPI_MCA_orte_keep_fqdn_hostnames"

// OpenMPIEnvDefaultSlots is the OpenMPI default number of slots env key.
OpenMPIEnvDefaultSlots string = "OMPI_MCA_orte_set_default_slots"
Member
@andreyvelich andreyvelich Mar 11, 2025

Do we need this env variable for OpenMPI if the host file always sets the appropriate number of slots using NumProcPerNode value ?

Member Author

This env variable is defined as part of the OpenMPI implementation, and OpenMPI is not only about mpirun.
So, it would be better to provide the environment variable to align with the OpenMPI specification.

Member

Sounds good.
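To make the intent concrete, here is a minimal sketch of how these env vars could be attached to the launcher container via apply configurations. It is illustrative only: launcherOpenMPIEnv is a hypothetical helper, and corev1ac is the k8s.io/client-go/applyconfigurations/core/v1 package already used elsewhere in this PR.

package mpi

import (
	"strconv"

	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
)

// launcherOpenMPIEnv (hypothetical helper) builds the OpenMPI MCA env vars
// discussed above for the launcher container.
func launcherOpenMPIEnv(numProcPerNode int32) []*corev1ac.EnvVarApplyConfiguration {
	return []*corev1ac.EnvVarApplyConfiguration{
		// Keep FQDN hostnames so hostfile entries resolve within the JobSet subdomain.
		corev1ac.EnvVar().WithName("OMPI_MCA_orte_keep_fqdn_hostnames").WithValue("true"),
		// Default slot count, mirroring numProcPerNode from the MPI policy.
		corev1ac.EnvVar().WithName("OMPI_MCA_orte_set_default_slots").
			WithValue(strconv.Itoa(int(numProcPerNode))),
	}
}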

Name string
// If Name is trainer-node, CountForNonTrainer is null.
// For Trainer, PodSet Count should be stored in Info.RuntimePolicy.MLPolicy.NumNodes.
CountForNonTrainer *int32
Member
@andreyvelich andreyvelich Mar 11, 2025

Why do we need it if we are NOT planning to support more than 1 Job replica for other Jobs (e.g. Initializer, Launcher)?

Member Author

We actually support it here. The key point is that we have 3 types of replication: numNodes, batchv1.Job.parallelism, and JobSet.replicatedJob.Replicas.

This CountForNonTrainer corresponds to batchv1.Job.parallelism.

Member

Do we want to support batchv1.Job.parallelism != 1 for non Trainer Node jobs ?

Member Author

They can specify arbitrary roles except for trainer-node, initializer, and launcher. And it would be better to just propagate the parallelism parameters to JobSet.

Member

I see, makes sense.
I am wondering whether we should consolidate NumNodes into the PodCount as well?
Should we just override this value in the MLPolicy plugins based on trainJob.spec.numNodes and trainingRuntime.spec.mlPolicy.numNodes?

Member Author

I am wondering whether we should consolidate NumNodes into the PodCount as well?
Should we just override this value in the MLPolicy plugins based on trainJob.spec.numNodes and trainingRuntime.spec.mlPolicy.numNodes?

Yeah, I was thinking the same thing. However, if we want to do it, we need to change the Info.RuntimePolicy data structure, and it will affect all plugin implementations. So, let me revisit whether we should do it after #2495.

If we make a lot of data structure changes in a single scope, that causes bugs, IMO.

Member

Sounds great! It would be nice to make a note about it in #2495.

Member Author

Yeah sure

Member Author

Done.

After the above migration, we want to consider dropping NumNodes from Info.RuntimePolicy.MLPolicy and then fully relying on the PodSet data structure for the Trainer as well.


var (
errorTemplateSpecPathNotFound = errors.New("template spec path not found")

defaultPodSetsSyncer = func(*Info) {}
Member

How is this syncer used?

Member Author

Here is one of the sync flows:

  1. Register the callback function:
    runtime.WithPodSetSyncer(syncPodSets),
  2. Obtain the syncer:
    func (i *Info) SyncPodSetsToTemplateSpec() {
  3. Sync the PodSets to the TemplateSpecApplyConfiguration:
    info.SyncPodSetsToTemplateSpec()
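As an illustration of this callback pattern, here is a self-contained sketch. The types are simplified stand-ins for the real runtime package (which also carries the apply configuration); only the names quoted above (WithPodSetSyncer, SyncPodSetsToTemplateSpec, defaultPodSetsSyncer) are taken from the PR.

package runtime

// PodSet is simplified here; see the PodSet sketch in the PR description.
type PodSet struct {
	Name string
}

// TemplateSpec holds the PodSets that plugins mutate; in the real code it also
// holds the JobSetSpecApplyConfiguration that is eventually applied.
type TemplateSpec struct {
	PodSets []PodSet
}

// Info is the state passed through the Pipeline Framework plugins.
type Info struct {
	TemplateSpec  TemplateSpec
	podSetsSyncer func(*Info)
}

type InfoOption func(*Info)

// defaultPodSetsSyncer mirrors the no-op default quoted above.
var defaultPodSetsSyncer = func(*Info) {}

// WithPodSetSyncer registers the callback that copies PodSet mutations back
// into the template spec apply configuration.
func WithPodSetSyncer(sync func(*Info)) InfoOption {
	return func(i *Info) { i.podSetsSyncer = sync }
}

// SyncPodSetsToTemplateSpec invokes the registered syncer so that plugin
// changes on PodSets are reflected in the object that will be applied.
func (i *Info) SyncPodSetsToTemplateSpec() {
	if i.podSetsSyncer == nil {
		i.podSetsSyncer = defaultPodSetsSyncer
	}
	i.podSetsSyncer(i)
}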

@@ -112,6 +116,15 @@ func (f *Framework) RunCustomValidationPlugins(oldObj, newObj *trainer.TrainJob)
return aggregatedWarnings, aggregatedErrors
}

func (f *Framework) RunPodNetworkPlugins(info *runtime.Info, trainJob *trainer.TrainJob) error {
Member

Do we need a separate plugin for the Pod Network?
I can't imagine a use-case where this might be different.

Member

I am curious if it is only applied for MPI, should we couple it with MPI plugin for now ?

for e := range ps.Endpoints {
hostFile.WriteString(fmt.Sprintf("%s slots=%d\n", e, slots))
}

In the future, if we see a need for other use-cases when we want to generate endpoints for every Pod, we can refactor it.

Member Author

I am curious if it is only applied for MPI, should we couple it with MPI plugin for now ?

The endpoint pattern and construction mechanism deeply depend on JobSet functionality (subdomain, etc.) and the JobSet replication parameters. So, the MPI plugin cannot construct the endpoint list.

Member

You are right, so we might need to update the Pipeline Framework to introduce this plugin: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2170-kubeflow-training-v2#pipeline-framework.

I am curious: if in the future we are going to see more dependencies between components (e.g. MPI <-> JobSet), do we need to introduce more plugins into the Pipeline Framework?

I guess, the main goal is to have the correct Info object before we run the ComponentBuilder() plugin.

Member Author

You are right, so we might need to update the Pipeline Framework to introduce this plugin: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2170-kubeflow-training-v2#pipeline-framework.

This will be done as a part of #2497

I am curious: if in the future we are going to see more dependencies between components (e.g. MPI <-> JobSet), do we need to introduce more plugins into the Pipeline Framework?

In that case, we can prepare a more comprehensive interface, something like JobInfrastructure, and then add multiple functions such as Network, Volume, and so on. However, for now, a single Network interface should be enough.

rJobReplicas := 1
info.TemplateSpec.PodSets[rJobIdx].Endpoints = func(yield func(string) bool) {
for podIdx := range ptr.Deref(podCount, 1) {
endpoint := fmt.Sprintf("%s-%s-%d-%d.%s", trainJob.Name, *rJob.Name, rJobReplicas-1, podIdx, subDomain)
Member

For the MPI use-case, do we need separate endpoints for Launcher and Trainer jobs ?

Member Author

We need to record all Node (=Pod) endpoints in the hostfile.

Member

Should we exclude the Initializer Job from the PodNetwork ?

Member Author

Should we exclude the Initializer Job from the PodNetwork ?

It would be better to avoid distinguishing between the Initializer and other roles when constructing the Info object since, from outside the JobSet plugin's POV, it is not clear why Info.PodSet.Endpoints would be missing for some roles.

Member

Yeah, I think that makes a lot of sense! And we only iterate over Launcher + Trainer endpoints in the MPI plugin.

Member Author

Yes.
Additionally, PodSet.Endpoints is an iterator, so we can reduce the cost of calculating the endpoints in the case of large Jobs.
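A minimal, standalone sketch of that iterator pattern (Go 1.23+). The function and package names here are illustrative, not the trainer's actual code; the endpoint format mirrors the snippet quoted above.

package main

import (
	"fmt"
	"iter"
)

// endpointsFor lazily yields Pod endpoints of the form
// "<trainJob>-<replicatedJob>-<jobIdx>-<podIdx>.<subdomain>" without
// materializing the whole list up front.
func endpointsFor(trainJob, rJob, subDomain string, podCount int32) iter.Seq[string] {
	return func(yield func(string) bool) {
		for podIdx := int32(0); podIdx < podCount; podIdx++ {
			if !yield(fmt.Sprintf("%s-%s-0-%d.%s", trainJob, rJob, podIdx, subDomain)) {
				return
			}
		}
	}
}

func main() {
	// Consumers simply range over the sequence; endpoints are built on demand.
	for e := range endpointsFor("mpirun", "trainer-node", "mpirun", 2) {
		fmt.Println(e) // mpirun-trainer-node-0-0.mpirun, then mpirun-trainer-node-0-1.mpirun
	}
}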

cmSource.WithItems(cmItems...)
vol.WithConfigMap(cmSource)
}
// TODO: Add other volume sources
Member

Wouldn't it be tricky for us to support all volumes in the Apply configuration script ?

Member Author

Could you clarify "tricky for us"?

Member

Why do we need to support more volumes in this API?
We don't really call this outside of the MPI context, where we need to create Secret + ConfigMap volumes.
Or am I missing something?

Member Author

Actually, this was needed to parse the PodSpec under the JobSetSpec.

However, based on the #2493 (comment) discussion, we will remove the object parse mechanism, so we can remove this Volume function completely.

Thanks.

Member

I see, so do we want to address it in this PR or in the followup changes ?

Member Author

I see, so do we want to address it in this PR or in the followup changes ?

I am currently working on it in this PR. I will ping you after I have finalized the migration.

Member Author

Done.

Comment on lines +141 to +144
corev1ac.KeyToPath().
WithKey(constants.MPIHostfileName).
WithPath(constants.MPIHostfileName).
WithMode(0444),
Member

If filename is equal to the ConfigMap data key, do we need to set Path ?

Member Author

IIUC, we can omit it in that case. But this approach is more declarative, isn't it?

@@ -58,7 +58,7 @@ func (p *PlainML) EnforceMLPolicy(info *runtime.Info, trainJob *trainer.TrainJob

Member

Are we planning to change other plugins in the followup PRs ?
E.g. insert NumNodes to the info.RuntimePolicy.MLPolicy.NumNodes ?

Member Author

As we discussed previously, I prioritized this MPI enhancement over stabilizing the plugin mechanism.
So, the whole refactor will be performed in future PRs as part of #2495.

Member

Thanks!

if err != nil {
return nil, fmt.Errorf("failed to build Secret with SSH auth keys. Error: %v", err)
}
var objects []any
Member

Why do you want to create this variable just to append the secret?

Member Author

I might not be catching your point. Could you clarify? This is because the Secret is created only when it does not already exist.

Member

Oh, makes sense due to this comment: #2493 (comment)


@@ -87,29 +93,58 @@ func (b *Builder) Initializer(trainJob *trainer.TrainJob) *Builder {
}

// Launcher updates JobSet values for the launcher Job.
func (b *Builder) Launcher(info *runtime.Info, trainJob *trainer.TrainJob) *Builder {
func (b *Builder) Launcher() *Builder {
Member

Why do we need Launcher builder in JobSet plugins if we just assign the "1" to the Job Replicas ?
Would it be simpler to just validate in webhook that launcher ReplicatedJob has replicas: 1?

Member Author

Yeah, that's right. We should add webhook validators verifying that replicatedJobs[*].replicas is 1 for all reserved roles: Trainer, Initializer, and Launcher. However, I want to avoid conflicts with @akshaychitneni's implementation, and it sounds beyond the scope of this PR since we need to validate all reserved roles rather than just the Launcher.

So, I would like to do this as a follow-up.

Member Author

I opened a follow-up issue: #2502

// Init the JobSet apply configuration from the runtime template spec
jobSetBuilder := NewBuilder(jobsetv1alpha2ac.JobSet(trainJob.Name, trainJob.Namespace).
WithLabels(maps.Clone(info.Labels)).
WithAnnotations(maps.Clone(info.Annotations)).
WithSpec(jobSetTemplateSpec))
WithSpec(jobSetSpec))
Member

Should we apply jobSetSpec to the JobSet spec after we run the JobSet builder ?
Since plugins always contain the final values for the JobSet even if the TrainJob overrides them.
For example, if the webhook is disabled and a user accidentally configures PET_NNODES envs in the TrainJob, we have to override it according to the Torch plugin.

Member Author

Yeah, that sounds reasonable. Ideally, we want to propagate information in the order Info.TemplateSpec.PodSet -> TrainJob -> TrainingRuntime. In this case, the input TrainingRuntime will be stored in Info.TemplateSpec.ObjApply as immutable, and each plugin updates only Info.TemplateSpec.PodSet.

But the problem here is Info.Trainer. So, I would like to do more refactoring in a follow-up PR.

Member

Sounds good.

// TODO: Add tests for all Interfaces.
// REF: https://github.com/kubeflow/trainer/issues/2468

func TestJobSet(t *testing.T) {
Member Author

Duplicated tests will be removed as part of #2468 in a future PR.

WithVolumeMounts(
corev1ac.VolumeMount().
WithName(jobsetplgconsts.VolumeNameInitializer).
WithMountPath("/workspace/dataset"),
Member

We might need to store /workspace/dataset and /workspace/model in the constants since they are the default paths for the dataset and model when our initializer is used.

Member Author

Sure, let me try it.

Member Author

Done.

@andreyvelich
Member

cc MPI Operator folks in case you want to check out this PR
/cc @kannon92 @vsoch @kuizhiqing @alculquicondor @roteme-runai @mchmarny @mlsorensen @Syulin7


@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: mchmarny, mlsorensen, kannon92, vsoch, roteme-runai.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

cc MPI Operator folks in case you want to check out this PR
/cc @kannon92 @vsoch @kuizhiqing @alculquicondor @roteme-runai @mchmarny @mlsorensen @Syulin7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

containers:
- name: launcher
image: busybox
image: mpioperator/mpi-pi:openmpi
Contributor

Are these images published by kubeflow?

Contributor

Does kubeflow have an official repository for this? Otherwise this hits Docker Hub, and you may get rate limited if you bring this into the CI.

Member

That is a good point; actually, even with an official repository, we will hit pull rate limits for non-auth requests, more info here: kubeflow/manifests#3010
We are working to migrate our images to the GHCR: #2491
I think we should do the same for MPI Operator images.
cc @kubeflow/wg-training-leads

Member Author

I will open an issue for that in the mpi-operator repository.


@vsoch

vsoch commented Mar 12, 2025

So I understand - you are going to do this for every MPI implementation? Is that sustainable for developers?

if err != nil {
return nil, err
}
templateSpec, ok, err := unstructured.NestedFieldCopy(u, fields...)
Contributor

Maybe change the variable name so it's generic?

Member Author

Oh, good catch

Member Author

Done.

return nil, err
}
if !ok {
return nil, fmt.Errorf("%w: '.%s'", errorTemplateSpecPathNotFound, strings.Join(fields, "."))
Contributor

Maybe the specific error should be wrapping the generic error from the caller?

Member Author

Yeah, that's true... Let me fix that

Member Author

Done.

@@ -98,31 +105,86 @@ func (r *TrainingRuntime) buildObjects(
// The JobSetTemplateSpec annotations are overridden by the TrainJob Annotations (.spec.annotations).
propagationAnnotations[k] = v
}
jobSetSpecApply, err := apply.FromTypedObjWithFields[jobsetv1alpha2ac.JobSetSpecApplyConfiguration](&jobsetv1alpha2.JobSet{
Contributor
@astefanutti astefanutti Mar 12, 2025

Ideally, to be strictly correct, the TrainingRuntime would be fetched as unstructured in NewObjects to avoid pollution from zero/nil fields of the typed struct. It is converted to unstructured anyway in FromTypedObjWithFields.

Member Author

Yes, distinguishing zero/null is important for SSA.
So, ideally, we want to get unstructured objects in

err := r.client.Get(ctx, client.ObjectKey{Namespace: trainJob.Namespace, Name: trainJob.Spec.RuntimeRef.Name}, &trainingRuntime)
if err != nil {
return nil, fmt.Errorf("%w: %w", errorNotFoundSpecifiedTrainingRuntime, err)
}
return r.buildObjects(ctx, trainJob, trainingRuntime.Spec.Template, trainingRuntime.Spec.MLPolicy, trainingRuntime.Spec.PodGroupPolicy)
.

However, it would bring MPI-unrelated changes into this PR. So, let me open an issue and do it as a follow-up, thanks.

Member Author

Opened: #2515
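A rough sketch of what that follow-up could look like with an unstructured Get via the controller-runtime client. This is an assumption about the eventual shape of #2515, not code from this PR; the helper name is hypothetical, and the GroupVersionKind literal simply reuses the group/version from the TrainJob manifest above.

package runtime

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getTrainingRuntimeUnstructured (hypothetical helper) fetches the
// TrainingRuntime as unstructured so that only the fields actually set on the
// cluster object are carried into the SSA apply configuration, avoiding
// pollution from zero/nil values of the typed struct.
func getTrainingRuntimeUnstructured(ctx context.Context, c client.Client, key client.ObjectKey) (*unstructured.Unstructured, error) {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "trainer.kubeflow.org",
		Version: "v1alpha1",
		Kind:    "TrainingRuntime",
	})
	if err := c.Get(ctx, key, u); err != nil {
		return nil, err
	}
	return u, nil
}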

@andreyvelich
Member

So I understand - you are going to do this for every MPI implementation? Is that sustainable for developers?

I think, this should be driven by user requirements and ML use-cases.
Initially, we want to support only OpenMPI since we can build DeepSpeed and MLX runtimes with it:

@tenzen-y
Member Author

I think, this should be driven by user requirements and ML use-cases.
Initially, we want to support only OpenMPI since we can build DeepSpeed and MLX runtimes with it:

Initially, yes. But eventually, I want to support MPICH and IntelMPI as well.

@tenzen-y tenzen-y force-pushed the implement-mpi-plugin branch from 3c9179e to 76dff03 Compare March 12, 2025 19:20
@tenzen-y
Member Author

@andreyvelich @astefanutti I addressed all your comments, PTAL thanks.

PodSetEndpointsCmpOpts = cmp.Transformer("Seq", func(a iter.Seq[string]) []string { return slices.Collect(a) })
)

func SecretDataComparer(a, b map[string][]byte) bool {
Member

Can we make this util func more explicit ?

Suggested change
func SecretDataComparer(a, b map[string][]byte) bool {
func MPISecretDataComparer(a, b map[string][]byte) bool {

Member Author

Sure.

Member Author

Done.
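For context on the PodSetEndpointsCmpOpts transformer quoted in this hunk, here is a minimal standalone sketch of the technique: go-cmp cannot compare function values, so the transformer collects each iter.Seq[string] into a []string before diffing. The podSet type and test name are illustrative, not the repository's actual test code.

package util_test

import (
	"iter"
	"slices"
	"testing"

	"github.com/google/go-cmp/cmp"
)

// podSet is a stand-in with an iterator-typed field, like the runtime PodSet.
type podSet struct {
	Name      string
	Endpoints iter.Seq[string]
}

func TestEndpointsCmp(t *testing.T) {
	// Convert every iter.Seq[string] to a []string before comparison.
	seqCmp := cmp.Transformer("Seq", func(s iter.Seq[string]) []string { return slices.Collect(s) })

	want := podSet{Name: "trainer-node", Endpoints: slices.Values([]string{"a-0-0.sub", "a-0-1.sub"})}
	got := podSet{Name: "trainer-node", Endpoints: slices.Values([]string{"a-0-0.sub", "a-0-1.sub"})}

	if diff := cmp.Diff(want, got, seqCmp); diff != "" {
		t.Errorf("unexpected PodSet (-want +got):\n%s", diff)
	}
}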


func (j *JobSetWrapper) LauncherReplica() *JobSetWrapper {
for i, rJob := range j.Spec.ReplicatedJobs {
if rJob.Name == constants.JobTrainerNode {
Member

Do we really need this check ?
E.g. JobSet wrapper always has Trainer Node Job:

Name: constants.JobTrainerNode,

Member Author

This is needed for an array right-shift operation.
ReplicatedJobs must have the Jobs in the following order:

  • Initializer
  • Launcher
  • Trainer

Member

@tenzen-y If we don't really use the DependsOn API for the Launcher <-> Trainer relationship, do we need to preserve the order of ReplicatedJobs (Launcher set before Trainer Node) in our testing?

Member Author
@tenzen-y tenzen-y Mar 13, 2025

A slice should guarantee the order. So, we want to keep using the same order everywhere.

},
VolumeMounts: []corev1.VolumeMount{{
Name: jobsetplgconsts.VolumeNameInitializer,
MountPath: "constants.ModelMountPath",
Member

Suggested change
MountPath: "constants.ModelMountPath",
MountPath: constants.ModelMountPath,

Member Author

Done.

},
{
Name: jobsetplgconsts.VolumeNameInitializer,
MountPath: "constants.ModelMountPath",
Member

Suggested change
MountPath: "constants.ModelMountPath",
MountPath: constants.ModelMountPath,

Member Author

Done.

return j
}

func (j *JobSetWrapper) Env(rJobName, containerName string, envs ...corev1.EnvVar) *JobSetWrapper {
Member

Please can you add a TODO statement in the code noting that these functions should be refactored in favour of this Env() API:

Member Author

Sure.

Member Author

Done.

},
VolumeMounts: []corev1.VolumeMount{{
Name: jobsetplgconsts.VolumeNameInitializer,
MountPath: "constants.ModelMountPath",
Member

I think, there are a few places where you accidentally replaced this value.

Suggested change
MountPath: "constants.ModelMountPath",
MountPath: constants.ModelMountPath,

Member Author

Oh, good catch...

Member Author

Done.

@@ -33,7 +34,7 @@ import (

trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
"github.com/kubeflow/trainer/pkg/constants"
jobsetplugin "github.com/kubeflow/trainer/pkg/runtime/framework/plugins/jobset"
jobsetplgconsts "github.com/kubeflow/trainer/pkg/runtime/framework/plugins/jobset/constants"
testingutil "github.com/kubeflow/trainer/pkg/util/testing"
"github.com/kubeflow/trainer/test/integration/framework"
"github.com/kubeflow/trainer/test/util"
Member

Should we combine ginkgo.When("Reconciling TrainJob") and ginkgo.Describe("TrainJob controller") under a single Ginkgo context?
I don't think we use the trainjob_controller_test.go file for integration tests outside of the TrainJob controller context.
I just think this part can be easily moved under:

ginkgo.Describe("TrainJob controller") 

Member Author

We do not want to do that. Ideally, we should decouple the framework tests into a dedicated When (currently, we have all cases in a single When).

Member

I see, yeah that might make more sense to refactor it in the future.

Member
@andreyvelich andreyvelich left a comment

I think we are ready to merge this.
Thanks again for this tremendous work in such a short period of time, @tenzen-y!
Feel free to unhold.

/lgtm
/approve
/hold


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member Author

Thank you all! Special thanks to the leading reviewers, @astefanutti and @andreyvelich!

/hold cancel

@google-oss-prow google-oss-prow bot merged commit f64bdf2 into kubeflow:master Mar 13, 2025
14 checks passed
@tenzen-y tenzen-y deleted the implement-mpi-plugin branch March 13, 2025 05:04
tenzen-y added a commit to tenzen-y/trainer that referenced this pull request Mar 13, 2025
Signed-off-by: Yuki Iwai <[email protected]>

Use numNodes=1 as default mpi_distributed ClusterTrainingRuntime

Signed-off-by: Yuki Iwai <[email protected]>

Implement MPI Plugin for OpenMPI

Signed-off-by: Yuki Iwai <[email protected]>
mahdikhashan pushed a commit to mahdikhashan/trainer that referenced this pull request Mar 16, 2025
* Implement MPI Plugin for OpenMPI

Signed-off-by: Yuki Iwai <[email protected]>

* Directly pass the JobSetApplyConfiguration to RuntimeInfo

Signed-off-by: Yuki Iwai <[email protected]>

* Make repeated string as constants

Signed-off-by: Yuki Iwai <[email protected]>

* Use numNodes=1 as default mpi_distributed ClusterTrainingRuntime

Signed-off-by: Yuki Iwai <[email protected]>

* Remove unused errors

Signed-off-by: Yuki Iwai <[email protected]>

* Rename runLauncherAsWorker with runLauncherAsNode

Signed-off-by: Yuki Iwai <[email protected]>

* Fix unintended constants usage for ModelMountPath

Signed-off-by: Yuki Iwai <[email protected]>

* Rename SecretDataComparer with MPISecretDataComparer

Signed-off-by: Yuki Iwai <[email protected]>

* Add TODO statement to deprecated env wrappers.

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>