feat(runtimes): Support MLX Distributed Runtime with OpenMPI #2565

Merged
9 commits merged into kubeflow:master from the mlx-runtime branch
Mar 27, 2025

andreyvelich
Member

@andreyvelich andreyvelich commented Mar 24, 2025

Fixes: #2047

This is the implementation of the MLX distributed runtime in Kubeflow Trainer V2 🎉

I added an example of distributed MNIST training with MLX on a local minikube cluster.

Let's merge it after #2559, since this PR includes a commit from that branch.

/cc @kubeflow/wg-training-leads @gaocegege @Electronic-Waste @astefanutti @saileshd1402 @shravan-achar @akshaychitneni @franciscojavierarceo @awni @angeloskath @Blaizzy @jeiksegovia @mstei4176

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: shravan-achar, mstei4176, saileshd1402, awni, angeloskath, blaizzy, jeiksegovia.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

@coveralls

coveralls commented Mar 24, 2025

Pull Request Test Coverage Report for Build 14096955415

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 65.938%

Totals Coverage Status
Change from base Build 14092676780: 0.0%
Covered Lines: 1719
Relevant Lines: 2607

💛 - Coveralls

@andreyvelich andreyvelich force-pushed the mlx-runtime branch 2 times, most recently from bbf7d29 to a6d4810 Compare March 24, 2025 19:47
@andreyvelich andreyvelich changed the title from "[WIP] feat(runtimes): Support MLX Distributed Runtime with OpenMPI" to "feat(runtimes): Support MLX Distributed Runtime with OpenMPI" Mar 24, 2025
Comment on lines 5 to 6
labels:
  trainer.kubeflow.org/accelerator: cpu
Member Author

Any thoughts on this label, @Electronic-Waste @tenzen-y?
If you want, I can remove it for now.

Member

I think it would be better to remove this label for now, until we come to an agreement on runtime labels: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1741263570091899

Comment on lines 24 to 28
ARG MPI_VERSION=5.0.7
RUN wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${MPI_VERSION}.tar.gz
RUN tar -xvf openmpi-${MPI_VERSION}.tar.gz && rm -rf openmpi-${MPI_VERSION}.tar.gz
RUN cd openmpi-${MPI_VERSION} && ./configure --prefix=/usr/local && make -j"$(nproc)" install
RUN rm -rf openmpi-${MPI_VERSION}
Member Author

I am not sure why the image build takes more than 3 hours in GitHub Actions.
Locally, I was able to build it in 10 minutes.

Member Author

Interestingly, for the DeepSpeed image, building OpenMPI from source takes ~50 minutes: https://github.com/kubeflow/trainer/actions/runs/14005402431/job/39218624074#step:7:40444

Member

Screenshot 2025-03-25 10 46 47

Is it normal for this command to be executed many times?

Member Author

Not really; locally it runs only once.
Let me try running just make install.

Member

Uhhh. It's slower than before. Almost 6h. What happened?

@@ -0,0 +1,30 @@
FROM mpioperator/base:v0.6.0 AS mpi
FROM debian:trixie
Member Author

We discussed with @tenzen-y that we will use a Debian image for the MLX Runtime for now, since we see problems running OpenMPI on Fedora.
We will investigate later whether we need to improve this runtime.

This is an interesting initiative, great job everyone!

But I wonder how you plan to run this on Linux since at the moment MLX only runs on Apple silicon.

Member Author

Thanks @Blaizzy! As you can see, this image builds mlx and mlx-data from source. Also, MLX already runs some of its CI tests on Linux machines: https://github.com/ml-explore/mlx/blob/main/.circleci/config.yml#L72

I believe this might be useful for folks who want to experiment with distributed MLX and MPI inside a Kubernetes environment.

Signed-off-by: Andrey Velichkevich <[email protected]>
pip_index_url,
packages_str,
# For the OpenMPI, the packages must be installed for the mpiuser.
"--user" if runtime_entrypoint[0] == constants.MPI_ENTRYPOINT else "",
Member Author

Found a bug: for OpenMPI, we should install packages for the user.
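
For readers following the thread, here is a minimal sketch of the idea behind this fix. It is not the SDK code from this PR: `build_pip_command`, the `MPI_ENTRYPOINT` value `"mpirun"`, and the exact pip flags are assumptions made for illustration. The only behavior taken from the diff is that `--user` is added when the first element of the runtime entrypoint is the MPI entrypoint.

```python
import shlex

# Hypothetical stand-in for constants.MPI_ENTRYPOINT; the real value may differ.
MPI_ENTRYPOINT = "mpirun"


def build_pip_command(runtime_entrypoint, packages, pip_index_url):
    """Build a pip install command string, adding --user for MPI runtimes."""
    cmd = ["pip", "install", "--index-url", pip_index_url]
    # With OpenMPI the training process runs as a non-root user (mpiuser in
    # this PR), so packages are installed into that user's site-packages.
    if runtime_entrypoint and runtime_entrypoint[0] == MPI_ENTRYPOINT:
        cmd.append("--user")
    cmd.extend(packages)
    # shlex.join quotes each argument so the result is safe to embed in a shell script.
    return shlex.join(cmd)
```

Called with an entrypoint such as `["mpirun", "python"]`, the sketch includes `--user`; with any other entrypoint the flag is left out, so non-MPI runtimes keep their previous behavior.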

@andreyvelich
Member Author

/hold

Signed-off-by: Andrey Velichkevich <[email protected]>
Comment on lines 37 to 39
# Give mpiuser permission to download packages and HF models.
RUN chown -R mpiuser:mpiuser /home/mpiuser/.local
RUN chown -R mpiuser:mpiuser /home/mpiuser/.cache
Member Author

Finally fixed the permission issue.
With these changes, the example works fine for me. cc @tenzen-y
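
As a rough way to confirm this kind of permission fix from inside a container, the sketch below reports which of the two directories from the Dockerfile snippet are missing or not writable by the current user. The helper name `unwritable_dirs` and the idea of running it as a diagnostic are assumptions for illustration; nothing like it is part of this PR.

```python
import os
from pathlib import Path


def unwritable_dirs(home, subdirs=(".local", ".cache")):
    """Return paths under `home` that are missing or not writable by the current user."""
    blocked = []
    for name in subdirs:
        path = Path(home) / name
        # Writing into a directory needs both write and execute (search) permission.
        if not path.is_dir() or not os.access(path, os.W_OK | os.X_OK):
            blocked.append(str(path))
    return blocked


if __name__ == "__main__":
    # In the image from this PR, the home directory would be /home/mpiuser.
    print(unwritable_dirs(Path.home()))
```

An empty list suggests that pip `--user` installs and model downloads into the cache directory should not fail on directory permissions. Note that `os.access` checks against the real user ID, so the script needs to be run as the same user the training process runs as.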

@andreyvelich
Member Author

/hold cancel

Signed-off-by: Andrey Velichkevich <[email protected]>
Member

@tenzen-y tenzen-y left a comment

Thank you
/lgtm
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

@google-oss-prow google-oss-prow bot merged commit 41ca967 into kubeflow:master Mar 27, 2025
18 checks passed
@andreyvelich andreyvelich deleted the mlx-runtime branch March 27, 2025 09:38
Development

Successfully merging this pull request may close these issues.

Create MLX Runtime with Kubeflow Trainer
5 participants