-
Notifications
You must be signed in to change notification settings - Fork 763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(runtimes): Support MLX Distributed Runtime with OpenMPI #2565
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: shravan-achar, mstei4176, saileshd1402, awni, angeloskath, blaizzy, jeiksegovia. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Pull Request Test Coverage Report for Build 14096955415Details
💛 - Coveralls |
bbf7d29
to
a6d4810
Compare
labels: | ||
trainer.kubeflow.org/accelerator: cpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on this label @Electronic-Waste @tenzen-y ?
If you want, I can remove it for now as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, it will be better if we remove this label for now until we come to an agreement on runtime labels: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1741263570091899
cmd/runtimes/mlx/Dockerfile
Outdated
ARG MPI_VERSION=5.0.7 | ||
RUN wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${MPI_VERSION}.tar.gz | ||
RUN tar -xvf openmpi-${MPI_VERSION}.tar.gz && rm -rf openmpi-${MPI_VERSION}.tar.gz | ||
RUN cd openmpi-${MPI_VERSION} && ./configure --prefix=/usr/local && make -j"$(nproc)" install | ||
RUN rm -rf openmpi-${MPI_VERSION} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why the image build takes more than 3 hours in GitHub actions.
Locally I was able to build it in 10 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, for DeepSpeed image building OpenMPI from sources takes ~ 50 minutes: https://github.com/kubeflow/trainer/actions/runs/14005402431/job/39218624074#step:7:40444
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, locally it runs only once.
Let me try to just run make install
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhhh. It's slower than before. Almost 6h. What had happened?...
a5da3c7
to
78d666b
Compare
@@ -0,0 +1,30 @@ | |||
FROM mpioperator/base:v0.6.0 AS mpi | |||
FROM debian:trixie |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed with @tenzen-y that we will use debian image for MLX Runtime for now since we see problems with running OpenMPI in Fedora.
We will investigate later if we require to improve this runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This interesting initiative, great job everyone!
But I wonder how you plan to run this on Linux since at the moment MLX only runs on Apple silicon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Blaizzy! As you can see this image builds mlx
and mlx-data
from source, also MLX already runs some of their CI tests on Linux machines: https://github.com/ml-explore/mlx/blob/main/.circleci/config.yml#L72
I believe, that might be useful for folks who want to experiment with distributed MLX and MPI inside Kubernetes environment.
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
pip_index_url, | ||
packages_str, | ||
# For the OpenMPI, the packages must be installed for the mpiuser. | ||
"--user" if runtime_entrypoint[0] == constants.MPI_ENTRYPOINT else "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a bug that for OpenMPI, we should install packages for the user.
/hold |
Signed-off-by: Andrey Velichkevich <[email protected]>
59292fb
to
dd88100
Compare
cmd/runtimes/deepspeed/Dockerfile
Outdated
# Give mpiuser permission to download packages and HF models. | ||
RUN chown -R mpiuser:mpiuser /home/mpiuser/.local | ||
RUN chown -R mpiuser:mpiuser /home/mpiuser/.cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, fixed the permission issue.
With that changes, the example works fine for me cc @tenzen-y
/hold cancel |
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Fixes: #2047
This is implementation of MLX distributed runtime in Kubeflow Trainer V2 🎉
I added example for MNIST distributed training with MLX and local minikube cluster.
Let's merge it after: #2559, since this PR includes commit from that branch.
/cc @kubeflow/wg-training-leads @gaocegege @Electronic-Waste @astefanutti @saileshd1402 @shravan-achar @akshaychitneni @franciscojavierarceo @awni @angeloskath @Blaizzy @jeiksegovia @mstei4176