
Add Helm chart for kubeflow trainer #2435

Open
wants to merge 1 commit into
base: master

@ChenYi015 ChenYi015 commented Feb 13, 2025

What this PR does / why we need it:

Close #1197

Checklist:

  • Docs included if any changes are user facing

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -0,0 +1,105 @@
# trainer
Member

Thank you for doing this @ChenYi015!
Did you get a chance to explore the Kueue script that generates Helm charts from Kustomize manifests?
I think that should significantly help us keep the Kustomize manifests and the charts in sync.

https://github.com/kubernetes-sigs/kueue/blob/main/hack/update-helm.sh

Author

I will take a look at the shell script; it is a bit complicated. I think it may be easier to sync the manifests templated by the Helm chart into the Kustomize manifests. WDYT?

Author

I have added a shell script hack/sync-manifests. Now one can execute make sync-manifests to sync the Kustomize manifests from the manifests templated by the Helm chart.
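
The direction of syncing described above (render the chart, then write the result into the Kustomize manifests) can be sketched roughly as follows. This is only an illustration and not the actual hack/sync-manifests script from this PR: the function names, the arguments, and the choice of which labels to filter out are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: the real hack/sync-manifests script may work differently.
# All function names and arguments below are hypothetical.

# Drop labels that only make sense for a Helm release.
# The two label keys are an assumption about what the chart emits.
strip_helm_labels() {
  grep -v -E '^[[:space:]]*(helm\.sh/chart|app\.kubernetes\.io/managed-by):'
}

# Render the chart with "helm template" and write the filtered output
# to a file inside the Kustomize manifests directory.
#   $1: path to the chart directory
#   $2: output file for the rendered manifests
sync_manifests() {
  local chart_dir="$1" out_file="$2"
  (
    # Fail if either helm or the filter fails.
    set -o pipefail
    helm template kubeflow-trainer "${chart_dir}" \
      --namespace kubeflow-system \
      | strip_helm_labels > "${out_file}"
  )
}
```

A call such as `sync_manifests charts/kubeflow-trainer /tmp/rendered.yaml` would render the chart once; splitting the output into per-resource files is left out of this sketch.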

Member

Thanks for this effort @ChenYi015!
@tenzen-y @kannon92 @ahg-g @astefanutti @Electronic-Waste @kubeflow/wg-training-leads What do you think about it?
Should we go the other way around and sync from the Helm charts to the Kustomize manifests?
We can use the same approach for JobSet/Kueue if that is easier.

Member

cc @kubeflow/wg-manifests-leads @kubeflow/release-team @varodrig to review the script that syncs Kustomize and Helm automatically.

@ChenYi015 ChenYi015 marked this pull request as ready for review February 14, 2025 09:04
@google-oss-prow google-oss-prow bot requested a review from jinchihe February 14, 2025 09:04
@ChenYi015 ChenYi015 marked this pull request as draft February 14, 2025 09:05
@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch 2 times, most recently from 7dc781c to c0b1d7d Compare February 14, 2025 10:54
@andreyvelich
Member

@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR?
@kannon92 did we push charts to the OCI registry after this PR: kubernetes-sigs/jobset#792 ?

@kannon92
Contributor

@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR? @kannon92 did we push charts to the OCI registry after this PR: kubernetes-sigs/jobset#792 ?

Not yet. We are still working on that change. Once we have a release it should push to the registry.

@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from a4b36db to 4cd2344 Compare February 25, 2025 03:42
@ChenYi015 ChenYi015 marked this pull request as ready for review February 25, 2025 03:48
@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from 4cd2344 to 77c023d Compare February 26, 2025 07:26
@ChenYi015
Author

/hold while waiting for the JobSet Helm chart to be published

Contributor

@varodrig varodrig left a comment

This is great work! Thank you for working on this.

I left a couple of comments and recommendations.

@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from 77c023d to 097e217 Compare March 3, 2025 03:16
@ChenYi015
Author

/hold cancel, since JobSet chart v0.8.0 has been released. I have updated the PR; the Helm chart installation can now be tested with the following command:

helm upgrade kubeflow-trainer charts/kubeflow-trainer \
      --install \
      --dependency-update \
      --namespace kubeflow-system \
      --create-namespace \
      --wait \
      --timeout 5m0s \
      --set jobset.install=true

@ChenYi015
Author

/unhold

@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from 097e217 to 6620551 Compare March 3, 2025 07:54
@varodrig
Contributor

varodrig commented Mar 3, 2025

@ChenYi015 keep me posted once you have the updates. Thank you again for working on this.

Member

@andreyvelich andreyvelich left a comment

Thank you for this great contribution @ChenYi015!
I left my initial comments.
/assign @kubeflow/wg-training-leads @franciscojavierarceo @Electronic-Waste @astefanutti @saileshd1402 @kubeflow/wg-manifests-leads @chasecadet

@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from 6620551 to 0917b10 Compare March 4, 2025 03:41
@chasecadet

Thank you for this great contribution @ChenYi015! I left my initial comments. /assign @kubeflow/wg-training-leads @franciscojavierarceo @Electronic-Waste @astefanutti @saileshd1402 @kubeflow/wg-manifests-leads @chasecadet

This is great! @ChenYi015 please add any implementation details or really anything to https://github.com/chasecadet/KEP/tree/master/proposals/649-kubeflow-helm-support. We are working on a proposal and would love your feedback!

@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch 5 times, most recently from 952879b to 6dfc8a7 Compare March 6, 2025 07:22
Comment on lines 19 to 20
- clusterrole.yaml
- clusterrolebinding.yaml
Member

I am not sure that we should do that.
It is easier for us to leverage kubebuilder to define RBAC for each plugin that our runtime framework implements, so we always understand why we need to have this policy:
https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/mpi/mpi.go#L61-L62

cc @tenzen-y

@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from 4d6f4ca to c43de99 Compare March 7, 2025 07:41
Comment on lines +122 to +129
setup
update_crds
# Will not sync RBAC from Helm charts, for now we are using Kubebuilder to generate RBAC files.
# update_rbac
update_controller_manager
update_webhook
# There is something annoying when managing training runtimes in the trainer Helm chart, maybe we should mange runtimes in a separated Helm chart?
# update_runtimes
Author

I have skipped syncing the RBAC files from the Helm chart; for now we will keep using Kubebuilder to generate the RBAC manifests.

We should think of a better way of managing training runtimes with Helm. I will open another issue to track that and implement it in the future.

@ChenYi015
Author

I think I have addressed most of the comments, PTAL when you have time. @andreyvelich @kubeflow/wg-training-leads

Member

@andreyvelich andreyvelich left a comment

Sorry for the late review, I left a few more comments.

Comment on lines +24 to +26
app.kubernetes.io/instance: kubeflow-trainer
app.kubernetes.io/version: "2.0.0"
app.kubernetes.io/managed-by: Kustomize
Member

Let's remove these labels for now, since we need to discuss how to keep them correct.

Suggested change
-    app.kubernetes.io/instance: kubeflow-trainer
-    app.kubernetes.io/version: "2.0.0"
-    app.kubernetes.io/managed-by: Kustomize

@@ -0,0 +1,77 @@
#
Member

Can we keep the name of this file as manager.yaml and include both the Deployment and the Service in it?
I don't think it is necessary to separate them.

app.kubernetes.io/part-of: kubeflow
app.kubernetes.io/component: manager
spec:
replicas: 1
Member

This is not needed.

Suggested change
-  replicas: 1

Comment on lines +32 to +36
matchLabels:
app.kubernetes.io/name: kubeflow-trainer
app.kubernetes.io/instance: kubeflow-trainer
app.kubernetes.io/part-of: kubeflow
app.kubernetes.io/component: manager
Member

Suggested change
-  matchLabels:
-    app.kubernetes.io/name: kubeflow-trainer
-    app.kubernetes.io/instance: kubeflow-trainer
-    app.kubernetes.io/part-of: kubeflow
-    app.kubernetes.io/component: manager
+  matchLabels:
+    app.kubernetes.io/name: kubeflow-trainer
+    app.kubernetes.io/part-of: kubeflow
+    app.kubernetes.io/component: manager

spec:
containers:
- name: manager
image: kubeflow/trainer-controller-manager:2.0.0
Member

This tag doesn't exist, and we should not add a tag to the base manifests.

Suggested change
-  image: kubeflow/trainer-controller-manager:2.0.0
+  image: kubeflow/trainer-controller-manager
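
The rule stated above (no image tag in the base manifests) could be checked mechanically. The following is a minimal sketch under stated assumptions: `find_tagged_images` is a hypothetical helper, not something that exists in this PR or in the repository, and the regular expression is a simple heuristic rather than a full image-reference parser.

```shell
#!/usr/bin/env bash
# Hypothetical lint helper (not part of the PR): report "image:" lines in a
# manifest that pin a tag or digest.

# Prints offending lines with their line numbers.
# Returns 0 when no tagged image is found, 1 otherwise.
#   $1: path to a YAML manifest
find_tagged_images() {
  local file="$1"
  # A tag or digest is a ':' or '@' followed by characters that contain no
  # '/', so a registry port such as localhost:5000/img is not flagged.
  if grep -n -E '^[[:space:]]*(-[[:space:]]+)?image:[[:space:]]*[^[:space:]]+[:@][^[:space:]/]+[[:space:]]*$' "${file}"; then
    return 1
  fi
  return 0
}
```

Running `find_tagged_images` on each file under the base manifests directory would flag the line discussed in this comment, while an untagged image passes.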

kind: Deployment
metadata:
name: kubeflow-trainer-controller-manager
namespace: kubeflow-system
Member

The namespace should be populated by overlays

Suggested change
-  namespace: kubeflow-system

Comment on lines +20 to +34
name: kubeflow-trainer-controller-manager
namespace: kubeflow-system
labels:
app.kubernetes.io/name: kubeflow-trainer
app.kubernetes.io/instance: kubeflow-trainer
app.kubernetes.io/version: "2.0.0"
app.kubernetes.io/managed-by: Kustomize
app.kubernetes.io/part-of: kubeflow
app.kubernetes.io/component: manager
spec:
selector:
app.kubernetes.io/name: kubeflow-trainer
app.kubernetes.io/instance: kubeflow-trainer
app.kubernetes.io/part-of: kubeflow
app.kubernetes.io/component: manager
Member

Suggested change
-  name: kubeflow-trainer-controller-manager
-  namespace: kubeflow-system
-  labels:
-    app.kubernetes.io/name: kubeflow-trainer
-    app.kubernetes.io/instance: kubeflow-trainer
-    app.kubernetes.io/version: "2.0.0"
-    app.kubernetes.io/managed-by: Kustomize
-    app.kubernetes.io/part-of: kubeflow
-    app.kubernetes.io/component: manager
-spec:
-  selector:
-    app.kubernetes.io/name: kubeflow-trainer
-    app.kubernetes.io/instance: kubeflow-trainer
-    app.kubernetes.io/part-of: kubeflow
-    app.kubernetes.io/component: manager
+  name: kubeflow-trainer-controller-manager
+  labels:
+    app.kubernetes.io/name: kubeflow-trainer
+    app.kubernetes.io/part-of: kubeflow
+    app.kubernetes.io/component: manager
+spec:
+  selector:
+    app.kubernetes.io/name: kubeflow-trainer
+    app.kubernetes.io/part-of: kubeflow
+    app.kubernetes.io/component: manager

@@ -0,0 +1,39 @@
#
Member

This is not needed, since we currently manage the webhook service port under the manager Service:

- name: webhook-server

I am not sure it makes sense to separate them, since we run one Deployment for the manager and the webhook.
Thoughts, @tenzen-y @astefanutti?

@@ -1,66 +0,0 @@
---
Member

@andreyvelich andreyvelich Mar 13, 2025

This file needs to keep this name (manifests.yaml) so that kubebuilder can generate the manifests.

@@ -1,3 +1,19 @@
#
# Copyright 2024 The Kubeflow authors.
Member

Can we update the year in the license header, please?

Successfully merging this pull request may close these issues.

Add Helm Charts for Kubeflow Trainer V2
6 participants