-
Notifications
You must be signed in to change notification settings - Fork 758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Helm chart for kubeflow trainer #2435
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
charts/trainer/README.md
Outdated
@@ -0,0 +1,105 @@ | |||
# trainer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this @ChenYi015!
Did you get a chance to explore Kueue script to generate Helm Charts from Kustomize manifests ?
I think, that should significantly help us to keep Kustomize and Charts in sync.
https://github.com/kubernetes-sigs/kueue/blob/main/hack/update-helm.sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a look at the shell script, it is a bit complicated. I think it may be more easier to sync manifests templated by Helm charts to Kustomize manifests, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a shell script hack/sync-manifests
. Now one can execute make sync-manifests
to sync the Kustomize manifests from the manifests templated by the Helm chart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this effort @ChenYi015!
@tenzen-y @kannon92 @ahg-g @astefanutti @Electronic-Waste @kubeflow/wg-training-leads What do you think about it ?
Should we go other way around to sync Helm Charts to Kustomize Manifests ?
We can use the same approach for JobSet/Kueue if that is easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @kubeflow/wg-manifests-leads @kubeflow/release-team @varodrig to review script of sync Kustomize + Helm automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7dc781c
to
c0b1d7d
Compare
@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR? |
Not yet. We are still working on that change. Once we have a release it should push to the registry. |
a4b36db
to
4cd2344
Compare
4cd2344
to
77c023d
Compare
/hold for waiting jobset helm chart to be published |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great work! Thank for working on this.
I left a couple of comments and recommendations.
77c023d
to
097e217
Compare
/hold cancel for jobset chart v0.8.0 has been released. I have updated the PR, now one can test the Helm chart installation with the following comand:
|
/unhold |
097e217
to
6620551
Compare
@ChenYi015 keep me posted once you have the updates . thank you again for working on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this great contribution @ChenYi015!
I left my initial comments.
/assign @kubeflow/wg-training-leads @franciscojavierarceo @Electronic-Waste @astefanutti @saileshd1402 @kubeflow/wg-manifests-leads @chasecadet
6620551
to
0917b10
Compare
This is great! @ChenYi015 please add any implementation details or really anything to https://github.com/chasecadet/KEP/tree/master/proposals/649-kubeflow-helm-support. We are working on a proposal and would love your feedback! |
952879b
to
6dfc8a7
Compare
- clusterrole.yaml | ||
- clusterrolebinding.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that we should do that.
It is easier for us to leverage kubebuilder to define RBAC for each plugin that our runtime framework implements, so we always understand why we need to have this policy:
https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/mpi/mpi.go#L61-L62
cc @tenzen-y
659aeca
to
4d6f4ca
Compare
Signed-off-by: Yi Chen <[email protected]>
4d6f4ca
to
c43de99
Compare
setup | ||
update_crds | ||
# Will not sync RBAC from Helm charts, for now we are using Kubebuilder to generate RBAC files. | ||
# update_rbac | ||
update_controller_manager | ||
update_webhook | ||
# There is something annoying when managing training runtimes in the trainer Helm chart, maybe we should mange runtimes in a separated Helm chart? | ||
# update_runtimes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have skipped syncing RBAC files from the Helm chart, for now we will keep using Kubebuilder to generate RBAC manifests.
We should think of a better way of managing training runtimes with Helm, I will make another issue to track that and implement it in the future.
I think I have addressed most of the comments, PTAL when you have time. @andreyvelich @kubeflow/wg-training-leads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review, I left a few more comments.
app.kubernetes.io/instance: kubeflow-trainer | ||
app.kubernetes.io/version: "2.0.0" | ||
app.kubernetes.io/managed-by: Kustomize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove these labels for now, since we need to talk how to keep them correct.
app.kubernetes.io/instance: kubeflow-trainer | |
app.kubernetes.io/version: "2.0.0" | |
app.kubernetes.io/managed-by: Kustomize |
@@ -0,0 +1,77 @@ | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep name of this file manager.yaml
and include Deployment + Service in this file ?
I don't think it is necessary to distinguish it.
app.kubernetes.io/part-of: kubeflow | ||
app.kubernetes.io/component: manager | ||
spec: | ||
replicas: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed.
replicas: 1 |
matchLabels: | ||
app.kubernetes.io/name: kubeflow-trainer | ||
app.kubernetes.io/instance: kubeflow-trainer | ||
app.kubernetes.io/part-of: kubeflow | ||
app.kubernetes.io/component: manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matchLabels: | |
app.kubernetes.io/name: kubeflow-trainer | |
app.kubernetes.io/instance: kubeflow-trainer | |
app.kubernetes.io/part-of: kubeflow | |
app.kubernetes.io/component: manager | |
matchLabels: | |
app.kubernetes.io/name: kubeflow-trainer | |
app.kubernetes.io/part-of: kubeflow | |
app.kubernetes.io/component: manager |
spec: | ||
containers: | ||
- name: manager | ||
image: kubeflow/trainer-controller-manager:2.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tag doesn't exist, and we should not add tag to the base manifests.
image: kubeflow/trainer-controller-manager:2.0.0 | |
image: kubeflow/trainer-controller-manager |
kind: Deployment | ||
metadata: | ||
name: kubeflow-trainer-controller-manager | ||
namespace: kubeflow-system |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The namespace should be populated by overlays
namespace: kubeflow-system |
name: kubeflow-trainer-controller-manager | ||
namespace: kubeflow-system | ||
labels: | ||
app.kubernetes.io/name: kubeflow-trainer | ||
app.kubernetes.io/instance: kubeflow-trainer | ||
app.kubernetes.io/version: "2.0.0" | ||
app.kubernetes.io/managed-by: Kustomize | ||
app.kubernetes.io/part-of: kubeflow | ||
app.kubernetes.io/component: manager | ||
spec: | ||
selector: | ||
app.kubernetes.io/name: kubeflow-trainer | ||
app.kubernetes.io/instance: kubeflow-trainer | ||
app.kubernetes.io/part-of: kubeflow | ||
app.kubernetes.io/component: manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name: kubeflow-trainer-controller-manager | |
namespace: kubeflow-system | |
labels: | |
app.kubernetes.io/name: kubeflow-trainer | |
app.kubernetes.io/instance: kubeflow-trainer | |
app.kubernetes.io/version: "2.0.0" | |
app.kubernetes.io/managed-by: Kustomize | |
app.kubernetes.io/part-of: kubeflow | |
app.kubernetes.io/component: manager | |
spec: | |
selector: | |
app.kubernetes.io/name: kubeflow-trainer | |
app.kubernetes.io/instance: kubeflow-trainer | |
app.kubernetes.io/part-of: kubeflow | |
app.kubernetes.io/component: manager | |
name: kubeflow-trainer-controller-manager | |
labels: | |
app.kubernetes.io/name: kubeflow-trainer | |
app.kubernetes.io/part-of: kubeflow | |
app.kubernetes.io/component: manager | |
spec: | |
selector: | |
app.kubernetes.io/name: kubeflow-trainer | |
app.kubernetes.io/part-of: kubeflow | |
app.kubernetes.io/component: manager |
@@ -0,0 +1,39 @@ | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed, since we currently manage webhook service port under manager service:
- name: webhook-server |
Not sure if that make sense to separate them since we run 1 Deployment for Manager + Webhook.
Thoughts @tenzen-y @astefanutti ?
@@ -1,66 +0,0 @@ | |||
--- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be with this name (manfiests.yaml) to allow kubebuilder generates manifests.
@@ -1,3 +1,19 @@ | |||
# | |||
# Copyright 2024 The Kubeflow authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update the year for header licence please ?
What this PR does / why we need it:
Close #1197
Checklist: