
ttlSecondsAfterFinished for MPIJob, not only launcher #644

Open
hy00nc opened this issue May 27, 2024 · 6 comments


hy00nc commented May 27, 2024

Is there a plan to extend ttlSecondsAfterFinished to the MPIJob level, not just the launcher?

alculquicondor (Collaborator) commented

Do you mean that you want to keep the Pod objects until the TTL expires?

Or do you want to keep them running?

hy00nc (Author) commented May 27, 2024

@alculquicondor, thanks for the reply. I want the MPIJob resource itself to be deleted after the TTL, just like how ttlSecondsAfterFinished works for MPIJob v1. In the current implementation, it stays around until it is deleted explicitly, right?
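
For concreteness, the kind of knob being asked for might look roughly like the sketch below: a TTL on the MPIJob's run policy that the controller would honor by deleting the whole MPIJob object, not just the launcher Job. The type and field names here are illustrative, not necessarily the actual v2beta1 API.

```go
// Illustrative sketch only; not the actual mpi-operator v2beta1 types.
package sketch

// RunPolicySketch shows where an MPIJob-level TTL could live.
type RunPolicySketch struct {
	// TTLSecondsAfterFinished is how long a finished (Succeeded or Failed)
	// MPIJob is kept before the controller deletes the MPIJob object itself,
	// analogous to ttlSecondsAfterFinished on a batch/v1 Job.
	// nil means the MPIJob is kept until it is deleted explicitly.
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
}
```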

alculquicondor (Collaborator) commented

oh, gotcha. I don't know if that's how other Kubeflow APIs work. If they do, we can bring MPIJob back to parity.

tenzen-y (Member) commented

> oh, gotcha. I don't know if that's how other Kubeflow APIs work. If they do, we can bring MPIJob back to parity.

Indeed, the other Jobs are removed after ttlSecondsAfterFinished, like this:

https://github.com/kubeflow/training-operator/blob/be5df91eb43e2fdfa1b0a7005f7aeb8cc3a52fb1/pkg/controller.v1/common/job.go#L428-L435
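
For readers who don't follow the link, the behavior is roughly along these lines. This is a simplified sketch, not the actual training-operator code; the type and function names are illustrative. Once a job has finished and its TTL has elapsed since the completion time, the controller deletes the job object itself; otherwise it requeues until the TTL expires.

```go
package main

import (
	"fmt"
	"time"
)

// finishedJob carries just the fields the TTL check needs.
type finishedJob struct {
	name                    string
	completionTime          time.Time
	ttlSecondsAfterFinished *int32 // nil: keep the job until deleted explicitly
}

// shouldCleanUp reports whether the finished job's TTL has elapsed and,
// if not, how long to requeue before checking again.
func shouldCleanUp(j finishedJob, now time.Time) (bool, time.Duration) {
	if j.ttlSecondsAfterFinished == nil {
		return false, 0
	}
	expiry := j.completionTime.Add(time.Duration(*j.ttlSecondsAfterFinished) * time.Second)
	if !now.Before(expiry) {
		return true, 0 // TTL elapsed: delete the job object itself
	}
	return false, expiry.Sub(now) // requeue and re-check once the TTL expires
}

func main() {
	ttl := int32(30)
	j := finishedJob{
		name:                    "example-job",
		completionTime:          time.Now().Add(-time.Minute),
		ttlSecondsAfterFinished: &ttl,
	}
	cleanUp, requeueAfter := shouldCleanUp(j, time.Now())
	fmt.Println("delete now:", cleanUp, "requeue after:", requeueAfter)
}
```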

hy00nc (Author) commented May 28, 2024

Would it make sense to extend activeDeadlineSeconds and backoffLimit as well? I guess these are also currently limited to the launcher, but the other Kubeflow jobs apply them at the job level.

alculquicondor (Collaborator) commented

Those should be fine on just the launcher Job, because the launcher Job is what controls the execution. If it finishes as Failed, the rest of the pods would terminate too, IIRC.
