KEP-2401: Kubeflow LLM Trainer V2 #2401
/kind feature

@Electronic-Waste Thank you for creating a dedicated issue for each task! /area llm

@andreyvelich Thank you for this! I'm also wondering if we could pin this issue so that more people can track our progress. It has currently slipped to the second page of issues, which makes it easy for other folks to miss. :)

Yes, let me pin it.
This is the tracking issue for the Kubeflow LLM Trainer V2, a submodule of Kubeflow Training V2: #2170
We aim to solve:
However, the LLM Trainer V2 design is complex and needs further discussion, so we decided to open a separate issue to track it.
- `train()` API #2503
- `TorchTuneConfig` to `train()` API #2504
- `torch` plugin to support `torchtune` config mutation #2507
- `torch` plugin #2508
- `torchtune` trainer image #2511

KEP Updates:
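To make the task list above more concrete, here is a minimal sketch of how a `TorchTuneConfig` might be passed into a `train()` API. This is purely illustrative: the class, its fields (`dtype`, `batch_size`, `epochs`), the `train()` signature, and the runtime name `torchtune-llama3-8b` are all assumptions, not the actual Kubeflow Training V2 SDK, whose final shape is being designed in the linked issues.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TorchTuneConfig:
    """Hypothetical config object carrying torchtune-specific overrides.
    Field names are illustrative only; the real SDK may differ."""
    dtype: str = "bf16"
    batch_size: int = 1
    epochs: int = 1


def train(runtime: str, config: Optional[TorchTuneConfig] = None) -> dict:
    """Toy stand-in for a generic train() entry point: it would resolve the
    named training runtime and apply the user's torchtune config overrides."""
    config = config or TorchTuneConfig()
    return {
        "runtime": runtime,
        "overrides": {
            "dtype": config.dtype,
            "batch_size": config.batch_size,
            "epochs": config.epochs,
        },
    }


# Example: fine-tune with a non-default precision and epoch count.
job = train("torchtune-llama3-8b", TorchTuneConfig(dtype="fp32", epochs=3))
print(job["overrides"]["epochs"])  # -> 3
```

The design intent captured by the issues above is that the `torch` runtime plugin would mutate the `torchtune` recipe config from such user-supplied overrides, rather than requiring users to hand-edit recipe files.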
Initial Design (Google Doc): Kubeflow Training V2 LLM Trainer Design
/area runtime
/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0