Add KFTO Distributed training notebook and scripts #103
RHRolun merged 4 commits into rh-aiservices-bu:main
/lgtm |
Awesome PR!
A few things besides the error I got:
- Can you also add a draft for the documentation for this chapter?
- This currently relies on the PyTorch notebook image, while the rest of the workshop relies on the TensorFlow image. To unify this we may need to change the workshop to use PyTorch instead (I believe KFTO is PyTorch-only?). Luckily we already have a draft PR for that here, but it needs some work before it can be merged and synced with the docs. Nothing is needed from this PR; it just may take a little extra time to get it merged.
- All parts of this workshop should be runnable in the sandbox. This looks like it should work, given how few resources it uses (and I believe KFTO is enabled by default in RHOAI now). I haven't tested it yet because I hit the error, so this is just a heads-up in case it leads to more changes.
Thanks! :)
| "# If Kueue component is enabled then you must create all Kueue related resources (ResourceFlavor, ClusterQueue and LocalQueue) and provide LocalQueue name here.\n", | ||
| "local_queue_name = \"local-queue\"\n", | ||
| "\n", | ||
| "client.create_job(\n", |
When running this with a PyTorch notebook image version 2025.1 I get this error:
TypeError: TrainingClient.create_job() got an unexpected keyword argument 'labels'
Should a different image be used?
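One quick way to tell whether the SDK in a given image is new enough is to inspect the signature of `create_job` before calling it. The sketch below is only an illustration: `accepts_keyword` is a hypothetical helper, and `create_job_old` / `create_job_new` are stand-ins for an older and a newer SDK signature, not the real SDK functions.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an older and a newer create_job signature.
def create_job_old(name, train_func=None, num_workers=1):
    pass


def create_job_new(name, train_func=None, num_workers=1, labels=None):
    pass


print(accepts_keyword(create_job_old, "labels"))  # False
print(accepts_keyword(create_job_new, "labels"))  # True
```

In the notebook, the same check would be `accepts_keyword(client.create_job, "labels")`, assuming `client` is the `TrainingClient` instance used in the cell above.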
Hi @RHRolun, the labels parameter is actually supported in the latest version of the Kubeflow Training SDK, 1.9.2.
@sutaakar
IMO this note is needed here, WDYT?
Note: This workshop requires Red Hat OpenShift AI v2.21+ or Kubeflow Training SDK v1.9.2+
AFAIK it will require the latest KFTO SDK, which should be available in the next RHOAI release, 2.22.
You can try to manually update the KFTO SDK (and restart the kernel), which should fix it: pip install -U kubeflow-training
Since the PR will take a while to get merged, 2.22 should be out by then, so I think the SDK upgrade doesn't need to be included?
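To confirm which SDK version a workbench actually has before or after the upgrade, a cell along these lines could be used. This is a minimal sketch: the package name `kubeflow-training` and the 1.9.2 minimum are taken from this thread, while `version_tuple`, `meets_minimum`, and `installed_sdk_version` are hypothetical helper names, and the version parsing is deliberately simplistic (it ignores pre-release suffixes).

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '1.9.2' (or '1.9.2rc1') into (1, 9, 2), ignoring any suffix."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="1.9.2"):
    """Compare two dotted version strings numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_sdk_version(package="kubeflow-training"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_sdk_version()
if version is None:
    print("kubeflow-training is not installed in this environment")
elif meets_minimum(version):
    print(f"kubeflow-training {version} should accept the labels parameter")
else:
    print(f"kubeflow-training {version} is older than 1.9.2; upgrade and restart the kernel")
```

The numeric comparison matters because a plain string comparison would rank "1.10.0" below "1.9.2".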
JFI: The Kubeflow Training SDK v1.9.2 is also available in RHOAI release 2.21.
Another option is to remove the labels completely, as they are only needed for Kueue integration.
I will most likely comment them out and add a note to enable them for environments with Kueue.
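An alternative to commenting the line in and out would be to build the arguments conditionally, so that `labels` is only passed when a LocalQueue name is set. The following is a rough sketch, not the notebook's code: `build_job_kwargs` and the job name `pytorch-ddp` are hypothetical, and it assumes the label key `kueue.x-k8s.io/queue-name` that Kueue uses to assign a workload to a LocalQueue.

```python
def build_job_kwargs(name, num_workers, local_queue_name=None):
    """Build keyword arguments for create_job, adding labels only for Kueue."""
    kwargs = {"name": name, "num_workers": num_workers}
    if local_queue_name:
        # Only environments with Kueue enabled need this label; older SDKs
        # reject the labels argument, so it is omitted otherwise.
        kwargs["labels"] = {"kueue.x-k8s.io/queue-name": local_queue_name}
    return kwargs


# Without Kueue: no labels key at all.
print(build_job_kwargs("pytorch-ddp", 2))

# With Kueue: the LocalQueue name is attached as a label.
print(build_job_kwargs("pytorch-ddp", 2, local_queue_name="local-queue"))
```

The result would then be unpacked into the call, for example `client.create_job(**kwargs, ...)` together with the remaining arguments from the notebook.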
|
Sure, will add a documentation draft. |
|
Currently the RHOAI training operator supports only PyTorch, so the example requires the PyTorch workbench image.
|
@RHRolun Made some minor adjustments and added documentation. Will test the script on the sandbox as soon as the training operator component gets enabled there.
Fiona-Waters left a comment
Just a few small nits. Thanks @sutaakar
workshop/docs/modules/ROOT/pages/setting-up-kueue-resources.adoc
|
@Fiona-Waters thanks |
Signed-off-by: Karel Suta <ksuta@redhat.com>
Co-authored-by: Fiona Waters <fiwaters6@gmail.com>
|
@RHRolun The example was updated so that it can run on the sandbox (the CPUs for the training workers needed to be reduced).
|
Awesome! :) A couple of things:
When trying to uncomment the labels line in the send job section I got this:
Also, just out of curiosity: right now all the logs get posted as one chunk when the job finishes. Is there a way to stream them into the notebook instead?
Added
This should be addressed by installing the latest KFTO SDK.
Unfortunately, streaming logs doesn't seem to be reliable enough: the stream gets stuck. It may be better to keep the current approach for now, until the issue is fixed.
|
I have some doc edits which I will make in a separate PR. |
No description provided.