Add KFTO Distributed training notebook and scripts#103

Merged
RHRolun merged 4 commits into rh-aiservices-bu:main from sutaakar:adjustments
Jul 18, 2025

@sutaakar
Contributor

No description provided.

@ChughShilpa

/lgtm

@abhijeet-dhumal abhijeet-dhumal left a comment

/lgtm

Contributor

@RHRolun RHRolun left a comment

Awesome PR!

A few things besides the error I got:

  • Can you also add a draft of the documentation for this chapter?
  • This currently relies on the PyTorch notebook image, while the rest of the workshop relies on the TensorFlow image. To unify this, we may need to change the workshop to use PyTorch instead (I believe KFTO is PyTorch-only?). Luckily we already have a draft PR for that here, but it needs some work before it can be merged and synced with the docs. Nothing is needed from this PR; it may just take a little extra time to get it merged.
  • All parts of this workshop should be runnable in the sandbox. This looks like it should work, given how few resources it uses (and I believe KFTO is enabled by default in RHOAI now). I haven't tested it yet because I hit the error, so this is just a heads-up in case it leads to more changes.

Thanks! :)

"# If Kueue component is enabled then you must create all Kueue related resources (ResourceFlavor, ClusterQueue and LocalQueue) and provide LocalQueue name here.\n",
"local_queue_name = \"local-queue\"\n",
"\n",
"client.create_job(\n",
Contributor

When running this with a PyTorch notebook image version 2025.1 I get this error:
TypeError: TrainingClient.create_job() got an unexpected keyword argument 'labels'

Should a different image be used?

Hi @RHRolun, the labels parameter is supported in the latest version of the Kubeflow Training SDK, 1.9.2.

@sutaakar
IMO this note is needed here, WDYT?

Note: This workshop requires Red Hat OpenShift AI v2.21+ or Kubeflow Training SDK v1.9.2+

Contributor Author

AFAIK it will require the latest KFTO SDK, which should be available in the next RHOAI release, 2.22.
You can try manually updating the KFTO SDK (and restarting the kernel), which should fix it: pip install -U kubeflow-training

Since this PR will take a while to merge, 2.22 should be out by then, so I think the SDK upgrade doesn't need to be included?
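For anyone hitting the same TypeError, the check below is a minimal sketch of how a notebook cell could confirm the installed SDK version before calling create_job. It assumes the distribution name is kubeflow-training (as in the pip command above) and that 1.9.2 is the first version accepting labels, as stated earlier in this thread; the helper names parse_version and sdk_is_new_enough are hypothetical and not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption taken from this thread: `labels` is accepted from 1.9.2 onward.
MIN_SDK = (1, 9, 2)


def parse_version(text):
    """Turn '1.9.2' into (1, 9, 2); stops at non-numeric suffixes such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def sdk_is_new_enough(dist_name="kubeflow-training", minimum=MIN_SDK):
    """Return True if the installed distribution is at least `minimum`."""
    try:
        installed = version(dist_name)
    except PackageNotFoundError:
        # The package is not installed in this environment at all.
        return False
    return parse_version(installed) >= minimum


if not sdk_is_new_enough():
    print(
        "kubeflow-training is missing or older than 1.9.2; "
        "run 'pip install -U kubeflow-training' and restart the kernel"
    )
```

The kernel restart matters because an already-imported module keeps the old code even after pip upgrades the files on disk.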

Just FYI: the Kubeflow Training SDK v1.9.2 is also available in RHOAI release 2.21.

Contributor Author

Another option is to remove the labels completely, as they are needed only for Kueue integration.
I will most likely comment them out and add a note to enable them for environments with Kueue.
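As an alternative to commenting the line in and out by hand, the sketch below shows one way the notebook could add labels only when Kueue is in use and the installed create_job accepts that keyword. The helpers accepts_keyword and build_job_kwargs and the two stand-in functions are hypothetical; the label key kueue.x-k8s.io/queue-name is the standard Kueue queue label, but whether the notebook uses exactly this key is an assumption.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def build_job_kwargs(create_job, base_kwargs, use_kueue, local_queue_name):
    """Copy `base_kwargs`, adding Kueue labels only when they can be passed."""
    kwargs = dict(base_kwargs)
    if use_kueue and accepts_keyword(create_job, "labels"):
        # Assumed label key: the standard Kueue queue-name label.
        kwargs["labels"] = {"kueue.x-k8s.io/queue-name": local_queue_name}
    return kwargs


# Stand-ins for an older and a newer SDK method (hypothetical signatures).
def old_create_job(name, num_workers=1):
    return {"name": name, "num_workers": num_workers}


def new_create_job(name, num_workers=1, labels=None):
    return {"name": name, "num_workers": num_workers, "labels": labels}


old_kwargs = build_job_kwargs(old_create_job, {"name": "demo"}, True, "local-queue")
new_kwargs = build_job_kwargs(new_create_job, {"name": "demo"}, True, "local-queue")
print(old_kwargs)
print(new_kwargs)
```

In a real notebook the first argument would be the bound client.create_job method. Silently dropping the labels on an older SDK is a design choice here; on a cluster where Kueue is required, raising an error instead may be the safer behavior.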

@sutaakar
Contributor Author

Sure, will add a documentation draft.

@sutaakar
Contributor Author

sutaakar commented Jun 30, 2025

Currently the RHOAI training operator supports only PyTorch, so the example requires the PyTorch workbench image.

@sutaakar
Contributor Author

sutaakar commented Jun 30, 2025

@RHRolun Made some minor adjustments and added documentation.
I think the Kueue section would also be relevant to Ray training.

Will test the script on the Sandbox as soon as the training operator component gets enabled there.

Contributor

@Fiona-Waters Fiona-Waters left a comment

Just a few small nits. Thanks @sutaakar

@sutaakar
Contributor Author

sutaakar commented Jul 2, 2025

@Fiona-Waters thanks

Contributor

@Fiona-Waters Fiona-Waters left a comment

/lgtm

RHRolun and others added 2 commits July 2, 2025 12:15
Signed-off-by: Karel Suta <ksuta@redhat.com>
Co-authored-by: Fiona Waters <fiwaters6@gmail.com>
@sutaakar
Contributor Author

sutaakar commented Jul 2, 2025

@RHRolun The example was updated so that it can run on the sandbox (the CPUs for the training workers needed to be reduced).
It should work now.

@RHRolun
Contributor

RHRolun commented Jul 14, 2025

Awesome! :)

A couple of things:
It requires the Python dependency kubeflow-training==1.9.0, which is not in all notebook images (for example, the TensorFlow image that's currently used in the rest of fraud detection).
If we can either add a sentence saying it needs the PyTorch image or add a pip install to the notebook, then it looks good to me (I would lean towards the second option so that it's easier for users until we get the PyTorch branch merged).

When I tried uncommenting the labels line in the send job section, I got this:
TypeError: TrainingClient.create_job() got an unexpected keyword argument 'labels'

Also, just out of curiosity: right now all the logs get posted as one chunk when the job finishes. Is there a way to stream them into the notebook instead?

@sutaakar
Contributor Author

> It requires the python dependency kubeflow-training==1.9.0 which is not in all notebook images (for example the tensorflow image that's currently used in the rest of fraud detection).
> If we can add either a sentence about it needing to use the PyTorch image or have a pip install in the notebook then it looks good to me (I would lean towards the second option so that it's easier for the users until we get the PyTorch branch merged).

Added a pip install to install the latest KFTO SDK.

> When trying to uncomment the labels line in the send job section I got this:
> TypeError: TrainingClient.create_job() got an unexpected keyword argument 'labels'

This should be addressed by installing the latest KFTO SDK.

> Also, just out of curiosity, right now all the logs gets posted as a chunk when the job is finished. Is there a way to stream them into the notebook instead?

Unfortunately, streaming logs doesn't seem to be reliable enough yet: the stream gets stuck. It may be better to keep the current approach until the issue is fixed.

@MelissaFlinn
Collaborator

I have some doc edits which I will make in a separate PR.

@RHRolun RHRolun merged commit 36d216f into rh-aiservices-bu:main Jul 18, 2025
@sutaakar sutaakar deleted the adjustments branch July 21, 2025 11:25

6 participants