Add KFTO Distributed training notebook and scripts by sutaakar · Pull Request #103 · rh-aiservices-bu/fraud-detection

sutaakar · 2025-06-26T09:34:02Z

No description provided.

9_distributed_training_kfto.ipynb

ChughShilpa · 2025-06-27T07:18:15Z

/lgtm

abhijeet-dhumal

/lgtm

RHRolun

Awesome PR!

A few things besides the error I got:

Can you also add a draft for the documentation for this chapter?
This currently relies on the PyTorch notebook image, while the rest of the workshop relies on the TensorFlow image. To unify this we may need to change the workshop to use PyTorch instead (I believe that KFTO is PyTorch only?). Luckily we already have a draft PR for that here, but it needs some work before it can be merged and synced with the docs. Nothing is needed from this PR; it just may take a little extra time to get it merged.
All the parts of this workshop should be runnable in the sandbox. This seems like it should work, judging by how few resources it uses (and I believe KFTO is enabled by default in RHOAI now). I haven't tested it yet because I hit the error, so this is just a heads-up in case it requires further changes.

Thanks! :)

RHRolun · 2025-06-30T07:10:33Z

9_distributed_training_kfto.ipynb

+    "# If Kueue component is enabled then you must create all Kueue related resources (ResourceFlavor, ClusterQueue and LocalQueue) and provide LocalQueue name here.\n",
+    "local_queue_name = \"local-queue\"\n",
+    "\n",
+    "client.create_job(\n",


When running this with a PyTorch notebook image version 2025.1 I get this error:
TypeError: TrainingClient.create_job() got an unexpected keyword argument 'labels'

Should a different image be used?

Hi @RHRolun, actually the labels parameter is supported in the latest version of the Kubeflow Training SDK, 1.9.2.

@sutaakar
IMO this note is needed here, WDYT?

Note: This workshop requires Red Hat OpenShift AI v2.21+ or Kubeflow Training SDK v1.9.2+

AFAIK it will require the latest KFTO SDK, which should be available in the next RHOAI release, 2.22.
You can try manually updating the KFTO SDK (and restarting the kernel), which should fix it: pip install -U kubeflow-training

Since it will take a while for this PR to be merged, 2.22 should be out by then, so I think the SDK upgrade doesn't need to be included?

JFI: the Kubeflow Training SDK v1.9.2 is also available in RHOAI release 2.21.

Another option is to remove the labels completely, as they are only needed for Kueue integration.
I will most likely comment them out and add a note to enable them in environments with Kueue.
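
One way to keep the job-submission cell working on both older and newer SDKs is to check whether create_job accepts a labels argument before passing it. The following is only a sketch under stated assumptions, not the code from this PR: the helper names accepts_keyword and kueue_kwargs are hypothetical, and it assumes the queue is selected through the Kueue label kueue.x-k8s.io/queue-name.

```python
import inspect

# Label that Kueue reads to pick the LocalQueue for a workload (assumed here
# to be what the notebook's `labels` argument carries).
KUEUE_QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all also accepts arbitrary keywords.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def kueue_kwargs(create_job, local_queue_name):
    """Build the optional `labels` kwargs (hypothetical helper).

    Returns an empty dict when no queue name is given or when the installed
    SDK's `create_job` does not know about `labels`.
    """
    if local_queue_name and accepts_keyword(create_job, "labels"):
        return {"labels": {KUEUE_QUEUE_LABEL: local_queue_name}}
    return {}
```

In the notebook the result could be unpacked into the existing call, for example by appending **kueue_kwargs(client.create_job, local_queue_name) to the create_job arguments, so that an older SDK skips the label instead of raising the TypeError.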

sutaakar · 2025-06-30T08:18:31Z

Sure, will add a documentation draft.

sutaakar · 2025-06-30T08:19:32Z

Currently the RHOAI training operator supports only PyTorch, so the example requires the PyTorch workbench image.

sutaakar · 2025-06-30T14:22:47Z

@RHRolun Made some minor adjustments and added documentation.
I think the Kueue section would also be relevant to Ray training.

I will test the script on the sandbox as soon as the training operator component is enabled there.

Fiona-Waters

Just a few small nits. Thanks @sutaakar

9_distributed_training_kfto.ipynb

workshop/docs/modules/ROOT/pages/setting-up-kueue-resources.adoc

sutaakar · 2025-07-02T08:09:43Z

@Fiona-Waters thanks

Fiona-Waters

/lgtm

Signed-off-by: Karel Suta <ksuta@redhat.com>

Co-authored-by: Fiona Waters <fiwaters6@gmail.com>

sutaakar · 2025-07-02T10:20:09Z

@RHRolun The example was updated so that it can run on the sandbox (the CPU count for the training workers had to be reduced).
It should work now.

RHRolun · 2025-07-14T10:40:27Z

Awesome! :)

A couple of things:
It requires the Python dependency kubeflow-training==1.9.0, which is not in all notebook images (for example the TensorFlow image that's currently used in the rest of fraud detection).
If we can add either a sentence about it needing to use the PyTorch image or have a pip install in the notebook then it looks good to me (I would lean towards the second option so that it's easier for the users until we get the PyTorch branch merged).

When trying to uncomment the labels line in the send job section I got this:
TypeError: TrainingClient.create_job() got an unexpected keyword argument 'labels'

Also, just out of curiosity, right now all the logs get posted as a chunk when the job is finished. Is there a way to stream them into the notebook instead?

sutaakar · 2025-07-15T07:56:32Z

> It requires the Python dependency kubeflow-training==1.9.0, which is not in all notebook images (for example the TensorFlow image that's currently used in the rest of fraud detection).
> If we can add either a sentence about it needing to use the PyTorch image or have a pip install in the notebook then it looks good to me (I would lean towards the second option so that it's easier for the users until we get the PyTorch branch merged).

Added a pip install step to the notebook that installs the latest KFTO SDK.

> When trying to uncomment the labels line in the send job section I got this:
> TypeError: TrainingClient.create_job() got an unexpected keyword argument 'labels'

This should be addressed by installing the latest KFTO SDK.
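
To confirm that the kernel actually picked up a new enough SDK after the install, a check along these lines could be run in a notebook cell. This is a sketch, not part of the PR: the helper names are hypothetical, the 1.9.2 minimum is taken from the discussion above, and the version parsing is deliberately simplified (pre-release suffixes are ignored).

```python
import re
from importlib import metadata


def installed_version(dist_name="kubeflow-training"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Turn '1.9.2' (or '1.9.2rc1') into (1, 9, 2); suffixes are ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version, minimum="1.9.2"):
    """True if `version` is present and at least `minimum` (assumed threshold)."""
    return version is not None and release_tuple(version) >= release_tuple(minimum)


if __name__ == "__main__":
    found = installed_version()
    if meets_minimum(found):
        print(f"kubeflow-training {found} is new enough for the labels argument")
    else:
        print(f"kubeflow-training is {found}; upgrade it and restart the kernel")
```

The kernel restart matters because a package that was already imported keeps its old code until the interpreter is restarted.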

> Also, just out of curiosity, right now all the logs get posted as a chunk when the job is finished. Is there a way to stream them into the notebook instead?

Unfortunately, streaming logs doesn't seem to be reliable enough: the stream gets stuck. It may be better to keep the current approach for now, until the issue is fixed.
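
Short of streaming, a notebook can still show progress by polling the job state and printing each change. The sketch below illustrates that general pattern only; it is not the waiting code added in this PR. The get_status callable is hypothetical and supplied by the caller, and the status strings are assumptions, so no specific SDK call is implied.

```python
import time


def wait_for_job(get_status, interval=5.0, timeout=600.0,
                 sleep=time.sleep, report=print):
    """Poll `get_status()` until a terminal state is seen or `timeout` elapses.

    `get_status` is a hypothetical callable returning a short status string
    such as "Running", "Succeeded" or "Failed". `sleep` and `report` are
    injectable so the loop can be exercised without real waiting.
    """
    terminal = {"Succeeded", "Failed"}
    waited = 0.0
    last = None
    while True:
        status = get_status()
        if status != last:
            # Only print when the state changes, to keep notebook output short.
            report(f"[{waited:.0f}s] job status: {status}")
            last = status
        if status in terminal:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Once the function returns, the full logs can be fetched in one piece, which matches the current behaviour described above.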

MelissaFlinn · 2025-07-18T12:25:08Z

I have some doc edits which I will make in a separate PR.

ChughShilpa reviewed Jun 26, 2025

abhijeet-dhumal approved these changes Jun 30, 2025

RHRolun reviewed Jun 30, 2025

sutaakar force-pushed the adjustments branch from e641ca3 to 414668b Compare June 30, 2025 14:21

Fiona-Waters reviewed Jul 1, 2025

Fiona-Waters reviewed Jul 2, 2025

RHRolun and others added 2 commits July 2, 2025 12:15

Add KFTO Distributed training notebook and scripts

ffecf4e

Signed-off-by: Karel Suta <ksuta@redhat.com>

Apply suggestions from code review

28b2817

Co-authored-by: Fiona Waters <fiwaters6@gmail.com>

sutaakar force-pushed the adjustments branch from 1bd405e to 28b2817 Compare July 2, 2025 10:15

Addressed feedback from PR

9597e95

sutaakar force-pushed the adjustments branch from 9b938d1 to 9597e95 Compare July 15, 2025 07:53

Adjust KFTO PyTorchJob waiting to be more interactive

ab3fbee

RHRolun merged commit 36d216f into rh-aiservices-bu:main Jul 18, 2025

sutaakar deleted the adjustments branch July 21, 2025 11:25

6 participants
