
Kubeflow Trainer


Overview

Kubeflow Trainer is a Kubernetes-native project designed for fine-tuning large language models (LLMs) and for scalable, distributed training of machine learning (ML) models across various frameworks, including PyTorch, JAX, TensorFlow, and others.

You can integrate other ML libraries such as HuggingFace, DeepSpeed, or Megatron-LM with Kubeflow Trainer to orchestrate their ML training on Kubernetes.

Kubeflow Trainer allows you to effortlessly develop your LLMs with the Kubeflow Python SDK and to build Kubernetes-native Training Runtimes with the Kubernetes Custom Resources APIs. A minimal usage sketch follows below.
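For illustration, the sketch below shows how submitting a distributed training job from the Kubeflow Python SDK might look. It assumes an SDK surface along the lines described in the Kubeflow Trainer documentation (a `TrainerClient` with a `train()` method and a `CustomTrainer` wrapper, plus a pre-installed `torch-distributed` Training Runtime); the exact class names, parameters, and runtime name are assumptions and may differ between SDK versions.

```python
# Minimal sketch, assuming the Kubeflow Python SDK exposes TrainerClient and
# CustomTrainer as in the Kubeflow Trainer docs. Names and parameters are
# illustrative assumptions, not a definitive API reference.
from kubeflow.trainer import TrainerClient, CustomTrainer


def train_func():
    # User-defined training function; Kubeflow Trainer runs it on every node
    # of the TrainJob.
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")


client = TrainerClient()

# Submit a TrainJob that references an installed Training Runtime
# (assumed here to be named "torch-distributed").
job_name = client.train(
    trainer=CustomTrainer(
        func=train_func,
        num_nodes=2,
        resources_per_node={"cpu": "4", "memory": "8Gi"},
    ),
    runtime=client.get_runtime("torch-distributed"),
)
print(f"Created TrainJob: {job_name}")
```

Conceptually, the SDK creates a TrainJob custom resource that references a Training Runtime, and the Kubeflow Trainer controller reconciles it into the distributed training pods on the cluster.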

Kubeflow Trainer Introduction

The following KubeCon + CloudNativeCon 2024 talk provides an overview of Kubeflow Trainer capabilities:

Kubeflow Trainer

Getting Started

Please check the official Kubeflow documentation to install and get started with Kubeflow Trainer.

Community

The following links provide information on how to get involved in the community:

Contributing

Please refer to the CONTRIBUTING guide.

Changelog

Please refer to the CHANGELOG.

Kubeflow Training Operator V1

The Kubeflow Trainer project is currently in alpha status, and its APIs may change. If you are using Kubeflow Training Operator V1, please refer to this migration document.

The Kubeflow community will maintain the Training Operator V1 source code in the release-1.9 branch.

You can find the documentation for Kubeflow Training Operator V1 in these guides.

Acknowledgement

This project was originally started as a distributed training operator for TensorFlow. Later, we merged efforts from the other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.
