
Kubeflow Trainer


Overview

Kubeflow Trainer is a Kubernetes-native project designed for fine-tuning large language models (LLMs) and for scalable, distributed training of machine learning (ML) models across various frameworks, including PyTorch, JAX, TensorFlow, and others.

You can integrate other ML libraries such as HuggingFace, DeepSpeed, or Megatron-LM with Kubeflow Trainer to orchestrate their ML training on Kubernetes.

Kubeflow Trainer allows you to effortlessly develop your LLMs with the Kubeflow Python SDK and to build Kubernetes-native Training Runtimes with the Kubernetes Custom Resources APIs. A minimal usage sketch follows below.
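For illustration, the sketch below shows how submitting a distributed training job from the Kubeflow Python SDK might look. It assumes an SDK surface along the lines described in the Kubeflow Trainer documentation (a `TrainerClient` with a `train()` method and a `CustomTrainer` wrapper, plus a pre-installed `torch-distributed` Training Runtime); the exact class names, parameters, and runtime name are assumptions and may differ between SDK versions.

```python
# Minimal sketch, assuming the Kubeflow Python SDK exposes TrainerClient and
# CustomTrainer as in the Kubeflow Trainer docs. Names and parameters are
# illustrative assumptions, not a definitive API reference.
from kubeflow.trainer import TrainerClient, CustomTrainer


def train_func():
    # User-defined training function; Kubeflow Trainer runs it on every node
    # of the TrainJob.
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")


client = TrainerClient()

# Submit a TrainJob that references an installed Training Runtime
# (assumed here to be named "torch-distributed").
job_name = client.train(
    trainer=CustomTrainer(
        func=train_func,
        num_nodes=2,
        resources_per_node={"cpu": "4", "memory": "8Gi"},
    ),
    runtime=client.get_runtime("torch-distributed"),
)
print(f"Created TrainJob: {job_name}")
```

Conceptually, the SDK creates a TrainJob custom resource that references a Training Runtime, and the Kubeflow Trainer controller reconciles it into the distributed training pods on the cluster.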

Kubeflow Trainer Introduction

The following KubeCon + CloudNativeCon 2024 talk provides an overview of Kubeflow Trainer capabilities:

Kubeflow Trainer

Getting Started

Please check the official Kubeflow documentation to install and get started with Kubeflow Trainer.

Community

The following links provide information on how to get involved in the community:

Contributing

Please refer to the CONTRIBUTING guide.

Changelog

Please refer to the CHANGELOG.

Kubeflow Training Operator V1

The Kubeflow Trainer project is currently in alpha status, and its APIs may change. If you are using Kubeflow Training Operator V1, please refer to this migration document.

The Kubeflow community will maintain the Training Operator V1 source code in the release-1.9 branch.

You can find the documentation for Kubeflow Training Operator V1 in these guides.

Acknowledgement

This project was originally started as a distributed training operator for TensorFlow. Later, we merged efforts from the other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.
