
pytorch-optimizer


pytorch-optimizer is a production-focused optimization toolkit for PyTorch with 100+ optimizers, 10+ learning rate schedulers, and 10+ loss functions behind a consistent API.

Use it when you want fast experimentation with modern training methods without rewriting optimizer boilerplate.

Highly inspired by jettify/pytorch-optimizer.

Why pytorch-optimizer

  • Broad optimizer coverage, including many recent research variants.
  • Consistent loader APIs for optimizers, schedulers, and losses.
  • Practical features such as foreach, Lookahead, and Gradient Centralization integrations.
  • Tested and actively maintained codebase.
  • Works with optional ecosystem integrations like bitsandbytes, q-galore-torch, and torchao.
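Lookahead, mentioned above, wraps any inner optimizer: fast weights take k steps, then slow weights move a fraction alpha toward them ("k steps forward, 1 step back"). A dependency-free sketch of that idea on a scalar problem, purely illustrative and not the package's implementation:

```python
# Hand-rolled sketch of the Lookahead idea: fast weights take k inner
# steps, then the slow weights move a fraction alpha toward them and
# the fast weights are reset. Plain floats only -- not the real API.

def lookahead_run(grad_fn, w0, lr=0.1, k=5, alpha=0.5, cycles=6):
    slow = fast = w0
    for _ in range(cycles):
        for _ in range(k):                 # k fast (inner) steps of plain SGD
            fast -= lr * grad_fn(fast)
        slow += alpha * (fast - slow)      # 1 slow (outer) step
        fast = slow                        # reset fast weights to slow
    return slow

# Minimizing f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
final_w = lookahead_run(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Because the outer step only interpolates between slow and fast weights, Lookahead can wrap any inner optimizer, which is why the package exposes it as a flag (`use_lookahead`) rather than a standalone class only.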

Installation

Requirements:

  • Python >=3.8
  • PyTorch >=1.10

pip install pytorch-optimizer

Optional ecosystem integrations such as bitsandbytes, q-galore-torch, and torchao are not installed by default; install them separately if you need them.

Quick Start

1) Use an optimizer class directly

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters(), lr=1e-3)

2) Load by name

from pytorch_optimizer import load_optimizer

model = YourModel()
optimizer = load_optimizer('adamp')(model.parameters(), lr=1e-3)

3) Build with create_optimizer()

from pytorch_optimizer import create_optimizer

model = YourModel()
optimizer = create_optimizer(
    model,
    optimizer_name='adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)

4) Optional: load via torch.hub

import torch

model = YourModel()
opt_cls = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt_cls(model.parameters(), lr=1e-3)
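However constructed, every optimizer exposes PyTorch's standard zero_grad()/step() protocol, so the training loop is the same for all of them. A dependency-free sketch of that loop shape, using a hypothetical ToySGD stand-in on a scalar quadratic rather than a real model:

```python
# ToySGD is a hypothetical stand-in that mimics the optimizer protocol
# (zero_grad, then backward fills gradients, then step). It is NOT part
# of pytorch-optimizer; it only shows the loop shape on plain floats.

class ToySGD:
    def __init__(self, lr=0.1):
        self.lr = lr
        self.w = 0.0      # the single "parameter"
        self.grad = 0.0   # its accumulated gradient

    def zero_grad(self):
        self.grad = 0.0

    def step(self):
        self.w -= self.lr * self.grad

optimizer = ToySGD(lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    # stand-in for loss.backward(): gradient of f(w) = (w - 2)^2
    optimizer.grad = 2.0 * (optimizer.w - 2.0)
    optimizer.step()
```

With a real model the middle line becomes `loss.backward()`, and the rest of the loop is unchanged regardless of which of the 100+ optimizers you constructed.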

Discover Available Components

Optimizers

from pytorch_optimizer import get_supported_optimizers

all_optimizers = get_supported_optimizers()
adam_family = get_supported_optimizers('adam*')
selected = get_supported_optimizers(['adam*', 'ranger*'])

Learning Rate Schedulers

from pytorch_optimizer import get_supported_lr_schedulers

all_schedulers = get_supported_lr_schedulers()
cosine_like = get_supported_lr_schedulers('cosine*')

Loss Functions

from pytorch_optimizer import get_supported_loss_functions

all_losses = get_supported_loss_functions()
focal_related = get_supported_loss_functions('*focal*')

Supported Optimizers

You can list the supported optimizers with the code below.

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()

You can also search them with one or more wildcard filters.

from pytorch_optimizer import get_supported_optimizers

get_supported_optimizers('adam*')
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw']

get_supported_optimizers(['adam*', 'ranger*'])
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw', 'ranger', 'ranger21']
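The filter strings use shell-style glob patterns. The matching semantics can be illustrated with the standard library's fnmatch (an illustration of the pattern behavior only, not the package's internals; the name list here is a small hand-picked subset):

```python
from fnmatch import fnmatch

# '*' matches any run of characters, as in shell globbing.
names = ['adamp', 'adamw', 'ranger', 'ranger21', 'lion', 'sgdw']

adam_family = [n for n in names if fnmatch(n, 'adam*')]
# ['adamp', 'adamw']

multi = [n for n in names if any(fnmatch(n, p) for p in ('adam*', 'ranger*'))]
# ['adamp', 'adamw', 'ranger', 'ranger21']
```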
Optimizer Description Official Code Paper(Citation)
AdaBelief Adapting Step-sizes by the Belief in Observed Gradients github paper(cite)
AdaBound Adaptive Gradient Methods with Dynamic Bound of Learning Rate github paper(cite)
AdaHessian An Adaptive Second Order Optimizer for Machine Learning github paper(cite)
AdamD Improved bias-correction in Adam paper(cite)
AdamP Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights github paper(cite)
diffGrad An Optimization Method for Convolutional Neural Networks github paper(cite)
MADGRAD A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization github paper(cite)
RAdam On the Variance of the Adaptive Learning Rate and Beyond github paper(cite)
Ranger a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer github paper(cite)
Ranger21 a synergistic deep learning optimizer github paper(cite)
Lamb Large Batch Optimization for Deep Learning github paper(cite)
Shampoo Preconditioned Stochastic Tensor Optimization github paper(cite)
Nero Learning by Turning: Neural Architecture Aware Optimisation github paper(cite)
Adan Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models github paper(cite)
Adai Disentangling the Effects of Adaptive Learning Rate and Momentum github paper(cite)
SAM Sharpness-Aware Minimization github paper(cite)
ASAM Adaptive Sharpness-Aware Minimization github paper(cite)
GSAM Surrogate Gap Guided Sharpness-Aware Minimization github paper(cite)
D-Adaptation Learning-Rate-Free Learning by D-Adaptation github paper(cite)
AdaFactor Adaptive Learning Rates with Sublinear Memory Cost github paper(cite)
Apollo An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization github paper(cite)
NovoGrad Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks github paper(cite)
Lion Symbolic Discovery of Optimization Algorithms github paper(cite)
Ali-G Adaptive Learning Rates for Interpolation with Gradients github paper(cite)
SM3 Memory-Efficient Adaptive Optimization github paper(cite)
AdaNorm Adaptive Gradient Norm Correction based Optimizer for CNNs github paper(cite)
RotoGrad Gradient Homogenization in Multitask Learning github paper(cite)
A2Grad Optimal Adaptive and Accelerated Stochastic Gradient Descent github paper(cite)
AccSGD Accelerating Stochastic Gradient Descent For Least Squares Regression github paper(cite)
SGDW Decoupled Weight Decay Regularization github paper(cite)
ASGD Adaptive Gradient Descent without Descent github paper(cite)
Yogi Adaptive Methods for Nonconvex Optimization paper(cite)
SWATS Improving Generalization Performance by Switching from Adam to SGD paper(cite)
Fromage On the distance between two neural networks and the stability of learning github paper(cite)
MSVAG Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients github paper(cite)
AdaMod An Adaptive and Momental Bound Method for Stochastic Learning github paper(cite)
AggMo Aggregated Momentum: Stability Through Passive Damping github paper(cite)
QHAdam Quasi-hyperbolic momentum and Adam for deep learning github paper(cite)
PID A PID Controller Approach for Stochastic Optimization of Deep Networks github paper(cite)
Gravity a Kinematic Approach on Optimization in Deep Learning github paper(cite)
AdaSmooth An Adaptive Learning Rate Method based on Effective Ratio paper(cite)
SRMM Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates github paper(cite)
AvaGrad Domain-independent Dominance of Adaptive Methods github paper(cite)
PCGrad Gradient Surgery for Multi-Task Learning github paper(cite)
AMSGrad On the Convergence of Adam and Beyond paper(cite)
Lookahead k steps forward, 1 step back github paper(cite)
PNM Manipulating Stochastic Gradient Noise to Improve Generalization github paper(cite)
GC Gradient Centralization github paper(cite)
AGC Adaptive Gradient Clipping github paper(cite)
Stable WD Understanding and Scheduling Weight Decay github paper(cite)
Softplus T Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM paper(cite)
Un-tuned w/u On the adequacy of untuned warmup for adaptive optimization paper(cite)
Norm Loss An efficient yet effective regularization method for deep neural networks paper(cite)
AdaShift Decorrelation and Convergence of Adaptive Learning Rate Methods github paper(cite)
AdaDelta An Adaptive Learning Rate Method paper(cite)
Amos An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale github paper(cite)
SignSGD Compressed Optimisation for Non-Convex Problems github paper(cite)
Sophia A Scalable Stochastic Second-order Optimizer for Language Model Pre-training github paper(cite)
Prodigy An Expeditiously Adaptive Parameter-Free Learner github paper(cite)
PAdam Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks github paper(cite)
LOMO Full Parameter Fine-tuning for Large Language Models with Limited Resources github paper(cite)
AdaLOMO Low-memory Optimization with Adaptive Learning Rate github paper(cite)
Tiger A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious github cite
CAME Confidence-guided Adaptive Memory Efficient Optimization github paper(cite)
WSAM Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term github paper(cite)
Aida A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range github paper(cite)
GaLore Memory-Efficient LLM Training by Gradient Low-Rank Projection github paper(cite)
Adalite Adalite optimizer github paper(cite)
bSAM SAM as an Optimal Relaxation of Bayes github paper(cite)
Schedule-Free Schedule-Free Optimizers github paper(cite)
FAdam Adam is a natural gradient optimizer using diagonal empirical Fisher information github paper(cite)
Grokfast Accelerated Grokking by Amplifying Slow Gradients github paper(cite)
Kate Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad github paper(cite)
StableAdamW Stable and low-precision training for large-scale vision-language models paper(cite)
AdamMini Use Fewer Learning Rates To Gain More github paper(cite)
TRAC Adaptive Parameter-free Optimization github paper(cite)
AdamG Towards Stability of Parameter-free Optimization paper(cite)
AdEMAMix Better, Faster, Older github paper(cite)
SOAP Improving and Stabilizing Shampoo using Adam github paper(cite)
ADOPT Modified Adam Can Converge with Any β2 with the Optimal Rate github paper(cite)
FTRL Follow The Regularized Leader paper
Cautious Improving Training with One Line of Code github paper(cite)
DeMo Decoupled Momentum Optimization github paper(cite)
MicroAdam Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence github paper(cite)
Muon MomentUm Orthogonalized by Newton-schulz github paper(cite)
LaProp Separating Momentum and Adaptivity in Adam github paper(cite)
APOLLO SGD-like Memory, AdamW-level Performance github paper(cite)
MARS Unleashing the Power of Variance Reduction for Training Large Models github paper(cite)
SGDSaI No More Adam: Learning Rate Scaling at Initialization is All You Need github paper(cite)
Grams Gradient Descent with Adaptive Momentum Scaling paper(cite)
OrthoGrad Grokking at the Edge of Numerical Stability github paper(cite)
Adam-ATAN2 Scaling Exponents Across Parameterizations and Optimizers paper(cite)
SPAM Spike-Aware Adam with Momentum Reset for Stable LLM Training github paper(cite)
TAM Torque-Aware Momentum paper(cite)
FOCUS First Order Concentrated Updating Scheme github paper(cite)
PSGD Preconditioned Stochastic Gradient Descent github paper(cite)
EXAdam The Power of Adaptive Cross-Moments github paper(cite)
GCSAM Gradient Centralized Sharpness Aware Minimization github paper(cite)
LookSAM Towards Efficient and Scalable Sharpness-Aware Minimization github paper(cite)
SCION Training Deep Learning Models with Norm-Constrained LMOs github paper(cite)
COSMOS SOAP with Muon github
StableSPAM How to Train in 4-Bit More Stably than 16-Bit Adam github paper
AdaGC Improving Training Stability for Large Language Model Pretraining paper(cite)
Simplified-Ademamix Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants github paper(cite)
Fira Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? github paper(cite)
RACS & Alice Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension paper(cite)
VSGD Variational Stochastic Gradient Descent for Deep Neural Networks github paper(cite)
SNSM Subset-Norm and Subspace-Momentum: Faster Memory-Efficient Adaptive Optimization with Convergence Guarantees github paper(cite)
AdamC Why Gradients Rapidly Increase Near the End of Training paper(cite)
AdaMuon Adaptive Muon Optimizer paper(cite)
SPlus A Stable Whitening Optimizer for Efficient Neural Network Training github paper(cite)
EmoNavi An emotion-driven optimizer that feels loss and navigates accordingly github
Refined Schedule-Free Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training paper(cite)
FriendlySAM Friendly Sharpness-Aware Minimization github paper(cite)
AdaGO AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates paper(cite)
Conda Column-Normalized Adam for Training Large Language Models Faster github paper(cite)
BCOS Stochastic Approximation with Block Coordinate Optimal Stepsizes github paper(cite)
Cautious WD Cautious Weight Decay paper(cite)
Ano Faster is Better in Noisy Landscape github paper(cite)
Spectral Sphere Controlled LLM Training on Spectral Sphere github paper(cite)

Supported LR Schedulers

You can list the supported learning rate schedulers with the code below.

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()

You can also search them with one or more wildcard filters.

from pytorch_optimizer import get_supported_lr_schedulers

get_supported_lr_schedulers('cosine*')
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup']

get_supported_lr_schedulers(['cosine*', '*warm*'])
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup', 'warmup_stable_decay']
LR Scheduler Description Official Code Paper(Citation)
Explore-Exploit Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule paper(cite)
Chebyshev Acceleration via Fractal Learning Rate Schedules paper(cite)
REX Revisiting Budgeted Training with an Improved Schedule github paper(cite)
WSD Warmup-Stable-Decay learning rate scheduler github paper(cite)
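The cosine-with-warmup shape listed above fits in a few lines of arithmetic: the learning rate ramps linearly to its peak over the warmup steps, then follows a half cosine down to a floor. A dependency-free sketch, not the package's implementation (the function name and signature are illustrative):

```python
import math

def cosine_with_warmup(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # linear ramp: reaches max_lr at the end of warmup
        return max_lr * (step + 1) / warmup_steps
    # progress in [0, 1] over the decay phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

At `step == warmup_steps` the cosine term is 1 and the rate equals `max_lr`; at `step == total_steps` it is -1 and the rate equals `min_lr`.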

Supported Loss Functions

You can list the supported loss functions with the code below.

from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()

You can also search them with one or more wildcard filters.

from pytorch_optimizer import get_supported_loss_functions

get_supported_loss_functions('*focal*')
# ['bcefocalloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

get_supported_loss_functions(['*focal*', 'bce*'])
# ['bcefocalloss', 'bceloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']
Loss Functions Description Official Code Paper(Citation)
Label Smoothing Rethinking the Inception Architecture for Computer Vision paper(cite)
Focal Focal Loss for Dense Object Detection paper(cite)
Focal Cosine Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble paper(cite)
LDAM Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss github paper(cite)
Jaccard (IOU) IoU Loss for 2D/3D Object Detection paper(cite)
Bi-Tempered The Principle of Unchanged Optimality in Reinforcement Learning Generalization paper(cite)
Tversky Tversky loss function for image segmentation using 3D fully convolutional deep networks paper(cite)
Lovasz Hinge A tractable surrogate for the optimization of the intersection-over-union measure in neural networks github paper(cite)
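The focal loss listed above down-weights well-classified examples by scaling cross-entropy with a factor of (1 - p_t)^gamma. A scalar sketch of the binary case, purely illustrative and not the package's implementation:

```python
import math

def binary_focal_loss(p, target, gamma=2.0):
    """Focal loss for one binary prediction p = P(class 1).

    With gamma = 0 this reduces to ordinary binary cross-entropy;
    larger gamma shrinks the loss on confident, correct predictions.
    """
    p_t = p if target == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = binary_focal_loss(0.95, target=1)   # confident and correct: small loss
hard = binary_focal_loss(0.10, target=1)   # confident and wrong: large loss
```

The asymmetry between `easy` and `hard` is the point of the loss: gradient signal concentrates on the misclassified, hard examples.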

Documentation

License Notes

Most implementations come from sources under MIT- or Apache-2.0-compatible terms. Some algorithms (for example, Fromage and Nero) are tied to CC BY-NC-SA 4.0, which prohibits commercial use. Verify the license of each optimizer before production or commercial use.

Contributing and Community

Citation

Please cite the original authors of any specific algorithms you use. If you use this repository itself, use the citation metadata in CITATION.cff or GitHub's "Cite this repository" button.

@software{Kim_pytorch_optimizer_optimizer_2021,
  author = {Kim, Hyeongchan},
  title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
  url = {https://github.com/kozistr/pytorch_optimizer},
  year = {2021}
}

Maintainer

Hyeongchan Kim / @kozistr