pytorch-optimizer is a production-focused optimization toolkit for PyTorch with 100+ optimizers, 10+ learning rate schedulers, and 10+ loss functions behind a consistent API.
Use it when you want fast experimentation with modern training methods without rewriting optimizer boilerplate.
Highly inspired by jettify/pytorch-optimizer.
- Broad optimizer coverage, including many recent research variants.
- Consistent loader APIs for optimizers, schedulers, and losses.
- Practical features such as `foreach`, Lookahead, and Gradient Centralization integrations.
- Tested and actively maintained codebase.
- Works with optional ecosystem integrations like `bitsandbytes`, `q-galore-torch`, and `torchao`.
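To make the Lookahead mechanic mentioned above concrete: an inner ("fast") optimizer takes `k` steps, after which the outer ("slow") weights are pulled toward the fast weights by an interpolation factor `alpha`. Below is a minimal pure-Python sketch of that update rule on a single scalar weight (the function name and SGD inner loop are illustrative; the library's actual `Lookahead` wrapper operates on `torch` parameter groups):

```python
def lookahead_sgd(w0, grads, lr=0.1, k=5, alpha=0.5):
    """Sketch of the Lookahead update rule on one scalar weight.

    `grads` is a pre-computed list of gradients, one per inner step.
    Every k inner SGD steps, the slow weight moves toward the fast
    weight by a factor of alpha, and the fast weight is reset to it.
    """
    slow = fast = w0
    for step, g in enumerate(grads, start=1):
        fast -= lr * g                    # inner ("fast") SGD step
        if step % k == 0:                 # outer ("slow") sync point
            slow += alpha * (fast - slow)
            fast = slow
    return slow

# With constant gradient 1.0 and lr=0.1 from w0=1.0, five fast steps
# reach 0.5, and the slow weight moves halfway there: about 0.75.
print(lookahead_sgd(1.0, [1.0] * 5))
```

The point of the interpolation is that the slow weights average out the inner optimizer's oscillations, which is what gives Lookahead its stabilizing effect.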
Requirements:
- Python >= 3.8
- PyTorch >= 1.10

Install from PyPI:

```bash
pip install pytorch-optimizer
```

Optional integrations are not installed by default:
- bitsandbytes: https://github.com/TimDettmers/bitsandbytes?tab=readme-ov-file#tldr
- q-galore-torch: https://github.com/VITA-Group/Q-GaLore?tab=readme-ov-file#install-q-galore-optimizer
- torchao: https://github.com/pytorch/ao?tab=readme-ov-file#installation
```python
from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters(), lr=1e-3)
```

```python
from pytorch_optimizer import load_optimizer

model = YourModel()
optimizer = load_optimizer('adamp')(model.parameters(), lr=1e-3)
```

```python
from pytorch_optimizer import create_optimizer

model = YourModel()
optimizer = create_optimizer(
    model,
    optimizer_name='adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)
```

```python
import torch

model = YourModel()
opt_cls = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt_cls(model.parameters(), lr=1e-3)
```

```python
from pytorch_optimizer import get_supported_optimizers

all_optimizers = get_supported_optimizers()
adam_family = get_supported_optimizers('adam*')
selected = get_supported_optimizers(['adam*', 'ranger*'])
```

```python
from pytorch_optimizer import get_supported_lr_schedulers

all_schedulers = get_supported_lr_schedulers()
cosine_like = get_supported_lr_schedulers('cosine*')
```

```python
from pytorch_optimizer import get_supported_loss_functions

all_losses = get_supported_loss_functions()
focal_related = get_supported_loss_functions('*focal*')
```

You can check the supported optimizers with the code below.
```python
from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()
```

You can also search them with the filter(s).

```python
from pytorch_optimizer import get_supported_optimizers

get_supported_optimizers('adam*')
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw']

get_supported_optimizers(['adam*', 'ranger*'])
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw', 'ranger', 'ranger21']
```

| Optimizer | Description | Official Code | Paper(Citation) |
|---|---|---|---|
| AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | paper(cite) |
| AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | paper(cite) |
| AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | paper(cite) |
| AdamD | Improved bias-correction in Adam | paper(cite) | |
| AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | paper(cite) |
| diffGrad | An Optimization Method for Convolutional Neural Networks | github | paper(cite) |
| MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | github | paper(cite) |
| RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | paper(cite) |
| Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | paper(cite) |
| Ranger21 | a synergistic deep learning optimizer | github | paper(cite) |
| Lamb | Large Batch Optimization for Deep Learning | github | paper(cite) |
| Shampoo | Preconditioned Stochastic Tensor Optimization | github | paper(cite) |
| Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | paper(cite) |
| Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | paper(cite) |
| Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | paper(cite) |
| SAM | Sharpness-Aware Minimization | github | paper(cite) |
| ASAM | Adaptive Sharpness-Aware Minimization | github | paper(cite) |
| GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | paper(cite) |
| D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | paper(cite) |
| AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | paper(cite) |
| Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | paper(cite) |
| NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | paper(cite) |
| Lion | Symbolic Discovery of Optimization Algorithms | github | paper(cite) |
| Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | paper(cite) |
| SM3 | Memory-Efficient Adaptive Optimization | github | paper(cite) |
| AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | paper(cite) |
| RotoGrad | Gradient Homogenization in Multitask Learning | github | paper(cite) |
| A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | paper(cite) |
| AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | paper(cite) |
| SGDW | Decoupled Weight Decay Regularization | github | paper(cite) |
| ASGD | Adaptive Gradient Descent without Descent | github | paper(cite) |
| Yogi | Adaptive Methods for Nonconvex Optimization | paper(cite) | |
| SWATS | Improving Generalization Performance by Switching from Adam to SGD | paper(cite) | |
| Fromage | On the distance between two neural networks and the stability of learning | github | paper(cite) |
| MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | paper(cite) |
| AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | paper(cite) |
| AggMo | Aggregated Momentum: Stability Through Passive Damping | github | paper(cite) |
| QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | paper(cite) |
| PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | paper(cite) |
| Gravity | a Kinematic Approach on Optimization in Deep Learning | github | paper(cite) |
| AdaSmooth | An Adaptive Learning Rate Method based on Effective Ratio | paper(cite) | |
| SRMM | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | github | paper(cite) |
| AvaGrad | Domain-independent Dominance of Adaptive Methods | github | paper(cite) |
| PCGrad | Gradient Surgery for Multi-Task Learning | github | paper(cite) |
| AMSGrad | On the Convergence of Adam and Beyond | paper(cite) | |
| Lookahead | k steps forward, 1 step back | github | paper(cite) |
| PNM | Manipulating Stochastic Gradient Noise to Improve Generalization | github | paper(cite) |
| GC | Gradient Centralization | github | paper(cite) |
| AGC | Adaptive Gradient Clipping | github | paper(cite) |
| Stable WD | Understanding and Scheduling Weight Decay | github | paper(cite) |
| Softplus T | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | paper(cite) | |
| Un-tuned w/u | On the adequacy of untuned warmup for adaptive optimization | paper(cite) | |
| Norm Loss | An efficient yet effective regularization method for deep neural networks | paper(cite) | |
| AdaShift | Decorrelation and Convergence of Adaptive Learning Rate Methods | github | paper(cite) |
| AdaDelta | An Adaptive Learning Rate Method | paper(cite) | |
| Amos | An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | github | paper(cite) |
| SignSGD | Compressed Optimisation for Non-Convex Problems | github | paper(cite) |
| Sophia | A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | github | paper(cite) |
| Prodigy | An Expeditiously Adaptive Parameter-Free Learner | github | paper(cite) |
| PAdam | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | github | paper(cite) |
| LOMO | Full Parameter Fine-tuning for Large Language Models with Limited Resources | github | paper(cite) |
| AdaLOMO | Low-memory Optimization with Adaptive Learning Rate | github | paper(cite) |
| Tiger | A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious | github | cite |
| CAME | Confidence-guided Adaptive Memory Efficient Optimization | github | paper(cite) |
| WSAM | Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | github | paper(cite) |
| Aida | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | github | paper(cite) |
| GaLore | Memory-Efficient LLM Training by Gradient Low-Rank Projection | github | paper(cite) |
| Adalite | Adalite optimizer | github | paper(cite) |
| bSAM | SAM as an Optimal Relaxation of Bayes | github | paper(cite) |
| Schedule-Free | Schedule-Free Optimizers | github | paper(cite) |
| FAdam | Adam is a natural gradient optimizer using diagonal empirical Fisher information | github | paper(cite) |
| Grokfast | Accelerated Grokking by Amplifying Slow Gradients | github | paper(cite) |
| Kate | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | github | paper(cite) |
| StableAdamW | Stable and low-precision training for large-scale vision-language models | paper(cite) | |
| AdamMini | Use Fewer Learning Rates To Gain More | github | paper(cite) |
| TRAC | Adaptive Parameter-free Optimization | github | paper(cite) |
| AdamG | Towards Stability of Parameter-free Optimization | paper(cite) | |
| AdEMAMix | Better, Faster, Older | github | paper(cite) |
| SOAP | Improving and Stabilizing Shampoo using Adam | github | paper(cite) |
| ADOPT | Modified Adam Can Converge with Any β2 with the Optimal Rate | github | paper(cite) |
| FTRL | Follow The Regularized Leader | paper | |
| Cautious | Improving Training with One Line of Code | github | paper(cite) |
| DeMo | Decoupled Momentum Optimization | github | paper(cite) |
| MicroAdam | Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | github | paper(cite) |
| Muon | MomentUm Orthogonalized by Newton-schulz | github | paper(cite) |
| LaProp | Separating Momentum and Adaptivity in Adam | github | paper(cite) |
| APOLLO | SGD-like Memory, AdamW-level Performance | github | paper(cite) |
| MARS | Unleashing the Power of Variance Reduction for Training Large Models | github | paper(cite) |
| SGDSaI | No More Adam: Learning Rate Scaling at Initialization is All You Need | github | paper(cite) |
| Grams | Gradient Descent with Adaptive Momentum Scaling | paper(cite) | |
| OrthoGrad | Grokking at the Edge of Numerical Stability | github | paper(cite) |
| Adam-ATAN2 | Scaling Exponents Across Parameterizations and Optimizers | paper(cite) | |
| SPAM | Spike-Aware Adam with Momentum Reset for Stable LLM Training | github | paper(cite) |
| TAM | Torque-Aware Momentum | paper(cite) | |
| FOCUS | First Order Concentrated Updating Scheme | github | paper(cite) |
| PSGD | Preconditioned Stochastic Gradient Descent | github | paper(cite) |
| EXAdam | The Power of Adaptive Cross-Moments | github | paper(cite) |
| GCSAM | Gradient Centralized Sharpness Aware Minimization | github | paper(cite) |
| LookSAM | Towards Efficient and Scalable Sharpness-Aware Minimization | github | paper(cite) |
| SCION | Training Deep Learning Models with Norm-Constrained LMOs | github | paper(cite) |
| COSMOS | SOAP with Muon | github | |
| StableSPAM | How to Train in 4-Bit More Stably than 16-Bit Adam | github | paper |
| AdaGC | Improving Training Stability for Large Language Model Pretraining | paper(cite) | |
| Simplified-Ademamix | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | github | paper(cite) |
| Fira | Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? | github | paper(cite) |
| RACS & Alice | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | paper(cite) | |
| VSGD | Variational Stochastic Gradient Descent for Deep Neural Networks | github | paper(cite) |
| SNSM | Subset-Norm and Subspace-Momentum: Faster Memory-Efficient Adaptive Optimization with Convergence Guarantees | github | paper(cite) |
| AdamC | Why Gradients Rapidly Increase Near the End of Training | paper(cite) | |
| AdaMuon | Adaptive Muon Optimizer | paper(cite) | |
| SPlus | A Stable Whitening Optimizer for Efficient Neural Network Training | github | paper(cite) |
| EmoNavi | An emotion-driven optimizer that feels loss and navigates accordingly | github | |
| Refined Schedule-Free | Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training | paper(cite) | |
| FriendlySAM | Friendly Sharpness-Aware Minimization | github | paper(cite) |
| AdaGO | AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates | paper(cite) | |
| Conda | Column-Normalized Adam for Training Large Language Models Faster | github | paper(cite) |
| BCOS | Stochastic Approximation with Block Coordinate Optimal Stepsizes | github | paper(cite) |
| Cautious WD | Cautious Weight Decay | paper(cite) | |
| Ano | Faster is Better in Noisy Landscape | github | paper(cite) |
| Spectral Sphere | Controlled LLM Training on Spectral Sphere | github | paper(cite) |
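The wildcard filters accepted by the `get_supported_*` helpers behave like shell-style glob patterns. A rough approximation of that matching using only the standard library's `fnmatch` (for illustration; the library's internal implementation may differ):

```python
from fnmatch import fnmatch

def filter_names(names, filters):
    """Return the sorted names matching any shell-style glob pattern."""
    if isinstance(filters, str):
        filters = [filters]
    return sorted(n for n in names if any(fnmatch(n, p) for p in filters))

names = ['adamp', 'adamw', 'ranger', 'ranger21', 'lion', 'sophia']
print(filter_names(names, 'adam*'))               # ['adamp', 'adamw']
print(filter_names(names, ['adam*', 'ranger*']))  # adds 'ranger', 'ranger21'
```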
You can check the supported learning rate schedulers with the code below.

```python
from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()
```

You can also search them with the filter(s).

```python
from pytorch_optimizer import get_supported_lr_schedulers

get_supported_lr_schedulers('cosine*')
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup']

get_supported_lr_schedulers(['cosine*', '*warm*'])
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup', 'warmup_stable_decay']
```

| LR Scheduler | Description | Official Code | Paper(Citation) |
|---|---|---|---|
| Explore-Exploit | Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | paper(cite) | |
| Chebyshev | Acceleration via Fractal Learning Rate Schedules | paper(cite) | |
| REX | Revisiting Budgeted Training with an Improved Schedule | github | paper(cite) |
| WSD | Warmup-Stable-Decay learning rate scheduler | github | paper(cite) |
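To make the schedule shapes concrete, here is a back-of-the-envelope sketch of a warmup-stable-decay (WSD) curve like the one the table's last row describes: linear warmup, constant plateau, then linear decay. The function and its parameters are illustrative approximations, not the library's `warmup_stable_decay` implementation:

```python
def wsd_lr(step, peak_lr=1e-3, warmup=100, stable=800, decay=100, min_lr=0.0):
    """Piecewise warmup-stable-decay learning rate for a given step."""
    if step < warmup:                        # linear warmup from 0
        return peak_lr * step / warmup
    if step < warmup + stable:               # constant plateau
        return peak_lr
    t = min(step - warmup - stable, decay)   # linear decay to min_lr
    return peak_lr + (min_lr - peak_lr) * t / decay

print(wsd_lr(50))    # halfway through warmup: 5e-4
print(wsd_lr(500))   # plateau: 1e-3
print(wsd_lr(950))   # halfway through decay: 5e-4
```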
You can check the supported loss functions with the code below.

```python
from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()
```

You can also search them with the filter(s).

```python
from pytorch_optimizer import get_supported_loss_functions

get_supported_loss_functions('*focal*')
# ['bcefocalloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

get_supported_loss_functions(['*focal*', 'bce*'])
# ['bcefocalloss', 'bceloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']
```

| Loss Functions | Description | Official Code | Paper(Citation) |
|---|---|---|---|
| Label Smoothing | Rethinking the Inception Architecture for Computer Vision | paper(cite) | |
| Focal | Focal Loss for Dense Object Detection | paper(cite) | |
| Focal Cosine | Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble | paper(cite) | |
| LDAM | Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss | github | paper(cite) |
| Jaccard (IOU) | IoU Loss for 2D/3D Object Detection | paper(cite) | |
| Bi-Tempered | The Principle of Unchanged Optimality in Reinforcement Learning Generalization | paper(cite) | |
| Tversky | Tversky loss function for image segmentation using 3D fully convolutional deep networks | paper(cite) | |
| Lovasz Hinge | A tractable surrogate for the optimization of the intersection-over-union measure in neural networks | github | paper(cite) |
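For intuition on the focal loss listed above: it down-weights well-classified examples by scaling cross-entropy with a `(1 - p_t)^gamma` factor. A minimal scalar sketch in pure Python for the binary case (the function name is illustrative and there is no torch dependency; the packaged `FocalLoss` operates on tensors):

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for a single binary prediction.

    p: predicted probability of the positive class; y: label in {0, 1}.
    Reduces to alpha-weighted cross-entropy when gamma == 0.
    """
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes far less than an error,
# which is how focal loss keeps easy examples from dominating training.
easy = binary_focal_loss(0.95, 1)   # tiny loss
hard = binary_focal_loss(0.05, 1)   # large loss
print(easy < hard)  # True
```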
- Stable docs: https://pytorch-optimizers.readthedocs.io/en/stable/
- Latest docs: https://pytorch-optimizers.readthedocs.io/en/latest/
- Optimizer API reference: https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/
- LR scheduler API reference: https://pytorch-optimizers.readthedocs.io/en/latest/lr_scheduler/
- Loss API reference: https://pytorch-optimizers.readthedocs.io/en/latest/loss/
- FAQ: https://pytorch-optimizers.readthedocs.io/en/latest/qa/
- Visualization examples: https://pytorch-optimizers.readthedocs.io/en/latest/visualization/
- Repository docs source: docs/optimizer.md, docs/lr_scheduler.md, docs/loss.md
Most implementations are available under MIT- or Apache 2.0-compatible terms from their original sources.
Some algorithms (for example, Fromage and Nero) are licensed under CC BY-NC-SA 4.0, which prohibits commercial use.
Please verify the license of each optimizer before production or commercial use.
- Contributing guide: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
Please cite original optimizer authors when you use specific algorithms. If you use this repository, you can use the citation metadata in CITATION.cff or GitHub's "Cite this repository".
```bibtex
@software{Kim_pytorch_optimizer_optimizer_2021,
    author = {Kim, Hyeongchan},
    title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
    url = {https://github.com/kozistr/pytorch_optimizer},
    year = {2021}
}
```

Hyeongchan Kim / @kozistr