Stochastic Average Gradient (SAG) & SAGA Solver Course

Variance Reduction Optimization for Large-Scale Machine Learning



🚀 About This Repository

This project provides a mathematically rigorous explanation of:

  • Stochastic Average Gradient (SAG) algorithm
  • SAGA optimization algorithm
  • Variance reduction methods
  • Large-scale convex optimization
  • Sparse machine learning solvers
  • Regularized regression optimization

This repository is designed for:

  • Data Scientists
  • Machine Learning Engineers
  • Optimization Researchers
  • Students studying empirical risk minimization

🔎 Keywords (for GitHub search & Google indexing)

stochastic average gradient, sag algorithm, saga optimization, variance reduction methods, convex optimization solver, sparse machine learning optimization, large scale optimization, ridge regression solver, lasso regression solver, empirical risk minimization algorithm


📚 Empirical Risk Minimization (ERM)

We solve the optimization problem:

$$ \min_{\theta \in \mathbb{R}^d} F(\theta), \qquad F(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta) $$

Where:

  • $$\theta$$ β€” model parameters
  • $$n$$ β€” number of samples
  • $$f_i(\theta)$$ β€” loss of sample $$i$$

Example (Ridge regression):

$$ f_i(\theta) = \frac{1}{2}(x_i^T\theta - y_i)^2 + \frac{\lambda}{2}\|\theta\|^2 $$
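
Differentiating this loss gives the per-sample gradient used by every solver below:

$$ \nabla f_i(\theta) = x_i \, (x_i^T\theta - y_i) + \lambda\theta $$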


🔵 Stochastic Gradient Descent (Baseline)

Standard SGD update:

$$ \theta_{k+1} = \theta_k - \eta \nabla f_{i_k}(\theta_k) $$

Issues:

  • High gradient variance
  • Slow convergence
  • Sublinear rate
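
For concreteness, a minimal SGD loop for the ridge objective above (a sketch; the function and parameter names are illustrative, not part of this repository):

import numpy as np

def sgd_ridge(X, y, lr=0.01, lam=0.1, epochs=10, seed=0):
    """Plain SGD on the ridge objective: one random sample per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs * n):
        i = rng.integers(n)                                 # sample index i_k
        grad = X[i] * (X[i] @ theta - y[i]) + lam * theta   # grad of f_{i_k}
        theta -= lr * grad                                  # descent step
    return theta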

🔵 Stochastic Average Gradient (SAG)

SAG keeps the most recently computed gradient of every sample in memory.

Let:

$$ g_i^k = \text{stored gradient for sample } i $$

Update rule:

$$ \theta_{k+1} = \theta_k - \frac{\eta}{n} \sum_{i=1}^{n} g_i^k $$

At each iteration:

  1. Sample an index $$i_k$$ uniformly at random
  2. Compute the fresh gradient $$\nabla f_{i_k}(\theta_k)$$
  3. Replace the stored gradient $$g_{i_k}^k$$ with it, then step using the average above

Key property:

$$ \frac{1}{n}\sum g_i^k \approx \nabla F(\theta_k) $$

Variance decreases over time.
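
A minimal sketch of one SAG iteration, assuming a hypothetical per-sample gradient callback grad_fn(theta, i) and a running sum so each step costs O(d) rather than O(nd):

import numpy as np

def sag_step(theta, memory, grad_sum, grad_fn, lr, rng):
    """One SAG iteration: refresh one stored gradient, step with the average."""
    n = memory.shape[0]
    i = rng.integers(n)
    g_new = grad_fn(theta, i)
    grad_sum += g_new - memory[i]   # maintain the sum of stored gradients in O(d)
    memory[i] = g_new
    return theta - lr * grad_sum / n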


🔵 SAGA Algorithm (Unbiased Variant)

SAGA improves on SAG by removing the bias of the gradient estimate:

$$ \theta_{k+1} = \theta_k - \eta \left( \nabla f_{i_k}(\theta_k) - g_{i_k}^k + \frac{1}{n} \sum_{i=1}^{n} g_i^k \right) $$

Properties:

  • Unbiased gradient estimator
  • Supports composite objectives
  • Works with L1 regularization (via a proximal step; see the sketch below)
  • Linear convergence for strongly convex problems
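
For composite objectives $$F(\theta) + \lambda \|\theta\|_1$$, each SAGA step is followed by the proximal operator of the L1 term, i.e. soft-thresholding. A minimal sketch (variable names are illustrative):

import numpy as np

def soft_threshold(theta, t):
    """Proximal operator of t * ||theta||_1 (soft-thresholding)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

# Proximal SAGA step: gradient step on the smooth part, then prox:
#   theta = soft_threshold(theta - lr * g_saga, lr * lam)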

🧠 Convergence Theory

Assume:

  • $$F(\theta)$$ is $$\mu$$-strongly convex
  • Each gradient $$\nabla f_i$$ is $$L$$-Lipschitz continuous

Condition number:

$$ \kappa = \frac{L}{\mu} $$

Then SAG / SAGA reach an $$\epsilon$$-accurate solution in

$$ \mathcal{O} \left( (n + \kappa) \log\frac{1}{\epsilon} \right) $$

gradient evaluations, compared to SGD's

$$ \mathcal{O} \left( \frac{1}{\epsilon} \right) $$

This explains why variance reduction methods dominate in large-scale convex optimization.
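
To get a rough sense of scale (constants ignored), plugging in illustrative values $$n = 10^6$$, $$\kappa = 10^4$$, $$\epsilon = 10^{-8}$$:

import math

n, kappa, eps = 1e6, 1e4, 1e-8
print(f"SAG/SAGA: ~{(n + kappa) * math.log(1 / eps):.2e} gradient evaluations")
print(f"SGD:      ~{1 / eps:.2e} gradient evaluations")
# SAG/SAGA: ~1.86e+07 vs SGD: ~1.00e+08 -- roughly 5x fewer here,
# and the gap widens as eps shrinks.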


📉 Why Variance Reduction Works

SGD's single-sample gradient has variance

$$ \mathrm{Var}\left(\nabla f_{i_k}(\theta_k)\right) $$

which stays bounded away from zero even near the optimum, forcing a decaying step size.

SAG/SAGA reduce variance because:

$$ \lim_{k \to \infty} \mathrm{Var} \left( \frac{1}{n}\sum_{i=1}^{n} g_i^k \right) = 0 $$

This is what permits a constant step size and yields a linear convergence rate.
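
A quick empirical check of this effect on synthetic least-squares data, with the gradient memory taken near the current iterate (all names and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad_i(theta, i):
    return X[i] * (X[i] @ theta - y[i])

theta = rng.standard_normal(d)
sgd_est = np.array([grad_i(theta, i) for i in range(n)])   # all possible SGD draws

# SAGA estimator with memory computed at a nearby reference point:
theta_ref = theta + 0.01 * rng.standard_normal(d)
memory = np.array([grad_i(theta_ref, i) for i in range(n)])
saga_est = sgd_est - memory + memory.mean(axis=0)

print("SGD  estimator variance:", sgd_est.var(axis=0).sum())
print("SAGA estimator variance:", saga_est.var(axis=0).sum())
# Both estimators have the same mean (the full gradient), but the SAGA
# variance shrinks as the memory tracks the current gradients.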


πŸ— Project Structure

stochastic-average-gradient-sag-solver-course/
│
├── README.md
├── LICENSE
├── requirements.txt
│
├── src/
│   ├── sag.py
│   ├── saga.py
│   └── loss_functions.py
│
├── examples/
│   └── demo.py
│
├── docs/
│   ├── theory.md
│   └── convergence.md
│
└── index.html

Clean structure improves:

  • Academic credibility
  • Search visibility
  • Portfolio quality

🐍 Minimal SAGA Implementation (Educational)

import numpy as np

class SAGA:
    """Minimal SAGA solver for least-squares losses f_i = 0.5 * (x_i^T theta - y_i)^2."""

    def __init__(self, X, y, lr=0.01):
        self.X = X
        self.y = y
        self.lr = lr
        self.n, self.d = X.shape
        self.theta = np.zeros(self.d)
        # One stored gradient per sample, plus their running average.
        self.grad_memory = np.zeros((self.n, self.d))
        self.grad_avg = np.zeros(self.d)

    def step(self):
        i = np.random.randint(0, self.n)
        grad = self.compute_gradient(i)

        # Unbiased SAGA direction: fresh gradient, minus the stale stored
        # gradient for sample i, plus the average of all stored gradients.
        self.theta -= self.lr * (
            grad
            - self.grad_memory[i]
            + self.grad_avg
        )

        # Refresh the memory and keep the running average consistent in O(d).
        self.grad_avg += (grad - self.grad_memory[i]) / self.n
        self.grad_memory[i] = grad

    def compute_gradient(self, i):
        # Gradient of the squared-error loss for sample i.
        xi = self.X[i]
        yi = self.y[i]
        return xi * (xi @ self.theta - yi)
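
A quick usage sketch on synthetic data (the values are illustrative):

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
theta_true = rng.standard_normal(10)
y = X @ theta_true

solver = SAGA(X, y, lr=0.01)
for _ in range(20 * 500):   # roughly 20 effective passes over the data
    solver.step()

print("parameter error:", np.linalg.norm(solver.theta - theta_true))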

🚀 Installation

pip install -r requirements.txt

Run demo:

python examples/demo.py

📌 Applications

  • Logistic regression optimization
  • Ridge regression solver
  • Lasso regression solver
  • Sparse machine learning models
  • Large-scale convex optimization
  • Industrial ML systems

📖 Related Topics

  • Stochastic Gradient Descent (SGD)
  • SVRG
  • L-BFGS
  • Conjugate Gradient
  • Variance Reduction Methods
  • Convex Optimization

If you are studying stochastic optimization algorithms for machine learning, this repository provides both the theoretical foundations and practical implementations of the SAG and SAGA solvers (the same algorithms behind solver="sag" and solver="saga" in scikit-learn's linear models).

⭐ Star the repository if it helps your learning or research.
