<!DOCTYPE html>
<html>
<head>
<title>Optimization for deep learning [Marc Lelarge]</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" href="./assets/katex.min.css">
<link rel="stylesheet" type="text/css" href="./assets/slides.css">
<link rel="stylesheet" type="text/css" href="./assets/grid.css">
</head>
<body>
<textarea id="source">
class: center, middle, title-slide
count: false
# Module 4:
## Optimization for deep learning
<br/><br/>
.bold[Marc Lelarge]
---
# (1) Optimization and deep learning
## Gradient descent, stochastic gradient descent and mini-batch SGD
# (2) Gradient descent optimization algorithms
## Momentum, Nesterov accelerated gradient, Adagrad, RMSProp, Adam, AMSGrad
# (3) PyTorch optimizers
---
# (1) Some warnings about optimization in deep learning
The objective function of an optimization algorithm is usually a loss function based on the training data set, hence the goal of optimization is to reduce the _training error_.
--
count: false
However, the goal of (deep) learning is to reduce the _generalization error_.
--
count: false
In order to reduce the generalization error, we need to pay attention to _overfitting_ in addition to using the optimization algorithm to reduce the training error.
--
count: false
In this course, we focus specifically on the _performance_ of the optimization algorithm in minimizing the objective function, rather than the model’s generalization error.
In the next lessons, we will see techniques to avoid _overfitting_.
--
count: false
No theorem in this lecture (for theorems, see a convex optimization course). We focus on intuitions because deep learning optimization problems are not convex and we hope our _intuition_ built on convex problems will be useful!
---
# (1) What do we optimize?
Recall the simple linear regression where the goal is to minimize the .bold[cost function]:
$$
J(\theta) = \frac{1}{2}\sum\_{i=1}^m(y(i)-\theta^T x(i))^2.
$$
--
count: false
For a deep neural network $F(.;\theta)$ with parameters $\theta$, we minimize the objective function:
$$
J(\theta) = \frac{1}{2}\sum\_{i=1}^m(y(i)-F(x(i);\theta))^2,
$$
in the parameter $\theta$ on the _training set_.
This problem is not convex anymore, but we can still compute the gradient of the cost function $\nabla J(\theta)$ with respect to the parameters (well, PyTorch does it for us!).
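As a minimal illustration of autograd (a hedged sketch with made-up tensors, not code from the course):
```py
import torch

theta = torch.randn(3, requires_grad=True)  # parameters theta
x, y = torch.randn(3), torch.tensor(1.0)    # one (illustrative) training pair
loss = 0.5 * (y - theta @ x) ** 2           # cost on that single pair
loss.backward()                             # autograd computes the gradient
print(theta.grad)                           # d loss / d theta
```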
---
# (1) Gradient descent variants
Idea: update the parameters in the opposite direction of the gradient.
The _learning rate_ $\eta$ determines the size of the steps.
--
count: false
## Batch gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t)
$$
--
count: false
## Stochastic gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t;x(i),y(i))
$$
--
count: false
## Mini-batch gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t;x(i:i+b),y(i:i+b)),
$$
where $b$ is the batch size.
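The three variants differ only in how many samples enter each gradient; a minimal sketch (assuming `theta` is a NumPy array and `J_grad(theta, batch)` is a hypothetical helper returning the gradient on the given samples):
```py
import random

def run_epoch(theta, data, eta, b):
    random.shuffle(data)
    for i in range(0, len(data), b):
        # b = 1 gives SGD, b = len(data) gives batch gradient descent,
        # anything in between is mini-batch gradient descent.
        # J_grad is a hypothetical helper computing the mini-batch gradient.
        theta = theta - eta * J_grad(theta, data[i:i + b])
    return theta
```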
---
# (1) Challenges
Mini-batch gradient descent is the algorithm of choice when training a neural network. The term SGD is usually employed even when mini-batches are used!
--
count: false
- Choosing a proper learning rate can be difficult. How to adapt the learning rate during training?
- Why apply the same learning rate to all parameter updates?
- How to escape saddle points, where the gradient is close to zero in all dimensions?
.center[
<img src="images/module4/Saddle_point.png" style="width: 400px;" />
]
---
count:false
# (1) Challenges
Mini-batch gradient descent is the algorithm of choice when training a neural network. The term SGD is usually employed even when mini-batches are used!
- Choosing a proper learning rate can be difficult. How to adapt the learning rate during training?
- Why apply the same learning rate to all parameter updates?
- How to escape saddle points, where the gradient is close to zero in all dimensions?
In the rest of the lecture, we will introduce _modifications to SGD_.
Although we will apply these modifications to mini-batch gradient descent, we omit the explicit reference to the batch $x(i:i+b),y(i:i+b)$. Hence we write for SGD:
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t),
$$
but we need to keep in mind that updates are performed for every mini-batch of $b$ training samples.
---
# (2) Gradient descent optimization algorithms
A nice [survey](http://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder
Please run the [jupyter notebook](https://colab.research.google.com/github/dataflowr/notebooks/blob/master/Module4/04_gradient_descent_optimization_algorithms_empty.ipynb) in parallel as you will need to code the optimization algorithms as soon as we see them.
We will deal with the following toy problem: minimize in $x_1, x_2$ the function:
$$
f(x\_1, x\_2) = 0.1 x\_1^2+2 x\_2^2.
$$
--
count: false
You will need to implement all variations of SGD.
Have a look at the `train_2d()` function and the Gradient descent code to understand what this function does.
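As a warm-up, here is a minimal sketch of plain gradient descent on this function (illustrative names, not the notebook's exact `train_2d()` API):
```py
def grad_f(x1, x2):
    # gradient of f(x1, x2) = 0.1 * x1**2 + 2 * x2**2
    return 0.2 * x1, 4 * x2

eta = 0.4
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    x1, x2 = x1 - eta * g1, x2 - eta * g2  # step against the gradient
```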
---
# (2) Momentum
Accelerating SGD by damping oscillations, i.e. by averaging the most recent values of the gradient.
--
count: false
$$\begin{aligned}
v\_{t+1} &= \gamma v\_{t}+ \eta \nabla J(\theta\_t)\\\\
\theta\_{t+1} &= \theta\_t - v\_{t+1}
\end{aligned}$$
--
count: false
Why does it work?
With $g\_t = \nabla J(\theta\_t)$, we have for any $k\geq 0$:
$$
v\_{t+1} = \gamma^{k+1} v\_{t-k} +\eta \underbrace{\sum\_{i=0}^k \gamma^i g\_{t-i}}_{\text{average of last gradients}}
$$
Typical value: $\gamma = 0.9$.
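On the toy problem, the update becomes (a sketch; one velocity component per coordinate, `grad_f` as above):
```py
v1, v2 = 0.0, 0.0
eta, gamma = 0.4, 0.9
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    v1, v2 = gamma * v1 + eta * g1, gamma * v2 + eta * g2  # accumulate velocity
    x1, x2 = x1 - v1, x2 - v2                              # step along it
```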
---
# (2) Nesterov accelerated gradient
With momentum, we compute the gradient at the current position and then make a step combining the momentum and this gradient. Nesterov proposed to first make the step following the momentum, and then to adjust it with the gradient computed at this look-ahead position:
$$\begin{aligned}
v\_{t+1} &= \gamma v\_{t}+ \eta \nabla J(\theta\_t-\gamma v\_{t})\\\\
\theta\_{t+1} &= \theta\_t - v\_{t+1}
\end{aligned}$$
.center[
<img src="images/module4/nesterov_update_vector.png" style="width: 400px;" />
]
Source for the image: [G. Hinton's lecture 6c](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
.citation[Nesterov, A method for solving the convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk SSSR 1983]
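In code, the only change from momentum is where the gradient is evaluated (a sketch on the toy problem, `grad_f` as above):
```py
v1, v2 = 0.0, 0.0
eta, gamma = 0.4, 0.9
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1 - gamma * v1, x2 - gamma * v2)  # gradient at look-ahead point
    v1, v2 = gamma * v1 + eta * g1, gamma * v2 + eta * g2
    x1, x2 = x1 - v1, x2 - v2
```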
---
# (2) Adagrad
We would like to adapt our updates to each individual parameter, i.e. have a different decreasing learning rate for each parameter.
$$\begin{aligned}
s\_{t+1,i} &= s\_{t,i} + \nabla J(\theta\_t)\_i^2\\\\
\theta\_{t+1,i} &= \theta\_{t,i} - \frac{\eta}{\sqrt{s\_{t+1,i}+\epsilon}}\nabla J(\theta\_t)\_i
\end{aligned}$$
--
count: false
No manual tuning of the learning rate.
Typical default values: $\eta=0.01$ and $\epsilon = 10^{-8}$.
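On the toy problem (a sketch, `grad_f` as above; note how each coordinate gets its own effective learning rate):
```py
import math

s1, s2 = 0.0, 0.0
eta, eps = 0.4, 1e-8
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    s1, s2 = s1 + g1 ** 2, s2 + g2 ** 2    # accumulated squared gradients
    x1 -= eta / math.sqrt(s1 + eps) * g1   # per-coordinate step size
    x2 -= eta / math.sqrt(s2 + eps) * g2
```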
.citation[Duchi et al., [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](http://jmlr.org/papers/v12/duchi11a.html), JMLR 2011]
---
# (2) RMSProp
Problem with Adagrad: since the sum of squared gradients only grows, the learning rate goes to zero, and the algorithm never forgets about the past.
--
count: false
Idea proposed by G. Hinton in his Coursera class: use an exponential moving average.
$$\begin{aligned}
s\_{t+1,i} &= \gamma s\_{t,i} + (1-\gamma) \nabla J(\theta\_t)\_i^2\\\\
\theta\_{t+1,i} &= \theta\_{t,i} - \frac{\eta}{\sqrt{s\_{t+1,i}+\epsilon}}\nabla J(\theta\_t)\_i
\end{aligned}$$
--
count: false
With a slight abuse of notation, we re-write the update as follows:
$$\begin{aligned}
s\_{t+1} &= \gamma s\_{t} + (1-\gamma) \nabla J(\theta\_t)^2\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{s\_{t+1}+\epsilon}}\nabla J(\theta\_t)
\end{aligned}$$
Typical values: $\gamma = 0.9$ and $\eta = 0.001$.
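The sketch differs from Adagrad only in how $s$ is accumulated:
```py
import math

s1, s2 = 0.0, 0.0
eta, gamma, eps = 0.4, 0.9, 1e-8
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2  # exponential moving average:
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2  # old gradients are forgotten
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
```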
.citation[Hinton [Coursera lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) ]
---
# (2) Adam
Mixing ideas from RMSProp and momentum, we get Adam = Adaptive Moment Estimation.
$$\begin{aligned}
m\_{t+1} &= \beta\_1 m\_t + (1-\beta\_1) \nabla J(\theta\_t)\\\\
v\_{t+1} &= \beta\_2 v\_t + (1-\beta\_2) \nabla J(\theta\_t)^2\\\\
\hat{m}\_{t+1} &= \frac{m\_{t+1}}{1-\beta\_1^{t+1}}\\\\
\hat{v}\_{t+1} &= \frac{v\_{t+1}}{1-\beta\_2^{t+1}}\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}\_{t+1}}+\epsilon} \hat{m}\_{t+1}
\end{aligned}$$
$\hat{m}\_t$ and $\hat{v}\_t$ are estimates for the first and second moments of the gradients. Because $m_0=v_0=0$, these estimates are biased towards $0$; the factors $(1-\beta\_1^{t+1})^{-1}$ and $(1-\beta\_2^{t+1})^{-1}$ are here to counteract these biases.
--
count: false
Typical values: $\beta_1=0.9$, $\beta_2 =0.999$ and $\epsilon=10^{-8}$.
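A sketch on the toy problem (with $t$ starting at $1$ so the bias corrections match the update above):
```py
import math

m1 = m2 = v1 = v2 = 0.0
eta, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
x1, x2 = -5.0, -2.0
for t in range(1, 21):
    g1, g2 = grad_f(x1, x2)
    m1, m2 = b1 * m1 + (1 - b1) * g1, b1 * m2 + (1 - b1) * g2            # first moment
    v1, v2 = b2 * v1 + (1 - b2) * g1 ** 2, b2 * v2 + (1 - b2) * g2 ** 2  # second moment
    mh1, mh2 = m1 / (1 - b1 ** t), m2 / (1 - b1 ** t)  # bias-corrected estimates
    vh1, vh2 = v1 / (1 - b2 ** t), v2 / (1 - b2 ** t)
    x1 -= eta / (math.sqrt(vh1) + eps) * mh1
    x2 -= eta / (math.sqrt(vh2) + eps) * mh2
```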
.citation[Kingma et al. , [Adam: a Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980), ICLR 2015]
---
# (2) AMSGrad
Sometimes Adam forgets too fast; to fix this, we replace the exponential moving average of the second moment by a running $\max$:
$$\begin{aligned}
m\_{t+1} &= \beta\_1 m\_t + (1-\beta\_1) \nabla J(\theta\_t)\\\\
v\_{t+1} &= \beta\_2 v\_t + (1-\beta\_2) \nabla J(\theta\_t)^2\\\\
\hat{v}\_{t+1} &= \max\left(\hat{v}\_{t}, v\_{t+1}\right)\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}\_{t+1}}+\epsilon} m\_{t+1}
\end{aligned}$$
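The sketch keeps Adam's moments but takes a running max (note that the update above uses $m\_{t+1}$ without bias correction):
```py
import math

m1 = m2 = v1 = v2 = vh1 = vh2 = 0.0
eta, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    m1, m2 = b1 * m1 + (1 - b1) * g1, b1 * m2 + (1 - b1) * g2
    v1, v2 = b2 * v1 + (1 - b2) * g1 ** 2, b2 * v2 + (1 - b2) * g2 ** 2
    vh1, vh2 = max(vh1, v1), max(vh2, v2)  # running max: step sizes can only shrink
    x1 -= eta / (math.sqrt(vh1) + eps) * m1
    x2 -= eta / (math.sqrt(vh2) + eps) * m2
```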
.citation[Reddi et al.,[On the Convergence of Adam and Beyond](https://openreview.net/forum?id=ryQu7f-RZ) ICLR 2018]
---
# (3) [PyTorch optimizers](https://pytorch.org/docs/stable/optim.html)
All have a similar constructor `torch.optim.*(params, lr=..., momentum=...)`.
Default values differ from one optimizer to another; check the doc.
`params` should be an iterable (like a list) containing the parameters to optimize over.
It can be obtained from any module with `module.parameters()`.
The `step` method updates the internal state of the optimizer according to the `grad` attributes of the `params`, and updates the latter according to the internal state.
```py
criterion = nn.NLLLoss()
optimizer_vgg = torch.optim.SGD(model_vgg.classifier[6].parameters(), lr=0.001)

def train_model(model, dataloader, size, epochs=1, optimizer=None):
    model.train()
    for epoch in range(epochs):
        for inputs, classes in dataloader:
            inputs = inputs.to(device)
            classes = classes.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, classes)
            optimizer.zero_grad()  # clear gradients from the previous step
            loss.backward()        # fill the .grad attribute of each parameter
            optimizer.step()       # update the parameters from .grad
```
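Switching optimizers is then a one-line change; for instance (a sketch, with hyper-parameters chosen for illustration):
```py
# Adam with the typical values from the slides
optimizer_vgg = torch.optim.Adam(model_vgg.classifier[6].parameters(), lr=0.001)
# or SGD with momentum and Nesterov's correction
optimizer_vgg = torch.optim.SGD(model_vgg.classifier[6].parameters(),
                                lr=0.001, momentum=0.9, nesterov=True)
```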
---
class: end-slide, center
count: false
The end.
</textarea>
<script src="./assets/remark-latest.min.js"></script>
<script src="./assets/auto-render.min.js"></script>
<script src="./assets/katex.min.js"></script>
<script type="text/javascript">
function getParameterByName(name, url) {
if (!url) url = window.location.href;
name = name.replace(/[\[\]]/g, "\\$&");
var regex = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)"),
results = regex.exec(url);
if (!results) return null;
if (!results[2]) return '';
return decodeURIComponent(results[2].replace(/\+/g, " "));
}
//var options = {sourceUrl: getParameterByName("p"),
// highlightLanguage: "python",
// // highlightStyle: "tomorrow",
// // highlightStyle: "default",
// highlightStyle: "github",
// // highlightStyle: "googlecode",
// // highlightStyle: "zenburn",
// highlightSpans: true,
// highlightLines: true,
// ratio: "16:9"};
var options = {sourceUrl: getParameterByName("p"),
highlightLanguage: "python",
highlightStyle: "github",
highlightSpans: true,
highlightLines: true,
ratio: "16:9",
slideNumberFormat: (current, total) => `
<div class="progress-bar-container">${current}/${total} <br/><br/>
<div class="progress-bar" style="width: ${current/total*100}%"></div>
</div>
`};
var renderMath = function() {
renderMathInElement(document.body, {delimiters: [ // mind the order of delimiters(!?)
{left: "$$", right: "$$", display: true},
{left: "$", right: "$", display: false},
{left: "\\[", right: "\\]", display: true},
{left: "\\(", right: "\\)", display: false},
]});
}
var slideshow = remark.create(options, renderMath);
</script>
</body>
</html>